Continuous Batching and KV Caching: Maximizing LLM Throughput
If you've ever deployed a Large Language Model (LLM), you know the pain of watching your expensive GPU sit idle while it waits for a single long response to finish. It's a waste of money and compute. The core problem is that LLMs generate text one token at a time, and with old-school batching, your fastest requests are held hostage by your slowest ones. But there is a way to fix this. By combining Continuous Batching, an iteration-level scheduling technique that lets new requests enter a batch as soon as others finish rather than waiting for the entire group to complete, with KV Caching, you can push your throughput from a trickle to a flood.
The Memory Trick: How KV Caching Stops Redundant Work
To understand why we need these optimizations, we first have to look at how the Transformer architecture actually works. Every time an LLM predicts the next word, it doesn't just look at the last word; it looks at every single word that came before it. Without a cache, the model would have to re-calculate the mathematical representations (the Keys and Values) for every previous token in the sequence, every single time a new token is generated. This creates a computational nightmare where the cost grows quadratically as the sentence gets longer.
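The savings are easy to quantify with a toy count. The sketch below is just arithmetic over the reasoning above: it counts how many per-token Key/Value projections are performed while generating a sequence, with and without a cache (the function name is illustrative, not from any library):

```python
def kv_computations(seq_len: int, cached: bool) -> int:
    """Count per-token K/V projections performed while generating
    seq_len tokens, with and without a KV cache."""
    if cached:
        # Each step computes K/V for exactly one new token.
        return seq_len
    # Each step re-computes K/V for every token seen so far:
    # 1 + 2 + ... + seq_len = seq_len * (seq_len + 1) / 2.
    return seq_len * (seq_len + 1) // 2

print(kv_computations(1000, cached=False))  # 500500 projections
print(kv_computations(1000, cached=True))   # 1000 projections
```

At 1,000 tokens, the cache avoids roughly 500x the projection work, and the gap widens quadratically from there.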
KV Caching solves this by storing those calculated Key (K) and Value (V) vectors in GPU memory. Instead of re-computing the past, the model simply looks up the cached values. This drops the cost of generating each new token from $O(n^2)$ down to $O(n)$, which is a massive win for speed. However, this comes with a trade-off: memory. Every token stored costs roughly $2 \times L \times H \times A \times s$ bytes of memory (where $L$ is the number of layers, $H$ is the number of attention heads, $A$ is the head dimension, $s$ is the bytes per element, and the leading 2 covers both K and V). In a production environment, this memory pressure is the primary bottleneck that can lead to crashes or slow response times if not managed correctly.
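To see why memory becomes the bottleneck, here is a rough sizing sketch (the helper is my own, and the model shape, 32 layers, 32 heads, head dimension 128, fp16, is a Llama-2-7B-like configuration used purely as an illustration):

```python
def kv_cache_bytes_per_token(layers: int, heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    """Per-token KV cache size: 2 (K and V) x layers x heads x
    head_dim x bytes per element (2 bytes for fp16/bf16)."""
    return 2 * layers * heads * head_dim * dtype_bytes

# Llama-2-7B-like shape: 32 layers, 32 heads, head_dim 128, fp16.
per_token = kv_cache_bytes_per_token(32, 32, 128, 2)
print(per_token)                  # 524288 bytes (~512 KiB per token)
print(per_token * 4096 / 2**30)   # ~2.0 GiB for one 4096-token context
```

Half a megabyte per token means a single full-length context eats about 2 GiB, and every concurrent request in the batch carries its own copy.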
Static Batching vs. Continuous Batching
In the early days of LLM deployment, we used static batching. Imagine a bus that only leaves the station once every seat is filled, and it won't let anyone off until every single passenger has reached their final destination. If one person is going to the next town (a short 50-token response) and another is going across the country (a 2,000-token response), the person going to the next town just has to sit on the bus for the entire trip. In GPU terms, this means your processing cores sit idle, wasting cycles while waiting for the longest sequence in the batch to finish.
Continuous Batching (also known as in-flight batching) changes the rules. It operates at the token level. As soon as a request hits its end-of-sequence token, it is evicted from the batch, and a new request is slotted in immediately. There is no "waiting for the batch to finish." This keeps the GPU saturated and ensures that throughput stays high regardless of how much the output lengths vary between users.
| Feature | Static Batching | Continuous Batching |
|---|---|---|
| Scheduling Level | Request Level | Token/Iteration Level |
| GPU Utilization | Low (waits for longest sequence) | High (immediate replacement) |
| Throughput | Baseline | Up to 20x improvement |
| Memory Management | Fixed pre-allocation | Dynamic/Paged allocation |
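The scheduling difference in the table can be simulated in a few lines. This toy model (it ignores prefill, arrival times, and memory, and the numbers are illustrative) counts GPU iterations needed to finish a mixed workload under each policy:

```python
from collections import deque

def simulate(requests, max_batch, continuous: bool) -> int:
    """Count decoding iterations to finish all requests, where each
    request is an output length in tokens."""
    queue = deque(requests)
    steps = 0
    if continuous:
        # Iteration-level scheduling: refill a slot as soon as one frees.
        active = [queue.popleft() for _ in range(min(max_batch, len(queue)))]
        while active:
            steps += 1
            active = [r - 1 for r in active if r > 1]  # evict finished
            while queue and len(active) < max_batch:
                active.append(queue.popleft())         # admit new work
        return steps
    # Static batching: the whole batch waits for its longest member.
    while queue:
        batch = [queue.popleft() for _ in range(min(max_batch, len(queue)))]
        steps += max(batch)
    return steps

mixed = [50, 2000, 60, 40]   # one long request among three short ones
print(simulate(mixed, max_batch=2, continuous=False))  # 2060 iterations
print(simulate(mixed, max_batch=2, continuous=True))   # 2000 iterations
```

Even in this tiny example the static scheduler burns 60 extra iterations, because the short requests in the first batch sit finished while the 2,000-token passenger rides to the end of the line; with more short requests queued behind a long one, the gap grows dramatically.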
Solving Fragmentation with PagedAttention
Even with continuous batching, we hit a wall: memory fragmentation. Standard memory allocation requires contiguous blocks of space. If the system reserves space for 2,048 tokens but the user only writes a 10-token prompt, that extra space is wasted. This "internal fragmentation" means you can't fit as many requests into a batch as you theoretically should.
Enter PagedAttention. This technique, popularized by the vLLM library, treats GPU memory like a virtual operating system treats RAM. It breaks the KV cache into fixed-size pages. These pages don't need to be next to each other in memory, allowing the system to allocate slots on demand. By eliminating the need to reserve massive, contiguous blocks of memory in advance, PagedAttention allows for much larger batch sizes and significantly higher throughput.
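A minimal sketch of the block-table idea (a toy model of the concept, not vLLM's actual implementation; class and method names are mine):

```python
class PagedKVCache:
    """Toy block-table allocator in the spirit of PagedAttention:
    each sequence maps logical positions to non-contiguous physical
    blocks, allocated only when the previous block fills up."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free physical block ids
        self.tables = {}                      # seq_id -> [physical ids]
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> int:
        """Reserve space for one more token; returns its physical block."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block is full
            table.append(self.free.pop())     # grab any free block
        self.lengths[seq_id] = n + 1
        return table[-1]

    def release(self, seq_id: int):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):                 # a 20-token sequence needs 2 blocks
    cache.append_token(seq_id=0)
print(len(cache.tables[0]))         # 2 blocks, not a 2,048-token slab
```

Because a sequence only ever holds the blocks it has actually filled, the waste per request shrinks from "maximum length minus actual length" to at most one partially filled block.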
Advanced Optimizations: Chunked Prefill and Prefix Sharing
When a new request enters a batch, it goes through a "prefill" phase where the model processes the initial prompt. If the prompt is huge, this prefill step can take a long time and hog all the GPU resources, causing a spike in latency for other requests currently in the decoding phase. Chunked Prefill solves this by breaking the initial prompt into smaller pieces, interleaving them with the decoding steps of other requests. This prevents the "stutter" that users feel when a massive prompt enters the system.
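One way to picture a chunked-prefill scheduler's inner loop: serve every in-flight decode its one token first, then spend whatever is left of a fixed per-iteration token budget on a slice of the pending prompt. The helper and the 512-token budget below are illustrative assumptions, not any framework's API:

```python
def schedule_step(prefill_tokens, decode_seqs, token_budget=512):
    """One toy scheduler iteration: decoding requests get one token
    each, and the remaining budget is spent on a prompt chunk."""
    budget = token_budget - len(decode_seqs)   # 1 token per decoding seq
    chunk = prefill_tokens[:max(budget, 0)]    # prefill only what fits
    remaining = prefill_tokens[len(chunk):]
    return chunk, remaining

prompt = list(range(2000))           # a 2,000-token prompt to prefill
decoding = ["req-a", "req-b"]        # two requests mid-generation
chunk, rest = schedule_step(prompt, decoding, token_budget=512)
print(len(chunk), len(rest))         # 510 1490
```

The big prompt now takes four iterations to prefill instead of one, but the two decoding requests never miss a step, which is exactly the latency "stutter" being traded away.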
Another clever trick is the use of prefix trees. In many real-world apps, users send prompts that start with the same instructions (e.g., "You are a helpful assistant that writes in the style of..."). Instead of calculating the KV cache for that same instruction for every single user, the system uses hashing to identify identical prefixes and stores them in a single shared block of memory. This is called prefix sharing, and it drastically reduces the memory footprint for common prompts.
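The hashing step can be sketched in a few lines. This is a toy cache that stores a placeholder per distinct prefix (real systems cache the actual K/V blocks, usually at block granularity; the class is mine):

```python
import hashlib

class PrefixCache:
    """Toy prefix cache: identical prompt prefixes hash to the same
    key, so their KV entry is computed and stored only once."""

    def __init__(self):
        self.blocks = {}          # prefix hash -> cached KV placeholder

    def get_or_compute(self, prefix_tokens: tuple):
        key = hashlib.sha256(repr(prefix_tokens).encode()).hexdigest()
        if key in self.blocks:
            return self.blocks[key], True      # cache hit: KV reused
        kv = f"kv-for-{key[:8]}"               # stand-in for real K/V
        self.blocks[key] = kv
        return kv, False

cache = PrefixCache()
system = ("You", "are", "a", "helpful", "assistant")
_, hit1 = cache.get_or_compute(system)   # first user: compute
_, hit2 = cache.get_or_compute(system)   # second user: reuse
print(hit1, hit2)                        # False True
```

With a thousand users sharing one system prompt, the prefix's KV memory is paid once instead of a thousand times, and each of those users also skips the prefill compute for that span.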
The Real-World Performance Impact
Does this actually work in production? The data says yes. Implementations like TensorRT-LLM and vLLM have shown that moving from static to continuous batching can result in 10-20x throughput gains. Some independent benchmarks from Anyscale have even reported improvements up to 23x. For a company running a cluster of H100s, this isn't just a technical win; it's a massive cost saving.
Recent research, such as the BatchLLM paper, has pushed this further by prioritizing requests with larger decoding-to-prefill ratios. By reordering requests and using memory-centric batching (batching by memory usage rather than just the number of requests), these systems can maximize the number of tokens processed per second even under extreme memory pressure.
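One way to picture the reordering idea: rank requests by decode-to-prefill ratio, then admit them against a memory budget rather than a request count. This is a simplified sketch of the concept, not the paper's actual algorithm; memory is estimated in tokens for brevity:

```python
def build_batch(requests, memory_budget):
    """Toy memory-centric batch builder. Each request is a tuple
    (request_id, prefill_tokens, expected_decode_tokens); requests
    with the largest decode-to-prefill ratio are admitted first,
    until the estimated KV budget (in tokens) is exhausted."""
    ranked = sorted(requests,
                    key=lambda r: r[2] / r[1],   # decode / prefill
                    reverse=True)
    batch, used = [], 0
    for rid, prefill, decode in ranked:
        cost = prefill + decode                  # peak KV tokens held
        if used + cost <= memory_budget:
            batch.append(rid)
            used += cost
    return batch

reqs = [("a", 1000, 100), ("b", 50, 400), ("c", 200, 300)]
print(build_batch(reqs, memory_budget=1000))     # ['b', 'c']
```

Batching by request count would happily pair the heavy request "a" with anything; batching by estimated memory keeps the batch inside what the KV cache can actually hold.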
Choosing the Right Infrastructure
If you're building your own inference stack, you have a choice. You can use open-source tools like vLLM, which give you a great balance of PagedAttention and continuous batching out of the box. Or, you can go for deeper hardware integration with TensorRT-LLM for maximum NVIDIA-specific performance. If you use a managed provider, you should ask them specifically how they handle KV cache management and whether they support chunked prefill. If they don't, you'll likely see higher Time To First Token (TTFT) as your prompt lengths increase.
What is the main difference between static and continuous batching?
Static batching waits for all requests in a group to finish before starting a new group, meaning a short request is delayed by a long one. Continuous batching allows requests to enter and leave the batch at any iteration, ensuring the GPU is always working on the next available token.
Does KV caching slow down the model?
No, it significantly speeds it up by eliminating the need to re-compute previous tokens. The only "downside" is that it consumes GPU memory, which can limit the total number of concurrent requests you can handle.
How does PagedAttention reduce memory waste?
PagedAttention stops the system from reserving a large, contiguous block of memory for the maximum possible sequence length. Instead, it allocates small, non-contiguous pages of memory only when needed, which drastically reduces internal fragmentation.
What is Time To First Token (TTFT) and how does it relate to batching?
TTFT is the time it takes for the model to generate the very first token after a prompt is submitted. Heavy prefill loads in a batch can increase TTFT. Techniques like chunked prefill help smooth this out so users don't experience long pauses before the text starts appearing.
Can I use continuous batching with any LLM?
Continuous batching is a property of the inference server (like vLLM or TensorRT-LLM), not the model itself. As long as the model uses a Transformer-based architecture (which almost all modern LLMs do), it can benefit from these serving optimizations.
Next Steps for Optimization
If you're seeing high latency but low GPU utilization, your first move should be implementing a serving framework that supports PagedAttention and continuous batching. Start by benchmarking your typical request lengths; if your users provide long prompts but short answers, prioritize chunked prefill. If you have a set of static "system prompts" that every user shares, implement prefix caching to save memory. Finally, monitor your Tokens Per Second (TPS) and adjust your batch size limits until you find the sweet spot where throughput is maximized without spiking the TTFT beyond your acceptable limit.
Susannah Greenwood
I'm a technical writer and AI content strategist based in Asheville, where I translate complex machine learning research into clear, useful stories for product teams and curious readers. I also consult on responsible AI guidelines and produce a weekly newsletter on practical AI workflows.