Continuous Batching and KV Caching: Maximizing LLM Throughput
If you've ever deployed a Large Language Model (LLM), you know the pain of watching your expensive GPU sit idle while it waits for a single long response to finish. It's a waste of money and compute. The core problem is that LLMs generate text one token at a time, and with old-school batching, your fastest requests are held hostage by your slowest ones. But there is a way to fix this. By combining Continuous Batching, an iteration-level scheduling technique that lets new requests enter a batch as soon as others finish rather than waiting for the entire group to complete, with KV Caching, you can push your throughput from a trickle to a flood.
The Memory Trick: How KV Caching Stops Redundant Work
To understand why we need these optimizations, we first have to look at how the Transformer architecture actually works. Every time an LLM predicts the next word, it doesn't just look at the last word; it looks at every single word that came before it. Without a cache, the model would have to re-calculate the mathematical representations (the Keys and Values) for every previous token in the sequence, every single time a new token is generated. This creates a computational nightmare where the cost grows quadratically as the sentence gets longer.
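The savings are easy to quantify with a toy count. The sketch below is just arithmetic over the reasoning above: it counts how many per-token Key/Value projections are performed while generating a sequence, with and without a cache (the function name is illustrative, not from any library):

```python
def kv_computations(seq_len: int, cached: bool) -> int:
    """Count per-token K/V projections performed while generating
    seq_len tokens, with and without a KV cache."""
    if cached:
        # Each step computes K/V for exactly one new token.
        return seq_len
    # Each step re-computes K/V for every token seen so far:
    # 1 + 2 + ... + seq_len = seq_len * (seq_len + 1) / 2.
    return seq_len * (seq_len + 1) // 2

print(kv_computations(1000, cached=False))  # 500500 projections
print(kv_computations(1000, cached=True))   # 1000 projections
```

At 1,000 tokens, the cache avoids roughly 500x the projection work, and the gap widens quadratically from there.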
KV Caching solves this by storing those calculated Key (K) and Value (V) vectors in GPU memory. Instead of re-computing the past, the model simply looks up the cached values. This drops the cost of generating each new token from $O(n^2)$ down to $O(n)$, which is a massive win for speed. However, this comes with a trade-off: memory. Every token stored costs roughly $2 \times L \times H \times A \times s$ bytes of memory (where $L$ is the number of layers, $H$ is the number of attention heads, $A$ is the head dimension, $s$ is the bytes per element, and the leading 2 covers both K and V). In a production environment, this memory pressure is the primary bottleneck that can lead to crashes or slow response times if not managed correctly.
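To see why memory becomes the bottleneck, here is a rough sizing sketch (the helper is my own, and the model shape, 32 layers, 32 heads, head dimension 128, fp16, is a Llama-2-7B-like configuration used purely as an illustration):

```python
def kv_cache_bytes_per_token(layers: int, heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    """Per-token KV cache size: 2 (K and V) x layers x heads x
    head_dim x bytes per element (2 bytes for fp16/bf16)."""
    return 2 * layers * heads * head_dim * dtype_bytes

# Llama-2-7B-like shape: 32 layers, 32 heads, head_dim 128, fp16.
per_token = kv_cache_bytes_per_token(32, 32, 128, 2)
print(per_token)                  # 524288 bytes (~512 KiB per token)
print(per_token * 4096 / 2**30)   # ~2.0 GiB for one 4096-token context
```

Half a megabyte per token means a single full-length context eats about 2 GiB, and every concurrent request in the batch carries its own copy.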
Static Batching vs. Continuous Batching
In the early days of LLM deployment, we used static batching. Imagine a bus that only leaves the station once every seat is filled, and it won't let anyone off until every single passenger has reached their final destination. If one person is going to the next town (a short 50-token response) and another is going across the country (a 2,000-token response), the person going to the next town just has to sit on the bus for the entire trip. In GPU terms, this means your processing cores sit idle, wasting cycles while waiting for the longest sequence in the batch to finish.
Continuous Batching (also known as in-flight batching) changes the rules. It operates at the token level. As soon as a request hits its end-of-sequence token, it is evicted from the batch, and a new request is slotted in immediately. There is no "waiting for the batch to finish." This keeps the GPU saturated and ensures that throughput stays high regardless of how much the output lengths vary between users.
| Feature | Static Batching | Continuous Batching |
|---|---|---|
| Scheduling Level | Request Level | Token/Iteration Level |
| GPU Utilization | Low (waits for longest sequence) | High (immediate replacement) |
| Throughput | Baseline | Up to 20x improvement |
| Memory Management | Fixed pre-allocation | Dynamic/Paged allocation |
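The scheduling difference in the table can be simulated in a few lines. This toy model (it ignores prefill, arrival times, and memory, and the numbers are illustrative) counts GPU iterations needed to finish a mixed workload under each policy:

```python
from collections import deque

def simulate(requests, max_batch, continuous: bool) -> int:
    """Count decoding iterations to finish all requests, where each
    request is an output length in tokens."""
    queue = deque(requests)
    steps = 0
    if continuous:
        # Iteration-level scheduling: refill a slot as soon as one frees.
        active = [queue.popleft() for _ in range(min(max_batch, len(queue)))]
        while active:
            steps += 1
            active = [r - 1 for r in active if r > 1]  # evict finished
            while queue and len(active) < max_batch:
                active.append(queue.popleft())         # admit new work
        return steps
    # Static batching: the whole batch waits for its longest member.
    while queue:
        batch = [queue.popleft() for _ in range(min(max_batch, len(queue)))]
        steps += max(batch)
    return steps

mixed = [50, 2000, 60, 40]   # one long request among three short ones
print(simulate(mixed, max_batch=2, continuous=False))  # 2060 iterations
print(simulate(mixed, max_batch=2, continuous=True))   # 2000 iterations
```

Even in this tiny example the static scheduler burns 60 extra iterations, because the short requests in the first batch sit finished while the 2,000-token passenger rides to the end of the line; with more short requests queued behind a long one, the gap grows dramatically.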
Solving Fragmentation with PagedAttention
Even with continuous batching, we hit a wall: memory fragmentation. Standard memory allocation requires contiguous blocks of space. If the system reserves space for 2,048 tokens but the user only writes a 10-token prompt, that extra space is wasted. This "internal fragmentation" means you can't fit as many requests into a batch as you theoretically should.
Enter PagedAttention. This technique, popularized by the vLLM library, treats GPU memory like a virtual operating system treats RAM. It breaks the KV cache into fixed-size pages. These pages don't need to be next to each other in memory, allowing the system to allocate slots on demand. By eliminating the need to reserve massive, contiguous blocks of memory in advance, PagedAttention allows for much larger batch sizes and significantly higher throughput.
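A minimal sketch of the block-table idea (a toy model of the concept, not vLLM's actual implementation; class and method names are mine):

```python
class PagedKVCache:
    """Toy block-table allocator in the spirit of PagedAttention:
    each sequence maps logical positions to non-contiguous physical
    blocks, allocated only when the previous block fills up."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free physical block ids
        self.tables = {}                      # seq_id -> [physical ids]
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> int:
        """Reserve space for one more token; returns its physical block."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block is full
            table.append(self.free.pop())     # grab any free block
        self.lengths[seq_id] = n + 1
        return table[-1]

    def release(self, seq_id: int):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):                 # a 20-token sequence needs 2 blocks
    cache.append_token(seq_id=0)
print(len(cache.tables[0]))         # 2 blocks, not a 2,048-token slab
```

Because a sequence only ever holds the blocks it has actually filled, the waste per request shrinks from "maximum length minus actual length" to at most one partially filled block.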
Advanced Optimizations: Chunked Prefill and Prefix Sharing
When a new request enters a batch, it goes through a "prefill" phase where the model processes the initial prompt. If the prompt is huge, this prefill step can take a long time and hog all the GPU resources, causing a spike in latency for other requests currently in the decoding phase. Chunked Prefill solves this by breaking the initial prompt into smaller pieces, interleaving them with the decoding steps of other requests. This prevents the "stutter" that users feel when a massive prompt enters the system.
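One way to picture a chunked-prefill scheduler's inner loop: serve every in-flight decode its one token first, then spend whatever is left of a fixed per-iteration token budget on a slice of the pending prompt. The helper and the 512-token budget below are illustrative assumptions, not any framework's API:

```python
def schedule_step(prefill_tokens, decode_seqs, token_budget=512):
    """One toy scheduler iteration: decoding requests get one token
    each, and the remaining budget is spent on a prompt chunk."""
    budget = token_budget - len(decode_seqs)   # 1 token per decoding seq
    chunk = prefill_tokens[:max(budget, 0)]    # prefill only what fits
    remaining = prefill_tokens[len(chunk):]
    return chunk, remaining

prompt = list(range(2000))           # a 2,000-token prompt to prefill
decoding = ["req-a", "req-b"]        # two requests mid-generation
chunk, rest = schedule_step(prompt, decoding, token_budget=512)
print(len(chunk), len(rest))         # 510 1490
```

The big prompt now takes four iterations to prefill instead of one, but the two decoding requests never miss a step, which is exactly the latency "stutter" being traded away.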
Another clever trick is the use of prefix trees. In many real-world apps, users send prompts that start with the same instructions (e.g., "You are a helpful assistant that writes in the style of..."). Instead of calculating the KV cache for that same instruction for every single user, the system uses hashing to identify identical prefixes and stores them in a single shared block of memory. This is called prefix sharing, and it drastically reduces the memory footprint for common prompts.
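The hashing step can be sketched in a few lines. This is a toy cache that stores a placeholder per distinct prefix (real systems cache the actual K/V blocks, usually at block granularity; the class is mine):

```python
import hashlib

class PrefixCache:
    """Toy prefix cache: identical prompt prefixes hash to the same
    key, so their KV entry is computed and stored only once."""

    def __init__(self):
        self.blocks = {}          # prefix hash -> cached KV placeholder

    def get_or_compute(self, prefix_tokens: tuple):
        key = hashlib.sha256(repr(prefix_tokens).encode()).hexdigest()
        if key in self.blocks:
            return self.blocks[key], True      # cache hit: KV reused
        kv = f"kv-for-{key[:8]}"               # stand-in for real K/V
        self.blocks[key] = kv
        return kv, False

cache = PrefixCache()
system = ("You", "are", "a", "helpful", "assistant")
_, hit1 = cache.get_or_compute(system)   # first user: compute
_, hit2 = cache.get_or_compute(system)   # second user: reuse
print(hit1, hit2)                        # False True
```

With a thousand users sharing one system prompt, the prefix's KV memory is paid once instead of a thousand times, and each of those users also skips the prefill compute for that span.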
The Real-World Performance Impact
Does this actually work in production? The data says yes. Implementations like TensorRT-LLM and vLLM have shown that moving from static to continuous batching can result in 10-20x throughput gains. Some independent benchmarks from Anyscale have even reported improvements up to 23x. For a company running a cluster of H100s, this isn't just a technical win; it's a massive cost saving.
Recent research, such as the BatchLLM paper, has pushed this further by prioritizing requests with larger decoding-to-prefill ratios. By reordering requests and using memory-centric batching (batching by memory usage rather than just the number of requests), these systems can maximize the number of tokens processed per second even under extreme memory pressure.
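One way to picture the reordering idea: rank requests by decode-to-prefill ratio, then admit them against a memory budget rather than a request count. This is a simplified sketch of the concept, not the paper's actual algorithm; memory is estimated in tokens for brevity:

```python
def build_batch(requests, memory_budget):
    """Toy memory-centric batch builder. Each request is a tuple
    (request_id, prefill_tokens, expected_decode_tokens); requests
    with the largest decode-to-prefill ratio are admitted first,
    until the estimated KV budget (in tokens) is exhausted."""
    ranked = sorted(requests,
                    key=lambda r: r[2] / r[1],   # decode / prefill
                    reverse=True)
    batch, used = [], 0
    for rid, prefill, decode in ranked:
        cost = prefill + decode                  # peak KV tokens held
        if used + cost <= memory_budget:
            batch.append(rid)
            used += cost
    return batch

reqs = [("a", 1000, 100), ("b", 50, 400), ("c", 200, 300)]
print(build_batch(reqs, memory_budget=1000))     # ['b', 'c']
```

Batching by request count would happily pair the heavy request "a" with anything; batching by estimated memory keeps the batch inside what the KV cache can actually hold.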
Choosing the Right Infrastructure
If you're building your own inference stack, you have a choice. You can use open-source tools like vLLM, which give you a great balance of PagedAttention and continuous batching out of the box. Or, you can go for deeper hardware integration with TensorRT-LLM for maximum NVIDIA-specific performance. If you use a managed provider, you should ask them specifically how they handle KV cache management and whether they support chunked prefill. If they don't, you'll likely see higher Time To First Token (TTFT) as your prompt lengths increase.
What is the main difference between static and continuous batching?
Static batching waits for all requests in a group to finish before starting a new group, meaning a short request is delayed by a long one. Continuous batching allows requests to enter and leave the batch at any iteration, ensuring the GPU is always working on the next available token.
Does KV caching slow down the model?
No, it significantly speeds it up by eliminating the need to re-compute previous tokens. The only "downside" is that it consumes GPU memory, which can limit the total number of concurrent requests you can handle.
How does PagedAttention reduce memory waste?
PagedAttention stops the system from reserving a large, contiguous block of memory for the maximum possible sequence length. Instead, it allocates small, non-contiguous pages of memory only when needed, which drastically reduces internal fragmentation.
What is Time To First Token (TTFT) and how does it relate to batching?
TTFT is the time it takes for the model to generate the very first token after a prompt is submitted. Heavy prefill loads in a batch can increase TTFT. Techniques like chunked prefill help smooth this out so users don't experience long pauses before the text starts appearing.
Can I use continuous batching with any LLM?
Continuous batching is a property of the inference server (like vLLM or TensorRT-LLM), not the model itself. As long as the model uses a Transformer-based architecture (which almost all modern LLMs do), it can benefit from these serving optimizations.
Next Steps for Optimization
If you're seeing high latency but low GPU utilization, your first move should be implementing a serving framework that supports PagedAttention and continuous batching. Start by benchmarking your typical request lengths; if your users provide long prompts but short answers, prioritize chunked prefill. If you have a set of static "system prompts" that every user shares, implement prefix caching to save memory. Finally, monitor your Tokens Per Second (TPS) and adjust your batch size limits until you find the sweet spot where throughput is maximized without spiking the TTFT beyond your acceptable limit.
Susannah Greenwood
I'm a technical writer and AI content strategist based in Asheville, where I translate complex machine learning research into clear, useful stories for product teams and curious readers. I also consult on responsible AI guidelines and produce a weekly newsletter on practical AI workflows.