Throughput vs Latency: Optimizing LLM Inference Speed and Transformer Design
In the world of Large Language Models (LLMs), you can't simply "turn up the speed." You are always balancing a tug-of-war between how many tokens the system can pump out for everyone (throughput) and how quickly a single user gets their answer (latency). Understanding this balance is the secret to building an AI application that doesn't frustrate users or bankrupt the company.
The Basics: What Are We Actually Measuring?
Before tweaking settings, we need to be clear on the vocabulary. When people say "latency," they usually mean the total time from the moment a user sends a prompt to the moment the full response is finished. But that's too vague to act on. To actually fix speed issues, we break it down into two main parts:
- Time To First Token (TTFT): This is the gap between hitting 'Send' and seeing the very first character. It's essentially the "reaction time" of the model. If TTFT is high, the app feels laggy.
- Time Per Output Token (TPOT): Once the model starts talking, how long does it take to generate each subsequent word? This determines the "reading speed" of the AI.
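These two metrics are easy to measure yourself. Below is a minimal sketch that times any token stream; `measure_latency` and `fake_stream` are hypothetical stand-ins for whatever streaming API you actually use.

```python
import time

def measure_latency(token_stream):
    """Compute TTFT and mean TPOT from any iterator that yields tokens.

    `token_stream` is a placeholder: it just needs to yield one token
    at a time, the way most streaming LLM clients do.
    """
    start = time.perf_counter()
    arrival_times = []
    for _ in token_stream:
        arrival_times.append(time.perf_counter())
    ttft = arrival_times[0] - start  # "reaction time" of the model
    # TPOT: average gap between consecutive tokens ("reading speed").
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, tpot

# Toy stand-in stream: "thinks" for 0.2 s (prefill), then emits tokens.
def fake_stream():
    time.sleep(0.2)
    for _ in range(5):
        yield "tok"
        time.sleep(0.01)
```

With the fake stream above, TTFT comes out near 0.2 s and TPOT near 10 ms, mirroring the prefill/decode split discussed later.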
The Batch Size Dilemma: The Primary Lever
If you want to change how an LLM performs, the first knob you turn is the batch size. Batching is the process of grouping multiple user requests together and processing them in one go.

When you use a small batch (say, 1 or 2 requests), the GPU finishes the work quickly and latency is low. However, GPUs are monsters of parallel computation; they are designed to do thousands of things at once. Processing a tiny batch is like using a semi-truck to deliver a single envelope: a massive waste of resources.
On the flip side, if you crank the batch size up to 64, you maximize the GPU's efficiency. You're now delivering a whole truckload of envelopes. But here's the catch: the GPU now has more work to do per cycle, which pushes the latency up. In some tests on NVIDIA A100 GPUs, increasing the batch size to 64 can boost throughput by 14x, but it can also make the response time 4x slower.
| Batch Size | Hardware Utilization | Latency (TTFT/TPOT) | Throughput (Tokens/Sec) | Best Use Case |
|---|---|---|---|---|
| Small (1-4) | Low (Underutilized) | Very Low (Fast) | Low | Real-time Chatbots, Code Editors |
| Medium (8-32) | Moderate | Balanced | Moderate | General Purpose API |
| Large (64+) | High (Saturated) | High (Slower) | Very High | Offline Batch Processing, Summarization |
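To build intuition for the table above, here is a toy cost model: one decode step pays a fixed memory-bound cost to stream the weights, plus a small per-request compute cost. The constants are illustrative assumptions, not measurements from any real GPU.

```python
# Assumed, illustrative constants for one decode step.
WEIGHT_LOAD_MS = 20.0      # fixed cost: reading model weights from memory
COMPUTE_MS_PER_REQ = 0.5   # marginal compute cost per request in the batch

def step_time_ms(batch_size):
    """Latency of one decode step for the whole batch, in milliseconds."""
    return WEIGHT_LOAD_MS + COMPUTE_MS_PER_REQ * batch_size

for batch in (1, 8, 64):
    t = step_time_ms(batch)
    throughput = batch / (t / 1000)  # tokens/sec across all users
    print(f"batch={batch:3d}  step={t:5.1f} ms  throughput={throughput:7.0f} tok/s")
```

Even with these made-up numbers, the shape matches the table: going from batch 1 to batch 64 multiplies total throughput by roughly 25x while making each step only about 2.5x slower, because the fixed weight-loading cost dominates at small batches.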
Prefill vs. Decode: A Tale of Two Phases
To understand why the Transformer architecture behaves this way, we have to look at the two stages of inference: the prefill phase and the decode phase.

During the Prefill Phase, the model reads your entire prompt. Since the prompt contains many tokens, the GPU has plenty of work to do immediately; it can saturate its compute cores even with a single request. This means batching doesn't actually help throughput much during prefill; it just adds more work to the pile.
The Decode Phase is where the magic (and the bottleneck) happens. The model generates tokens one by one. To produce a single token, the GPU has to load the entire model's weights from memory. Loading weights is slow compared to actually calculating the math. If you only generate one token for one user, you're spending most of your time waiting for the memory to move, not calculating.
By increasing the batch size during the decode phase, you load those weights once but use them to generate tokens for 64 users at the same time. This amortizes the memory cost, causing throughput to climb almost linearly until the GPU finally hits its computational limit.
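The back-of-envelope math is worth doing once. Assuming a 70B-parameter model in fp16 and roughly 2 TB/s of memory bandwidth (both illustrative assumptions, not benchmarks), one full read of the weights bounds decode speed at batch size 1:

```python
# Rough, assumed figures for a 70B-class model on an HBM GPU.
params = 70e9            # parameter count
bytes_per_param = 2      # fp16 weights
hbm_bandwidth = 2.0e12   # ~2 TB/s memory bandwidth (assumption)

weight_bytes = params * bytes_per_param
# Decode must stream (roughly) all weights once per generation step.
step_seconds = weight_bytes / hbm_bandwidth

for batch in (1, 64):
    # The same weight read now produces one token for each of `batch` users.
    print(f"batch={batch:3d}: ~{batch / step_seconds:6.0f} tokens/sec total")
```

Under these assumptions, batch size 1 tops out around 14 tokens/sec regardless of how fast the compute units are, while batch 64 reuses the same weight read to approach 900 tokens/sec: that is the amortization in action.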
Scheduling and Memory Management
How the system decides which request to handle next (the scheduling policy) can drastically change the user experience. Older systems used "request-level batching," where the server wouldn't start a new set of requests until the previous batch was completely finished. This created a "stutter" where new users had to wait for everyone else's long responses to finish before their prompt was even read.

Modern systems use "iteration-level batching" (often called continuous batching), which can admit new requests and retire finished ones at every generation step. A great example is vLLM, which utilizes PagedAttention. This technique manages the KV (key-value) cache, the model's "short-term memory" of the conversation, more like a computer's virtual memory. Instead of allocating a giant, contiguous block of memory for every request, it breaks the cache into fixed-size pages. This allows the system to pack many more requests into the same GPU, driving throughput up by an order of magnitude.
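To make the paging idea concrete, here is a toy allocator in the spirit of PagedAttention. It is a sketch of the bookkeeping only, not vLLM's actual implementation; the class and method names are illustrative.

```python
class PagedKVCache:
    """Toy paged KV-cache allocator: sequences get fixed-size blocks on
    demand instead of one large contiguous reservation sized for the
    maximum possible response length."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size            # tokens per block
        self.free = list(range(num_blocks))     # pool of physical block ids
        self.tables = {}                        # seq_id -> list of block ids

    def append_token(self, seq_id, pos):
        """Reserve KV-cache space for token `pos` of a sequence."""
        table = self.tables.setdefault(seq_id, [])
        if pos % self.block_size == 0:          # current block is full
            if not self.free:
                raise MemoryError("KV cache exhausted")
            table.append(self.free.pop())       # grab any free physical block
        return table[-1]                        # block holding this token's KV

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool immediately."""
        self.free.extend(self.tables.pop(seq_id, []))
```

With 16-token blocks, a 20-token conversation occupies exactly two blocks; a naive allocator that reserves space for a 2,048-token maximum would waste the other 99% and fit far fewer concurrent requests.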
However, even with these tools, there is a conflict. If a scheduler prioritizes "prefills" (new requests), it makes the app feel snappy to start, but it can cause the "decodes" (ongoing responses) to stutter. If it prioritizes decodes, existing users get a smooth stream of text, but new users experience a long delay before their first token appears.
The Cost of Tensor Parallelism
When a model is too big for one GPU (like Llama2-70B), we use Tensor Parallelism (TP). This splits the model's layers across multiple GPUs. On the surface, this sounds like a win: more GPUs mean more memory and more compute, which should lower latency.

But there's a hidden tax: communication. Every time a layer finishes its partial computation, the GPUs have to sync their results using "all-reduce" operations. This happens twice per transformer layer: once after the attention block and once after the MLP block.
Here is the problem: the amount of data the GPUs need to swap doesn't actually shrink just because you added more GPUs. As you increase the number of GPUs in a TP group, the time spent calculating goes down, but the time spent communicating stays relatively flat or increases. Eventually, you hit a point of diminishing returns where you're adding hardware, but the GPUs are spending more time "chatting" than "thinking." This degrades the overall throughput efficiency and can actually make your cost-per-token skyrocket.
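A toy scaling model shows where the diminishing returns come from: compute splits across GPUs, but the two per-layer all-reduces do not. The constants below are illustrative assumptions, not benchmarks of any real interconnect.

```python
# Assumed, illustrative per-layer costs.
COMPUTE_MS = 8.0      # per-layer compute time on a single GPU
ALLREDUCE_MS = 1.0    # cost of one all-reduce; roughly flat with TP degree

def layer_time_ms(tp_degree):
    """Per-layer latency under tensor parallelism: compute divides across
    GPUs, but two all-reduces per layer are paid whenever tp_degree > 1."""
    comm = 0.0 if tp_degree == 1 else 2 * ALLREDUCE_MS
    return COMPUTE_MS / tp_degree + comm

for tp in (1, 2, 4, 8):
    print(f"TP={tp}: {layer_time_ms(tp):.2f} ms/layer")
```

In this sketch, going from 1 to 8 GPUs only cuts per-layer latency from 8.0 ms to 3.0 ms, a 2.7x speedup for 8x the hardware, because the flat communication term increasingly dominates. That is the "chatting versus thinking" tax in numbers.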
Matching Design to the Use Case
There is no such thing as the "fastest" configuration, only the one that fits your goal.

If you are building a coding assistant or a customer service chatbot, you are optimizing for interactivity: low TTFT matters far more than raw throughput. You care about a feeling of instant responsiveness. In this scenario, you should use smaller batch sizes and perhaps over-provision your hardware to ensure users never wait more than a few hundred milliseconds for that first token.
If you are building a bulk document summarizer or an automated content generator, you are optimizing for throughput. You don't care if the first token takes 5 seconds to appear, as long as the system processes 10,000 documents per hour. Here, you should push your batch sizes to the limit and use aggressive iteration-level batching to maximize GPU utilization.
The Future: Breaking the Tradeoff
We are starting to see a shift from general-purpose batching to specialized scheduling. Research like Sarathi-Serve is trying to tame this tradeoff by optimizing how prefills and decodes are interleaved. By being smarter about when to insert a new request into the pipeline, Sarathi-Serve has shown the ability to improve throughput by up to 6.9x for massive models like Falcon-180B while still meeting strict latency targets.

Similarly, new architectures like Ulysses are rethinking how distributed GPUs communicate to reduce the communication overhead of tensor parallelism. We are moving toward a world where the "Pareto frontier" (the boundary of what is possible) is shifting, allowing us to get both high speed and high volume.
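The chunked-prefill idea behind Sarathi-Serve can be sketched in a few lines: each iteration has a fixed token budget, ongoing decodes are admitted first, and the leftover budget is filled with a slice of any pending prompt. The function name and budget below are illustrative, not the paper's API.

```python
def build_iteration(prefill_tokens_left, num_decodes, token_budget=512):
    """Plan one model step: decodes get priority (1 token each), and the
    remaining budget is spent on a chunk of the pending prefill."""
    decode_slots = min(num_decodes, token_budget)
    prefill_chunk = min(prefill_tokens_left, token_budget - decode_slots)
    return prefill_chunk, decode_slots

# A 2000-token prompt arrives while 48 decodes are in flight.
remaining, steps = 2000, 0
while remaining > 0:
    chunk, decodes = build_iteration(remaining, num_decodes=48)
    remaining -= chunk
    steps += 1
# The prefill is spread over several steps, so the 48 ongoing decodes
# keep producing a token every iteration instead of stalling.
```

Compare this with prefill-priority scheduling, where the entire 2,000-token prompt would occupy one long iteration and every streaming user would see their response freeze for its duration.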
Why does a large batch size increase latency?
A larger batch size means the GPU has to perform more calculations for each step before it can move to the next. While it processes more tokens overall (higher throughput), the time it takes to complete one full cycle for the entire group increases, meaning each individual user waits longer for their specific token to be generated.
What is the difference between TTFT and TPOT?
TTFT (Time To First Token) is the initial delay before the model starts responding, which is heavily influenced by the prefill phase. TPOT (Time Per Output Token) is the speed at which the model generates subsequent tokens, which is dominated by the decode phase and memory bandwidth.
How does PagedAttention help with throughput?
PagedAttention prevents memory fragmentation in the KV cache. By storing the conversation history in non-contiguous "pages" rather than one giant block, the system can fit significantly more requests into the GPU memory at once, allowing for larger batches and higher throughput.
Does adding more GPUs always reduce latency?
Not necessarily. While Tensor Parallelism splits the workload to reduce compute time, it introduces communication overhead. If the time spent on "all-reduce" operations (syncing GPUs) outweighs the time saved on computation, adding more GPUs can actually lead to diminishing returns or even slower performance.
Which metric is most important for a real-time chatbot?
For interactive applications, TTFT is the most critical metric because it directly impacts the perceived responsiveness of the AI. Users are much more likely to perceive an app as "broken" or "slow" if there is a long pause before the first token appears.
Susannah Greenwood
I'm a technical writer and AI content strategist based in Asheville, where I translate complex machine learning research into clear, useful stories for product teams and curious readers. I also consult on responsible AI guidelines and produce a weekly newsletter on practical AI workflows.