- Home
- AI & Machine Learning
- Choosing Batch Sizes to Minimize Cost per Token in LLM Serving
Choosing Batch Sizes to Minimize Cost per Token in LLM Serving
Every time your application calls a large language model, you are paying for more than just the text it generates. You are paying for idle silicon, wasted memory bandwidth, and computational overhead that vanishes into thin air if you aren't careful. The difference between a profitable AI product and a money-losing one often comes down to a single configuration variable: batch size.
If you are running inference on GPUs, whether self-hosted or via cloud APIs, treating every user request as an isolated event is like driving a bus with only one passenger. It works, but it is incredibly expensive per mile. By grouping requests together, you force the GPU to do meaningful work instead of waiting around. This guide breaks down exactly how to choose the right batch size to slash your cost per token without breaking your latency SLAs.
The Economics of Batching: Why Small Batches Bleed Money
To understand why batch sizing matters, you have to look at what happens inside the GPU during inference. Large Language Models (LLMs) operate in two distinct phases: prefilling and decoding.
In the prefill phase, the model processes the entire input prompt. This is compute-bound, meaning the GPU’s processing cores are working at full capacity. In the decoding phase, the model generates tokens one by one. This phase is memory-bandwidth bound. The GPU spends most of its time moving data from VRAM to the processor rather than doing math.
When you process requests individually (batch size = 1), the GPU sits idle while waiting for data to move. It is like a chef chopping vegetables (compute) and then standing still while waiting for the oven to preheat (memory transfer). If you batch 32 requests together, the chef chops all 32 batches of vegetables before checking the oven. The compute units stay busy, and the memory transfers happen in parallel.
The financial impact is stark. Research from Koombea AI indicates that proper batching can reduce API overhead costs by up to 90%. For enterprise deployments, this isn't just about saving pennies; it is about viability. As noted by Nitesh Singhal, serving a 70B parameter model on A100 GPUs can cost $2,000-$3,000 per day. Without batching, those costs skyrocket because you need more GPUs to handle the same throughput.
Finding Your Sweet Spot: Optimal Batch Sizes by Task
There is no universal "best" batch size. The optimal number depends entirely on what your model is doing. Different tasks have different latency tolerances and computational profiles. Here is how to map your workload to the right range:
- Text Generation (Creative Writing, Storytelling): Aim for batch sizes of 10-50. These tasks require longer output sequences, which increases the time each request stays in the batch. Smaller batches prevent excessive tail latency.
- Classification & Tagging: Aim for batch sizes of 100-500. Since the output is short (often just one token or a few words), the decoding phase is negligible. You can pack these tightly to maximize throughput.
- Q&A Systems (Chatbots, Support Bots): Aim for batch sizes of 50-200. These sit in the middle ground. Users expect quick responses, but the outputs are usually concise enough to allow moderate batching.
Benchmark studies provide concrete evidence of these ranges. For a 7B parameter model, latency drops from 976 ms at batch size 1 to 126 ms at batch size 8. However, diminishing returns kick in hard after batch size 64. At that point, the GPU runs out of memory bandwidth headroom, and throughput plateaus. Pushing beyond this limit without upgrading hardware leads to out-of-memory errors, not speed gains.
Static vs. Dynamic vs. Continuous Batching
How you group requests is just as important as how many you group. There are three main strategies, each with trade-offs:
| Strategy | Best For | Latency Impact | Throughput Gain |
|---|---|---|---|
| Static Batching | Predictable workloads (e.g., nightly document processing) | High (waits for full batch) | Moderate |
| Dynamic Batching | Variable traffic (e.g., customer support chats) | Medium (groups available requests) | High |
| Continuous Batching | Real-time interactive apps | Low (inserts new requests as others finish) | Very High (up to 23x improvement) |
Continuous batching is the current gold standard for production systems. Unlike static batching, which waits for a fixed number of requests to arrive, continuous batching dynamically inserts new sequences into the GPU memory as soon as other sequences finish generating. This keeps the GPU utilization near 100% even when request lengths vary wildly. Tools like vLLM and TensorRT-LLM implement this natively, achieving up to 24x higher throughput than standard Hugging Face implementations.
Hardware Constraints: The Memory Wall
You cannot simply increase batch size infinitely. The limiting factor is always GPU VRAM. Every active request consumes memory for two things: the model weights (static) and the Key-Value (KV) cache (dynamic).
The KV cache stores the attention states for every token generated so far. As a conversation gets longer, the KV cache grows. If you have a batch size of 32, and each request has a context window of 4,000 tokens, you are holding 128,000 tokens' worth of state in memory. Exceed this limit, and the system crashes.
This creates a "tug-of-war" described by Databricks engineers: larger batch sizes increase throughput but also increase the KV cache size, which may require adding more GPUs. Adding more GPUs increases cost. The goal is to find the maximum batch size that fits within your existing hardware budget.
Recent research published on arXiv in February 2025 highlights that selecting the right GPU type interacts heavily with batch sizing. Consumer-grade GPUs often offer superior memory bandwidth per unit price compared to enterprise cards like the A100 or H100. For smaller models, using multiple consumer GPUs with optimized batching can be more cost-efficient than a single high-end card.
Advanced Tactics: Compounding Savings
Batching is powerful, but it works best when combined with other optimization techniques. Here are three strategies to layer on top of your batch size tuning:
- Model Cascading: Don't send every query to your most expensive model. Route 90% of simple queries to a smaller, cheaper model like Mistral 7B (costing ~$0.00006 per 300 tokens) and escalate only complex tasks to premium models like GPT-4. When you batch these smaller model requests, the cost savings compound dramatically-up to 87% reduction in total spend.
- Early Stopping: Configure your generation parameters to halt token production once a satisfactory completion is reached. This can reduce output tokens by 20-40%, shrinking the decoding phase and allowing you to fit more requests into each batch cycle.
- Caching with RAG: Use Retrieval-Augmented Generation (RAG) to cache responses to repetitive queries. If a user asks the same question twice, serve the cached answer instantly. This reduces the load on your inference engine, allowing you to dedicate more batch slots to unique, complex queries.
Implementation Roadmap: From Theory to Practice
Implementing effective batching is not a set-and-forget task. It requires monitoring and adjustment. Here is a practical step-by-step approach for engineering teams:
- Baseline Measurement: Run your current workload with batch size = 1. Record the average latency, p95 latency, and cost per token. This is your control group.
- Incremental Testing: Increase the batch size in steps of 4 or 8. After each increase, measure the change in latency and throughput. Stop increasing when you see a sharp spike in p95 latency or when GPU memory usage exceeds 85%.
- Profile Sequence Lengths: Analyze your historical data. What is the average input length? What is the average output length? If your inputs are highly variable, static batching will hurt your latency. Switch to dynamic or continuous batching libraries like vLLM.
- Monitor KV Cache Pressure: Use tools like NVIDIA Nsight Systems to monitor memory bandwidth utilization. If you see bottlenecks in memory transfer rather than compute, you have hit the ceiling for your current batch size.
- Automate Adjustments: For production systems, consider implementing auto-scaling policies that adjust batch sizes based on real-time queue depth. If the queue is empty, lower the batch size to reduce latency. If the queue is long, increase the batch size to clear the backlog.
Expect a learning curve. One fintech engineering lead reported spending three weeks tuning their batch size before finding the optimal setting of 35 for their support ticket classifier. The result was a 58% cost reduction. Another team reduced their monthly OpenAI bill from $18,500 to $9,200 by locking in a batch size of 22 for document summarization. The key is patience and rigorous measurement.
Future Trends: Automated Optimization
The industry is moving toward automated batch management. Emerging tools use mixed-integer linear programming to schedule requests dynamically, achieving up to 41% higher throughput within the same price budget. Anthropic and other major providers are integrating these features directly into their inference engines, promising less manual tuning for developers.
However, understanding the fundamentals remains critical. Even with automation, you need to know why a certain batch size is chosen to debug issues when they arise. The physics of GPU utilization won't change overnight. Memory bandwidth will remain the bottleneck, and batching will remain the primary lever for pulling value out of that hardware.
What is the ideal batch size for LLM inference?
There is no single ideal size, but general guidelines suggest 10-50 for text generation, 50-200 for Q&A, and 100-500 for classification. The optimal size is determined by your GPU's memory capacity and your application's latency requirements. Always test incrementally to find the point where throughput maximizes without causing out-of-memory errors or unacceptable latency spikes.
Does batching increase latency?
Yes, batching typically increases the latency for individual requests because they must wait for the batch to fill or for the GPU to process other requests in the batch. However, it significantly improves overall throughput (tokens per second). For real-time applications, continuous batching mitigates this by inserting new requests as soon as space becomes available, keeping latency low while maintaining high efficiency.
How does batch size affect GPU memory usage?
Batch size directly impacts the Key-Value (KV) cache memory consumption. Each active request in a batch stores its attention states in VRAM. Larger batch sizes mean more concurrent requests, which exponentially increases memory pressure. If the KV cache exceeds available VRAM, the system will crash. Therefore, the maximum feasible batch size is strictly limited by your GPU's memory capacity and the sequence length of your inputs.
What is continuous batching and why is it better?
Continuous batching is an advanced technique that dynamically manages request queues. Instead of waiting for a fixed batch to complete, it allows new requests to join the GPU computation as soon as previous requests finish generating tokens. This keeps the GPU fully utilized even with variable-length requests, resulting in up to 23x higher throughput and lower p50 latency compared to traditional static batching methods.
Can I use batching with OpenAI's API?
Yes, OpenAI offers a specific Batch API designed for non-interactive, high-volume tasks. While you don't manually configure the batch size on the server side, you submit a file containing multiple requests, and OpenAI processes them efficiently over a 24-hour period. This method can reduce costs by up to 50% compared to standard real-time API calls, making it ideal for background jobs like document analysis or data labeling.
Susannah Greenwood
I'm a technical writer and AI content strategist based in Asheville, where I translate complex machine learning research into clear, useful stories for product teams and curious readers. I also consult on responsible AI guidelines and produce a weekly newsletter on practical AI workflows.
Popular Articles
About
EHGA is the Education Hub for Generative AI, offering clear guides, tutorials, and curated resources for learners and professionals. Explore ethical frameworks, governance insights, and best practices for responsible AI development and deployment. Stay updated with research summaries, tool reviews, and project-based learning paths. Build practical skills in prompt engineering, model evaluation, and MLOps for generative AI.