LLM Inference Observability: Tracking Token Metrics, Queues, and Tail Latency
When your Large Language Model (LLM) starts lagging, looking at "requests per second" won't save you. That metric lies. It can look perfectly stable while your system is actually choking on a single massive prompt, leaving other users staring at frozen screens. To keep your AI services fast and cost-effective, you need to stop treating LLMs like standard web servers. You need LLM observability, which focuses on token-level metrics, queue dynamics, and tail latency.
As of 2026, running LLMs in production is less about raw compute power and more about managing the unpredictable nature of text generation. A request for a short email summary takes milliseconds; a request for a complex code refactor can take seconds and consume thousands of tokens. This variability breaks traditional monitoring tools. If you aren't tracking how tokens move through your system, you're flying blind.
Why Token Metrics Matter More Than Request Counts
The core unit of work in an LLM isn't the HTTP request; it's the token. When you monitor only request volume, you miss the actual load on your GPU. One user might send a 10-token prompt, while another sends a 10,000-token context window. Both count as "one request," but the second consumes hundreds of times more resources.
You need to track three specific token metrics to understand your system's health:
- Prompt Tokens: The input data sent to the model. High counts here increase the time-to-first-token (TTFT).
- Completion Tokens: The output generated by the model. This drives the duration of the streaming response.
- Total Token Throughput: The aggregate number of tokens processed per second. This is your true measure of system capacity.
If your requests-per-second graph is flat but your token throughput drops, your system is saturated. Tools like vLLM and Text Generation Inference (TGI) export these metrics explicitly. For example, TGI provides tgi_request_generated_tokens, while OpenTelemetry standards use gen_ai.client.token.usage. Without these granular counters, you cannot distinguish between a traffic spike and a resource bottleneck.
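If your serving stack doesn't already export these counters, they are straightforward to emit yourself. Below is a minimal sketch using Python's prometheus_client; the metric names and labels are illustrative assumptions, not the exact series exported by vLLM or TGI.

```python
# Minimal sketch: tracking token-level load with prometheus_client.
# Metric and label names here are illustrative assumptions, not the
# exact counters exported by vLLM or TGI.
from prometheus_client import Counter, start_http_server

PROMPT_TOKENS = Counter(
    "llm_prompt_tokens_total", "Input tokens received", ["model"]
)
COMPLETION_TOKENS = Counter(
    "llm_completion_tokens_total", "Output tokens generated", ["model"]
)

def record_usage(model: str, prompt_tokens: int, completion_tokens: int) -> None:
    """Call once per completed request with the token counts it consumed."""
    PROMPT_TOKENS.labels(model=model).inc(prompt_tokens)
    COMPLETION_TOKENS.labels(model=model).inc(completion_tokens)

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for Prometheus to scrape
    record_usage("example-model", prompt_tokens=10_000, completion_tokens=250)
```

Total token throughput then falls out of a rate query over the sum of both counters, for example rate(llm_prompt_tokens_total[1m]) + rate(llm_completion_tokens_total[1m]).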
Decoding Latency: TTFT vs. Inter-Token Delay
Latency in LLMs isn't a single number. It’s a two-part experience for the user. First, they wait for the first word to appear. Then, they wait for the rest of the sentence to stream in. These two phases have different causes and require different fixes.
Time-to-First-Token (TTFT) is the initial delay before any output appears. This is heavily influenced by the length of the input prompt. Research shows that for every additional input token, the P95 TTFT increases by approximately 0.24ms. If your TTFT jumps from milliseconds to several seconds, users perceive the system as broken or frozen. This is often the first sign of queue saturation. Monitor this using histograms like vllm:time_to_first_token_seconds or gen_ai.server.time_to_first_token.
Inter-Token Latency determines how fluid the response feels. Even if the total request completes successfully, high inter-token latency makes the text appear choppy and fragmented. This metric reflects the model's decoding speed and GPU memory bandwidth. Track this with tgi_request_mean_time_per_token_duration. If this spikes, your model is struggling to generate the next word efficiently, likely due to memory constraints or inefficient batching.
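Both numbers can also be measured from the client side of any streaming API. The sketch below assumes a generic token iterator (stream_tokens is a hypothetical stand-in for your client's streaming call) and derives TTFT plus mean inter-token latency from wall-clock timestamps.

```python
# Minimal sketch: client-side TTFT and inter-token latency from any
# streaming token iterator. `stream_tokens` is a hypothetical stand-in
# for your client's streaming API.
import time
from statistics import mean

def measure_stream(stream_tokens):
    start = time.perf_counter()
    ttft = None
    gaps, last = [], None
    for _token in stream_tokens:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start       # time-to-first-token
        else:
            gaps.append(now - last)  # inter-token latency samples
        last = now
    return ttft, mean(gaps) if gaps else None
```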
The Hidden Danger of Tail Latency
Average latency is a vanity metric. Nobody cares if your average response time is 200ms if the 99th percentile (P99) is 5 seconds. In LLM systems, tail latency defines the actual user experience. It represents the worst-case scenarios where users abandon their sessions.
Tail latency is driven by "heavy-tailed" distributions in token length. A small percentage of requests ask for extremely long outputs. These outliers block the GPU, causing subsequent requests to queue up. LogicMonitor and Kong both emphasize that you must monitor percentile distributions (P50, P95, and P99) to catch these issues. A terrible P99 can hide behind a perfectly healthy average while your users are unhappy.
To manage tail latency, you need to identify the source of the delay. Is it the model inference itself? Prompt processing? External tool calls? Break down end-to-end latency into components. Use tgi_request_duration to measure the full journey, but drill down into sub-components to find the bottleneck.
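To see why averages hide the tail, consider a toy distribution: 99% fast requests plus 1% pathologically long generations. The numbers below are invented for illustration.

```python
# Minimal sketch: comparing average latency to the tail. With a
# heavy-tailed distribution, the mean stays flat while P99 explodes.
import random

random.seed(7)
# 99% fast requests, 1% pathological long generations (assumed shape)
samples = [random.uniform(0.1, 0.3) for _ in range(990)] + \
          [random.uniform(4.0, 6.0) for _ in range(10)]

def percentile(values, pct):
    ordered = sorted(values)
    return ordered[min(int(pct / 100 * len(ordered)), len(ordered) - 1)]

print(f"mean: {sum(samples) / len(samples):.2f}s")  # looks healthy
print(f"P50:  {percentile(samples, 50):.2f}s")      # also looks healthy
print(f"P99:  {percentile(samples, 99):.2f}s")      # the real experience
```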
Queue Dynamics and Batching Strategies
LLM inference engines don't process requests one by one. They batch them to maximize GPU utilization. This creates a queuing dynamic that directly impacts latency. When a large request enters the queue, it delays all smaller requests behind it.
Queueing theory reveals a critical trade-off. Enforcing a maximum output token limit can significantly reduce queuing delay. By cutting off excessively long generations, you free up the GPU for other users. However, set the limit too low, and you degrade response quality. Set it too high, and you increase waiting times, causing impatient users to leave.
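At the serving layer, the cap itself is a one-line policy. A minimal sketch, assuming a tunable limit; the value 1024 is an illustrative placeholder, not a recommendation:

```python
# Minimal sketch: enforcing a server-side cap on requested output length.
# MAX_COMPLETION_TOKENS is an assumed policy value you would tune against
# your own queue-delay and quality measurements.
MAX_COMPLETION_TOKENS = 1024

def clamp_max_tokens(requested: int | None) -> int:
    """Never let one request monopolize the GPU with an unbounded generation."""
    if requested is None:
        return MAX_COMPLETION_TOKENS
    return min(requested, MAX_COMPLETION_TOKENS)
```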
Monitor these queue indicators closely:
- Queue Size: The number of pending requests. A growing queue indicates saturation.
- Batch Size: The number of requests processed together. Larger batches improve throughput but can increase latency for individual requests.
- Queue Wait Time: How long a request sits before execution begins. This reveals delays caused by waiting for available replicas.
Tools like TGI expose both queue size and batch size as first-class indicators. Use these to tune your continuous batching configuration. If your queue wait time spikes during peak hours, consider increasing replica count or adjusting max batch sizes to prioritize responsiveness over raw throughput.
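If your engine doesn't expose these indicators, you can instrument the admission path yourself. A minimal sketch with prometheus_client; the on_enqueue/on_dequeue hooks are assumptions about where your serving code admits requests into a batch:

```python
# Minimal sketch: queue indicators with prometheus_client. The hook
# points (enqueue/dequeue) are assumptions about where your serving
# code admits requests into a batch.
import time
from prometheus_client import Gauge, Histogram

QUEUE_SIZE = Gauge("llm_queue_size", "Requests waiting for a batch slot")
QUEUE_WAIT = Histogram(
    "llm_queue_wait_seconds", "Time from arrival to execution start",
    buckets=(0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10),
)

def on_enqueue(request) -> None:
    request.enqueued_at = time.perf_counter()
    QUEUE_SIZE.inc()

def on_dequeue(request) -> None:
    QUEUE_SIZE.dec()
    QUEUE_WAIT.observe(time.perf_counter() - request.enqueued_at)
```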
| Metric | What It Measures | Impact on User Experience | Common Tool Export |
|---|---|---|---|
| Time-to-First-Token (TTFT) | Initial latency before first output | Perceived system responsiveness | vllm:time_to_first_token_seconds |
| Inter-Token Latency | Time between subsequent tokens | Fluidity of streamed text | tgi_request_mean_time_per_token_duration |
| Token Throughput | Total tokens processed per second | System capacity and cost efficiency | gen_ai.client.token.usage |
| Queue Wait Time | Time spent waiting for GPU | Overall request delay | Custom Prometheus gauge |
| P99 Latency | Worst-case response time | User abandonment rate | tgi_request_duration histogram |
Implementing Effective Observability
Building an observability stack for LLMs requires integrating multiple data sources. Start with standardized instrumentation. Use OpenTelemetry semantic conventions for GenAI to ensure compatibility across tools. This includes metrics like gen_ai.server.time_per_output_token and gen_ai.server.time_to_first_token.
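Recording those histograms through the OpenTelemetry Python API looks roughly like this; the sketch assumes an SDK and exporter are configured elsewhere in the process:

```python
# Minimal sketch: recording the GenAI semantic-convention histograms via
# the OpenTelemetry Python API. Assumes an SDK/exporter is configured
# elsewhere in the process.
from opentelemetry import metrics

meter = metrics.get_meter("llm.inference")
ttft_hist = meter.create_histogram(
    "gen_ai.server.time_to_first_token", unit="s",
    description="Time to generate the first token",
)
tpot_hist = meter.create_histogram(
    "gen_ai.server.time_per_output_token", unit="s",
    description="Time per output token after the first",
)

def record_latencies(ttft: float, time_per_token: float, model: str) -> None:
    attrs = {"gen_ai.request.model": model}
    ttft_hist.record(ttft, attributes=attrs)
    tpot_hist.record(time_per_token, attributes=attrs)
```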
Next, connect your inference engine (like vLLM or TGI) to a metrics collector such as Prometheus. Ensure you are exporting histograms, not just averages. Histograms allow you to calculate percentiles and visualize the distribution of latency. Without histograms, you cannot detect tail latency issues.
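Histograms suffice because percentiles can be interpolated from cumulative bucket counts, which is what Prometheus's histogram_quantile function does at query time. A minimal sketch of that interpolation, with invented bucket data:

```python
# Minimal sketch: why histograms recover percentiles. Given cumulative
# bucket counts (as Prometheus stores them), interpolate a quantile the
# way histogram_quantile does. Bucket data here is invented for illustration.
buckets = [(0.1, 400), (0.5, 900), (1.0, 960), (5.0, 990), (float("inf"), 1000)]

def quantile_from_buckets(q: float, buckets) -> float:
    target = q * buckets[-1][1]
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= target:
            # linear interpolation inside the bucket that crosses the target
            frac = (target - lower_count) / (count - lower_count)
            return lower_bound + frac * (upper_bound - lower_bound)
        lower_bound, lower_count = upper_bound, count

print(f"P95: {quantile_from_buckets(0.95, buckets):.2f}s")  # ~0.92s
print(f"P99: {quantile_from_buckets(0.99, buckets):.2f}s")  # 5.00s
```

An average of the same samples would collapse all of this structure into a single number, which is exactly why exporting only means makes tail problems invisible.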
Finally, establish alerts based on token budgets and latency thresholds. Different models have vastly different cost profiles. Monitoring cost per interaction helps prevent runaway expenses. Set alerts for anomalous spend patterns and SLO compliance violations. If your P95 TTFT exceeds 500ms, trigger an investigation. If token usage spikes unexpectedly, check for inefficient prompts or malicious inputs.
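The alert itself can be as simple as a scheduled threshold check. A minimal sketch, where query_p95_ttft is a hypothetical helper that reads P95 TTFT in seconds from your metrics backend:

```python
# Minimal sketch: a threshold check you might run from a cron job or
# alerting pipeline. `query_p95_ttft` is a hypothetical helper that reads
# the current P95 TTFT (in seconds) from your metrics backend.
TTFT_P95_SLO_SECONDS = 0.5  # the 500ms threshold from the text

def check_ttft_slo(query_p95_ttft) -> None:
    p95 = query_p95_ttft()
    if p95 > TTFT_P95_SLO_SECONDS:
        # In production this would page or open an incident instead.
        print(f"ALERT: P95 TTFT {p95:.3f}s exceeds {TTFT_P95_SLO_SECONDS}s SLO")
```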
Observability isn't optional. It's the foundation of reliable LLM deployment. By focusing on token metrics, queue behavior, and tail latency, you gain the visibility needed to optimize performance, control costs, and deliver a seamless user experience.
Why is "requests per second" a bad metric for LLMs?
Requests per second ignores the variable workload of each request. A simple question and a complex analysis both count as one request, but they consume vastly different amounts of GPU resources and time. Token throughput is a more accurate measure of system load.
What is the difference between TTFT and inter-token latency?
TTFT (Time-to-First-Token) is the delay before the first word appears, primarily affected by input prompt length. Inter-token latency is the time between subsequent words, affecting how smooth the streaming response feels. Both are critical for user experience but stem from different bottlenecks.
How does tail latency impact user retention?
Tail latency refers to the slowest responses (e.g., P99). Users rarely notice average performance but immediately react to occasional extreme delays. High tail latency causes users to perceive the system as unreliable, leading to abandonment and lost trust.
What role do queues play in LLM inference?
Queues manage the flow of requests to the GPU. Because LLMs batch requests for efficiency, a large request can delay many smaller ones. Monitoring queue size and wait time helps identify saturation and optimize batching strategies to balance throughput and latency.
Which tools support LLM observability metrics?
Popular inference engines like vLLM and Text Generation Inference (TGI) export detailed metrics. Standardized frameworks like OpenTelemetry provide semantic conventions for GenAI. Monitoring platforms like Prometheus, Grafana, and specialized AI observability tools (e.g., BentoML, Freeplay.ai) help visualize and alert on these metrics.