The Case Against Batching: Why Large Batches Kill LLM Inference Performance in Production
Site Owner
Published 2026-04-25
Most LLM inference guides recommend large batches for throughput, but in production — with bursty traffic, P99 latency requirements, and variable sequence lengths — large batches actively hurt performance. This article explains the four mechanisms and offers concrete alternatives.
Every LLM inference guide tells you the same thing: increase your batch size. More throughput, better GPU utilization, lower cost per token. And they're not wrong — up to a point.
But somewhere between "works in benchmarks" and "running in production," the advice starts to rot.
I learned this the hard way. After shipping three different LLM-powered products — a coding assistant, a document pipeline, and a real-time chat layer — I kept hitting the same wall: latency that spiked unpredictably, throughput that collapsed under realistic traffic patterns, and GPUs that sat idle while requests queued up. The fix in every case was counterintuitive: run smaller batches, or skip batching entirely.
Here's the technical story of why batching fails in production, and what actually works.
The Theoretical Case For Batching
The argument for batching is mathematically solid. Modern GPUs are built for large matrix multiplications, and at small batch sizes an inference kernel is memory-bandwidth-bound: the dominant cost is streaming the model weights from HBM, which happens once per step regardless of how many sequences are in flight. A batch of 64 sequences therefore uses roughly the same GPU time per step as a batch of 1, because the extra rows in each matrix multiply are nearly free until the kernel becomes compute-bound. This is why training with large micro-batches feels efficient.
For inference, the math gets more nuanced. You have two distinct computational phases:
- Prefill: Processing the input prompt. Highly parallel, benefits from batching.
- Decode: Autoregressive token generation. Sequential by nature, parallelism comes from batching many sequences together.
The argument holds best for the prefill phase. For decode, it's murkier — and in most production LLM workloads, decode dominates wall-clock time. A typical user prompt might be 512 tokens and the response another 256: fewer tokens, but every output token requires its own sequential forward pass, while the entire prompt is processed in a single pass. Decode is also where batching's benefits erode fastest.
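A back-of-the-envelope sketch shows why batched decode looks nearly free on paper: the weights are streamed from HBM once per step no matter how many sequences are active, so aggregate throughput climbs much faster than per-step latency. The model size, bandwidth, and KV-cache figures below are illustrative assumptions, not measurements of any particular GPU.
# Rough memory-bound model of one decode step (illustrative numbers only)
PARAM_BYTES = 14e9         # e.g. a 7B-parameter model in fp16
HBM_BANDWIDTH = 2.0e12     # a ~2 TB/s-class accelerator
KV_BYTES_PER_SEQ = 0.5e9   # KV cache read per sequence at a moderate context length

def decode_step_ms(batch_size):
    # Weights are read once per step regardless of batch size;
    # the KV cache is read once per active sequence.
    bytes_moved = PARAM_BYTES + batch_size * KV_BYTES_PER_SEQ
    return bytes_moved / HBM_BANDWIDTH * 1e3

for b in (1, 8, 32, 64):
    ms = decode_step_ms(b)
    print(f"batch {b:>2}: ~{ms:.1f} ms/step, ~{b / ms * 1e3:.0f} tok/s aggregate")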
The Four Reasons Batching Breaks in Production
1. Attention is O(n²) — and Batching Makes It Worse
Flash Attention changed the game for training, but even with Flash Attention 2, attention compute still scales quadratically with sequence length and linearly with batch size. On top of that, every decode step has to re-read the KV cache, whose footprint is:
Memory = 2 * batch_size * seq_len * num_layers * num_heads * head_dim * bytes_per_param
With a batch of 32 and sequence length 2048, you're not just doing 32x more attention work; you're managing 32 KV caches competing for HBM bandwidth. On H100s, the attention kernel overhead becomes measurable at batch sizes above 16 for long sequences.
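Putting numbers on that batch-of-32 example, using illustrative dimensions for a 7B-class model (32 layers, 32 heads, head dim 128, fp16) rather than any specific deployment:
def kv_cache_gb(batch_size, seq_len, num_layers=32, num_heads=32,
                head_dim=128, bytes_per_param=2):
    # 2x for the K and V tensors, per layer, per head, per token
    total = 2 * batch_size * seq_len * num_layers * num_heads * head_dim * bytes_per_param
    return total / 1e9

for b in (1, 8, 32):
    print(f"batch {b:>2}, seq 2048: {kv_cache_gb(b, 2048):.1f} GB of KV cache")
# prints roughly 1.1 GB, 8.6 GB, and 34.4 GB respectively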
The result: decode throughput stops scaling linearly with batch size. You'll see 2x batch → 1.7x throughput, then 1.3x, then flat.
2. P99 Latency is What Kills User Experience
Batching optimizes for average throughput. Production SLAs are almost never about average latency — they're about P99 or P99.9. When you batch 32 requests together, the 31 fastest requests wait for the slowest one to finish. Your average token latency might look fine; your P99 looks catastrophic.
This is especially brutal for interactive applications (chat, coding assistants) where 500ms feels sluggish and 2s feels broken. A single slow request in a large batch drags down the experience for every concurrent user.
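A toy simulation makes the tail effect concrete: every request in a static batch finishes when the slowest one does, so the tail of the per-request distribution becomes everyone's latency. The log-normal distribution below is invented for illustration, not measured from any service.
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-request generation times in ms; log-normal mimics variable output lengths
solo = rng.lognormal(mean=6.0, sigma=0.6, size=100_000)

def static_batch_latency(times, batch_size):
    # Each request's observed latency is the max over its batch
    usable = times[: len(times) // batch_size * batch_size]
    batches = usable.reshape(-1, batch_size)
    return np.repeat(batches.max(axis=1), batch_size)

for b in (1, 8, 32):
    lat = static_batch_latency(solo, b)
    print(f"batch {b:>2}: P50 {np.percentile(lat, 50):5.0f} ms   P99 {np.percentile(lat, 99):5.0f} ms")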
3. Traffic Patterns Are Bursty, Not Uniform
Benchmarking scripts send requests at a steady rate. Real traffic arrives in bursts — a Slack notification triggers 200 simultaneous requests, then silence for 30 seconds. If you've sized your batch for steady-state throughput, you'll either:
- Queue the burst (latency spikes)
- Split the burst into multiple smaller batches (complex scheduling, same problem)
- Drop requests (bad)
Adaptive batching helps, but it adds significant engineering complexity. Most teams implement it once, get it wrong, and fall back to the simpler scheme of dynamic batch sizes with timeout thresholds — which, under bursty traffic, degenerates into "don't batch" anyway.
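For reference, the "dynamic batch size with timeout threshold" pattern is roughly the following: collect requests until a size cap or a deadline is hit, whichever comes first. A minimal asyncio sketch; run_batch is a hypothetical placeholder for whatever inference call your stack exposes.
import asyncio

MAX_BATCH = 8        # flush when this many requests have accumulated...
MAX_WAIT_MS = 10     # ...or when the oldest request has waited this long

async def batching_loop(queue: asyncio.Queue, run_batch):
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]            # block until at least one request arrives
        deadline = loop.time() + MAX_WAIT_MS / 1000
        while len(batch) < MAX_BATCH:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        await run_batch(batch)                 # placeholder for the actual inference call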
4. VRAM Fragmentation Kills Effective Utilization
When you process variable-length sequences in a batch, you need to pad to the longest sequence. That padding wastes VRAM. More subtly, the allocation patterns fragment VRAM over time, causing allocation failures even when total memory usage looks healthy.
The CUDA out-of-memory error in LLM inference is frequently not "you're using too much memory" but "your memory is too fragmented to allocate a contiguous block for this sequence." Larger batches amplify fragmentation because they create more variable-sized allocation pressure.
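If you suspect fragmentation rather than genuine exhaustion, the gap between what PyTorch's caching allocator has reserved and what is actually live in tensors is a useful first signal (this assumes a PyTorch-based serving stack; other runtimes expose their own counters):
import torch

def report_vram(device=0):
    allocated = torch.cuda.memory_allocated(device)  # bytes currently live in tensors
    reserved = torch.cuda.memory_reserved(device)    # bytes held by the caching allocator
    print(f"allocated:       {allocated / 1e9:.2f} GB")
    print(f"reserved:        {reserved / 1e9:.2f} GB")
    print(f"held but unused: {(reserved - allocated) / 1e9:.2f} GB")
    # Detailed segment/block statistics from the allocator
    print(torch.cuda.memory_summary(device))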
What Actually Works
Approach 1: Continuous Batching (Iteration-Level Scheduling)
Instead of assembling a fixed batch and running it to completion, continuous batching schedules at the iteration level: new requests join the running batch at the next decode step, and finished sequences free their slot immediately. This is what vLLM implements by default. It dramatically improves GPU utilization under variable-length, bursty traffic.
# Pseudocode for the continuous batching loop
while running:
    # Admit waiting requests into the active batch while there is room
    for seq in waiting_queue:
        if fits_in_current_batch(seq):
            add_to_batch(seq)

    # Run one decode iteration across all active sequences
    run_generation_step(active_batch)

    # Emit and evict sequences that have finished, freeing their slots
    for seq in active_batch:
        if seq.is_done():
            output(seq)
            free(seq)
The tradeoff: you need to handle variable-length sequences gracefully, and context switching overhead increases. But for most production workloads, continuous batching with a max batch size of 8-16 outperforms static batching at any size.
Approach 2: Prefix Caching
If you have repeated system prompts or few-shot examples across requests, prefix caching lets you cache the KV states for the shared prefix. The decode phase only recomputes for the unique per-request suffix.
This is essentially free performance — you're not trading off latency for throughput, you're improving both. The constraint is that your serving infrastructure needs to support KV cache lookup by prompt hash.
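Conceptually the lookup is just "hash the shared prefix, reuse its KV states if they're already cached." A toy sketch; compute_kv and decode_with_kv are hypothetical stand-ins for whatever your engine exposes:
import hashlib

kv_cache = {}  # prefix hash -> precomputed KV states

def generate(shared_prefix, user_suffix, compute_kv, decode_with_kv):
    key = hashlib.sha256(shared_prefix.encode()).hexdigest()
    if key not in kv_cache:
        # Prefill the shared prefix once; every later request skips this work
        kv_cache[key] = compute_kv(shared_prefix)
    # Only the per-request suffix needs fresh prefill before decoding
    return decode_with_kv(kv_cache[key], user_suffix)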
Approach 3: Speculative Decoding (Carefully)
Speculative decoding uses a small draft model to propose multiple tokens ahead, then verifies them in parallel with the main model. The promise: get 2-4x throughput with no quality degradation.
The catch: it only works when the draft model's acceptance rate is high enough. For highly predictable token sequences (code completion, structured output), acceptance rates exceed 80% and the speedup is real. For open-ended generation, the overhead of running the draft model often exceeds the parallelization benefit.
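A simplified cost model makes the breakeven visible. Here alpha is the per-token acceptance rate, gamma the number of draft tokens proposed per verification step, and c the draft model's cost relative to one target-model forward pass; all values are illustrative assumptions.
def expected_speedup(alpha, gamma, c):
    # Expected tokens produced per verification step: the accepted draft tokens
    # plus the one token the target model always contributes (geometric series)
    tokens_per_step = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    # Cost of one step: gamma draft forward passes plus one target verification pass
    cost_per_step = gamma * c + 1
    return tokens_per_step / cost_per_step

for alpha in (0.5, 0.7, 0.9):
    print(f"acceptance {alpha:.0%}: ~{expected_speedup(alpha, gamma=4, c=0.05):.1f}x")
At high acceptance rates this lands in the 2-4x range quoted above; at lower rates the gain shrinks quickly, and that is before accounting for the scheduling complexity and memory the draft model adds.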
Don't implement speculative decoding as a general optimization. Implement it specifically for the high-acceptance-rate subset of your traffic.
The Benchmark That Changed My Mind
Here's the test I run now before accepting any "increase your batch size" advice:
# Simulate realistic bursty traffic: fire 100 requests at nearly the same instant
rm -f latencies.txt
for i in {1..100}; do
  # Variable prompt lengths between 1024 and ~3071 characters
  seq_len=$((1024 + RANDOM % 2048))
  prompt=$(head -c "$seq_len" /dev/zero | tr '\0' 'a')
  curl -s -o /dev/null -w '%{time_total}\n' \
    -X POST http://localhost:8000/generate \
    -H 'Content-Type: application/json' \
    -d "{\"prompt\": \"$prompt\"}" >> latencies.txt &
done
wait
# Measure P99, not average (curl reports seconds; convert to ms)
python3 -c "
import numpy as np
latencies = np.loadtxt('latencies.txt') * 1000
print(f'P50: {np.percentile(latencies, 50):.0f}ms')
print(f'P99: {np.percentile(latencies, 99):.0f}ms')
"
Run this at batch sizes of 1, 4, 8, 16, 32, 64. Plot P99 latency on one axis, batch size on the other. The curve almost always has a minimum somewhere between 4 and 16 for real-world traffic patterns.
The minimum exists because of the tradeoff between:
- GPU utilization (improves with larger batches)
- Queueing delay (worsens with larger batches for P99)
- Memory bandwidth saturation (hits at different points depending on sequence lengths)
When Batching Actually Wins
To be clear: batching isn't always wrong. It wins when:
- Offline batch processing — You don't care about P99 latency. Summarizing 10,000 documents overnight? Batch aggressively.
- Homogeneous sequences — All prompts are the same length, no padding waste.
- High-throughput, latency-tolerant — Video captioning, embedding generation, bulk classification.
- Traffic is uniform — No burstiness, steady request rate.
The mistake is applying batch-inference optimizations to interactive inference workloads. They have fundamentally different optimization targets.
The Bottom Line
The default recommendation to "use larger batches" is cargo-culted from training. Training workloads are batch-friendly by design: uniform sequence lengths, no P99 constraints, offline processing.
Production inference is none of those things. Your users send variable-length prompts at irregular intervals and expect responses in under a second. The optimizations that work for training actively hurt your P99 latency.
Start with batch size 1 or continuous batching with small max sizes. Only increase batch size when you've measured P99 latency and found headroom. The bottleneck is almost never GPU utilization — it's scheduling and memory bandwidth contention.
If you're running LLM inference in production and seeing latency spikes, the first thing to try isn't more GPU — it's smaller batches.