Heuristic:FlagOpen FlagEmbedding Dynamic Batch Size Reduction
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Inference |
| Last Updated | 2026-02-09 21:00 GMT |
Overview
An automatic batch size reduction strategy: when a trial batch raises a CUDA out-of-memory (OOM) error, inference retries with 75% of the previous batch size.
Description
All FlagEmbedding inference classes (embedders and rerankers) implement a defensive batch sizing strategy. Before the main encoding loop begins, the system runs a trial forward pass on the first batch at the requested batch size. If the pass raises a `RuntimeError` (which includes `torch.cuda.OutOfMemoryError`, a subclass of `RuntimeError`), the batch size is reduced to 75% of its current value, rounded down (`batch_size * 3 // 4`), and the trial is retried. The loop repeats until a trial batch succeeds. The resulting batch size is then used for all subsequent batches in the encoding loop.
Usage
This heuristic is automatically applied during all inference calls. It activates when the initial batch size causes an OOM error, which is common when:
- Large models (decoder-only, 7B+ parameters) are run with large batch sizes
- Input sequences are long and consume more GPU memory than expected
- The GPU is partially occupied by other processes
No user configuration is needed; the mechanism is built into the inference pipeline.
The Insight (Rule of Thumb)
- Action: On OOM, reduce batch size to `batch_size * 3 // 4` (75% of current) and retry.
- Value: Each reduction step removes 25% of the batch. Multiple reductions compound (e.g., 256 -> 192 -> 144 -> 108).
- Trade-off: Smaller batch sizes increase total inference time due to more iterations, but prevent OOM crashes.
- Scope: Applied in all 8 inference classes: encoder-only base embedder, decoder-only base embedder, ICL embedder, M3 embedder, encoder-only reranker, decoder-only base/layerwise/lightweight rerankers.
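The compounding and the time trade-off described above can be checked with a few lines of arithmetic. This is a minimal sketch: `reduction_schedule` and `num_iterations` are hypothetical helper names introduced for illustration, and the 10,000-input workload is an assumed example, not a figure from FlagEmbedding.

```python
import math

def reduction_schedule(batch_size: int, steps: int) -> list[int]:
    """Return the batch sizes visited over `steps` successive reductions."""
    sizes = [batch_size]
    for _ in range(steps):
        batch_size = batch_size * 3 // 4  # same integer arithmetic as the heuristic
        sizes.append(batch_size)
    return sizes

def num_iterations(num_inputs: int, batch_size: int) -> int:
    """Number of forward passes needed to cover all inputs."""
    return math.ceil(num_inputs / batch_size)

print(reduction_schedule(256, 3))   # [256, 192, 144, 108]
print(num_iterations(10_000, 256))  # 40
print(num_iterations(10_000, 108))  # 93
```

Because `//` rounds down, the schedule keeps shrinking for small values as well: starting from 4 it visits 3, 2, 1 and then 0, so the expression alone does not enforce a minimum batch size of 1.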
Reasoning
GPU memory consumption during inference depends heavily on sequence length, which varies across batches. A static batch size chosen for short sequences may cause OOM on a batch of long sequences. The 75% reduction factor is a practical balance: aggressive enough to free meaningful memory, but conservative enough to avoid shrinking the batch far more than necessary. Inputs are sorted by descending length before the trial, so the first batch contains the longest sequences and is the most memory-intensive one; a batch size that succeeds there should also fit the later, shorter batches. The trial-and-error approach on the first batch therefore adapts the final batch size to the actual memory constraints of the specific hardware and input distribution.
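The code excerpt on this page sorts inputs by descending length before batching. The sketch below illustrates, under assumed token counts, why that ordering both reduces padding and makes the first batch the heaviest one. `padded_tokens` is a hypothetical helper and the lengths are made-up values, not measurements.

```python
def padded_tokens(lengths, batch_size):
    """Total token slots when each batch is padded to its longest member."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total

lengths = [12, 480, 35, 500, 20, 64, 510, 8]  # hypothetical token counts
by_length = sorted(lengths, reverse=True)      # longest first, as in the excerpt

print(padded_tokens(lengths, batch_size=4))    # 4040 (original order)
print(padded_tokens(by_length, batch_size=4))  # 2180 (sorted order)
print(by_length[:4])                           # [510, 500, 480, 64]
```

In the sorted order the first batch is padded to 510 tokens and the second to only 35, so a trial on the first batch probes the largest padded tensor the run will see.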
The pattern is applied consistently across all inference implementations:
    # From FlagEmbedding/inference/embedder/encoder_only/base.py:226-242
    # sort by length for less padding
    length_sorted_idx = np.argsort([-len(x['input_ids']) for x in all_inputs])
    all_inputs_sorted = [all_inputs[i] for i in length_sorted_idx]

    # adjust batch size
    flag = False
    while flag is False:
        try:
            inputs_batch = self.tokenizer.pad(
                all_inputs_sorted[: batch_size],
                padding=True,
                return_tensors='pt',
                **kwargs
            ).to(device)
            last_hidden_state = self.model(**inputs_batch, return_dict=True).last_hidden_state
            embeddings = self.pooling(last_hidden_state, inputs_batch['attention_mask'])
            flag = True
        except RuntimeError as e:
            batch_size = batch_size * 3 // 4
        except torch.cuda.OutOfMemoryError as e:
            batch_size = batch_size * 3 // 4
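
In the excerpt, the first `except` clause matches every `RuntimeError`, so failures unrelated to memory also shrink the batch, and the loop has no lower bound. The following is a sketch of one possible hardened variant, not FlagEmbedding's implementation. It reacts only to an OOM-type exception and stops shrinking at a floor. All names (`FakeOOM`, `probe_batch_size`, `encode_all`, `make_fake_model`, `min_batch_size`) are hypothetical, and `FakeOOM` stands in for `torch.cuda.OutOfMemoryError` so the example runs without a GPU.

```python
class FakeOOM(RuntimeError):
    """Stand-in for torch.cuda.OutOfMemoryError so the sketch runs without a GPU."""

def probe_batch_size(run_batch, inputs, batch_size, min_batch_size=1):
    """Shrink batch_size by 3/4 until run_batch succeeds on the first slice.

    run_batch is a hypothetical callable that raises FakeOOM when the slice
    does not fit in memory. The error is re-raised if min_batch_size also fails.
    """
    while True:
        try:
            run_batch(inputs[:batch_size])
            return batch_size
        except FakeOOM:
            if batch_size <= min_batch_size:
                raise  # nothing left to shrink; surface the error
            batch_size = max(min_batch_size, batch_size * 3 // 4)

def encode_all(run_batch, inputs, batch_size):
    """Probe once, then reuse the surviving batch size for every batch."""
    batch_size = probe_batch_size(run_batch, inputs, batch_size)
    outputs = []
    for start in range(0, len(inputs), batch_size):
        outputs.extend(run_batch(inputs[start:start + batch_size]))
    return batch_size, outputs

def make_fake_model(capacity):
    """Build a run_batch that 'runs out of memory' above `capacity` items."""
    def run_batch(batch):
        if len(batch) > capacity:
            raise FakeOOM(f"batch of {len(batch)} exceeds capacity {capacity}")
        return [len(text) for text in batch]
    return run_batch

texts = ["x" * n for n in range(300, 0, -1)]  # longest first
used, outputs = encode_all(make_fake_model(capacity=150), texts, batch_size=256)
print(used, len(outputs))  # 144 300
```

With an assumed capacity of 150 items, the probe tries 256 and 192 before settling on 144, matching the schedule implied by `batch_size * 3 // 4`. In real code the `except` clause would name `torch.cuda.OutOfMemoryError` instead of the stand-in.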