Heuristic: Marker-Inc-Korea AutoRAG Batch Size Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, RAG, Rate_Limiting |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Batch size selection rules for AutoRAG modules, varying from 3 (OpenAI LLM) to 128 (local embeddings) depending on whether the target is a rate-limited API or local inference.
Description
AutoRAG uses different batch sizes across its modules based on whether the operation is API-bound or compute-bound. API-based LLM calls use small batches (3-16) to avoid rate limiting. Local model operations (embedding, reranking) use large batches (64-128) to maximize GPU throughput. The default batch size of 16 for OpenAI generators has been found to cause rate limit errors at lower API tiers, with the documentation recommending batch sizes of 3 or fewer for OpenAI models.
Usage
Apply this heuristic when configuring batch sizes in AutoRAG YAML pipeline configurations. If you encounter rate limit errors from LLM APIs, reduce the batch size. If local model inference is slow, increase the batch size as far as GPU memory (VRAM) allows.
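As a sketch, a generator node in an AutoRAG YAML config can pin the batch explicitly. The module/field names below (`node_type`, `module_type: openai_llm`, `llm`, `batch`) follow the pattern of AutoRAG configs, but the exact schema and model name are assumptions to verify against the current AutoRAG docs:

```yaml
node_lines:
  - node_line_name: post_retrieve_node_line
    nodes:
      - node_type: generator
        modules:
          - module_type: openai_llm
            llm: gpt-4o-mini      # hypothetical model choice
            batch: 3              # at or below 3 to avoid OpenAI rate limits
```

The key point is overriding the library default (16) down to 3 for OpenAI-backed generators; local embedding or reranker modules would instead keep or raise their larger defaults.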
The Insight (Rule of Thumb)
- Action: Set `batch` parameter differently based on the module type.
- Values:
- OpenAI LLM generation: `batch=3` or lower (documented recommendation)
- Other API-based LLMs: `batch=16` (default)
- Local rerankers (FlagEmbedding): `batch=64` (default)
- Local embeddings (vLLM, sentence-transformers): `batch=128` (default)
- Passage augmenter embeddings: `batch=128` (default)
- Evaluation metrics (deepeval): `batch=16` (default)
- Trade-off: Lower batch sizes avoid rate limits but increase total execution time. Higher batch sizes improve throughput but may exceed API rate limits or GPU memory.
Reasoning
The batch size defaults are tuned empirically based on production experience:
API-bound operations: OpenAI and other LLM APIs enforce rate limits (tokens per minute, requests per minute). The AutoRAG documentation explicitly states that batch sizes above 3 cause rate limit errors on standard OpenAI tiers. The default of 16 works for higher-tier accounts but is too aggressive for most users.
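The defensive pattern for API-bound calls is small batches plus backoff-and-retry. A generic sketch (names like `run_in_batches` and the `RateLimitError` stand-in are hypothetical, not AutoRAG's internals):

```python
import time
from typing import Callable, Iterable, Sequence


class RateLimitError(Exception):
    """Stand-in for the error an LLM API raises when limits are exceeded."""


def chunked(items: Sequence, size: int) -> Iterable[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def run_in_batches(prompts: Sequence[str],
                   call_api: Callable[[list], list],
                   batch: int = 3,
                   max_retries: int = 5) -> list:
    """Send prompts in small batches; back off and retry on rate limits."""
    results = []
    for group in chunked(prompts, batch):
        for attempt in range(max_retries):
            try:
                results.extend(call_api(list(group)))
                break
            except RateLimitError:
                time.sleep(2 ** attempt)  # exponential backoff
        else:
            raise RuntimeError("rate limit retries exhausted")
    return results
```

With `batch=3`, a run of 100 prompts becomes 34 sequential API calls; slower in wall-clock time, but each call stays under typical requests-per-minute and tokens-per-minute caps.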
Compute-bound operations: Local model inference (rerankers, embeddings) benefits from batching because GPU operations are parallelized. Batch sizes of 64-128 are typical for models that fit in VRAM. The FlagEmbedding reranker defaults to 64 (smaller models), while vLLM embeddings default to 128 (larger sequences batched efficiently).
Evaluation operations: Metric computation (e.g., deepeval faithfulness) calls the LLM API, so it uses the same 16-batch default as generators.
Code Evidence
OpenAI generator default batch from `autorag/nodes/generator/openai_llm.py:72`:

```python
def __init__(self, project_dir, llm: str, batch: int = 16, *args, **kwargs):
```

FlagEmbedding reranker batch from `autorag/nodes/passagereranker/flag_embedding.py:50`:

```python
batch = kwargs.pop("batch", 64)
```

vLLM embedding batch from `autorag/embedding/vllm.py:61`:

```python
embed_batch_size: int = 128,
```

Troubleshooting documentation recommendation from `docs/source/troubleshooting.md:137-141`:

> We recommend setting batch under 3 when you are using openai model.
> In our experiment, it occurred rate limit error when the batch size was 4.