Heuristic: Marker-Inc-Korea AutoRAG Batch Size Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, RAG, Rate_Limiting |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
Batch size selection rules for AutoRAG modules, varying from 3 (OpenAI LLM) to 128 (local embeddings) depending on whether the target is a rate-limited API or local inference.
Description
AutoRAG uses different batch sizes across its modules based on whether the operation is API-bound or compute-bound. API-based LLM calls use small batches (3-16) to avoid rate limiting. Local model operations (embedding, reranking) use large batches (64-128) to maximize GPU throughput. The default batch size of 16 for OpenAI generators has been found to cause rate limit errors at lower API tiers, with the documentation recommending batch sizes of 3 or fewer for OpenAI models.
Usage
Apply this heuristic when configuring batch sizes in AutoRAG YAML pipeline configurations. If you encounter rate limit errors from LLM APIs, reduce the batch size. If local model inference is slow, increase the batch size as far as GPU memory (VRAM) allows.
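As a sketch, a generator node in an AutoRAG YAML config can pin the batch explicitly. The module/field names below (`node_type`, `module_type: openai_llm`, `llm`, `batch`) follow the pattern of AutoRAG configs, but the exact schema and model name are assumptions to verify against the current AutoRAG docs:

```yaml
node_lines:
  - node_line_name: post_retrieve_node_line
    nodes:
      - node_type: generator
        modules:
          - module_type: openai_llm
            llm: gpt-4o-mini      # hypothetical model choice
            batch: 3              # at or below 3 to avoid OpenAI rate limits
```

The key point is overriding the library default (16) down to 3 for OpenAI-backed generators; local embedding or reranker modules would instead keep or raise their larger defaults.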
The Insight (Rule of Thumb)
- Action: Set `batch` parameter differently based on the module type.
- Values:
- OpenAI LLM generation: `batch=3` or lower (documented recommendation)
- Other API-based LLMs: `batch=16` (default)
- Local rerankers (FlagEmbedding): `batch=64` (default)
- Local embeddings (vLLM, sentence-transformers): `batch=128` (default)
- Passage augmenter embeddings: `batch=128` (default)
- Evaluation metrics (deepeval): `batch=16` (default)
- Trade-off: Lower batch sizes avoid rate limits but increase total execution time. Higher batch sizes improve throughput but may exceed API rate limits or GPU memory.
Reasoning
The batch size defaults are tuned empirically based on production experience:
API-bound operations: OpenAI and other LLM APIs enforce rate limits (tokens per minute, requests per minute). The AutoRAG documentation explicitly states that batch sizes above 3 cause rate limit errors on standard OpenAI tiers. The default of 16 works for higher-tier accounts but is too aggressive for most users.
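The defensive pattern for API-bound calls is small batches plus backoff-and-retry. A generic sketch (names like `run_in_batches` and the `RateLimitError` stand-in are hypothetical, not AutoRAG's internals):

```python
import time
from typing import Callable, Iterable, Sequence


class RateLimitError(Exception):
    """Stand-in for the error an LLM API raises when limits are exceeded."""


def chunked(items: Sequence, size: int) -> Iterable[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def run_in_batches(prompts: Sequence[str],
                   call_api: Callable[[list], list],
                   batch: int = 3,
                   max_retries: int = 5) -> list:
    """Send prompts in small batches; back off and retry on rate limits."""
    results = []
    for group in chunked(prompts, batch):
        for attempt in range(max_retries):
            try:
                results.extend(call_api(list(group)))
                break
            except RateLimitError:
                time.sleep(2 ** attempt)  # exponential backoff
        else:
            raise RuntimeError("rate limit retries exhausted")
    return results
```

With `batch=3`, a run of 100 prompts becomes 34 sequential API calls; slower in wall-clock time, but each call stays under typical requests-per-minute and tokens-per-minute caps.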
Compute-bound operations: Local model inference (rerankers, embeddings) benefits from batching because GPU operations are parallelized. Batch sizes of 64-128 are typical for models that fit in VRAM. The FlagEmbedding reranker defaults to 64 (smaller models), while vLLM embeddings default to 128 (larger sequences batched efficiently).
Evaluation operations: Metric computation (e.g., deepeval faithfulness) calls the LLM API, so it uses the same 16-batch default as generators.
Code Evidence
OpenAI generator default batch from `autorag/nodes/generator/openai_llm.py:72`:

```python
def __init__(self, project_dir, llm: str, batch: int = 16, *args, **kwargs):
```

FlagEmbedding reranker batch from `autorag/nodes/passagereranker/flag_embedding.py:50`:

```python
batch = kwargs.pop("batch", 64)
```

vLLM embedding batch from `autorag/embedding/vllm.py:61`:

```python
embed_batch_size: int = 128,
```

Troubleshooting documentation recommendation from `docs/source/troubleshooting.md:137-141`:

> We recommend setting batch under 3 when you are using openai model.
> In our experiment, it occurred rate limit error when the batch size was 4.