Heuristic:Run llama Llama index Embedding Batch Size Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, RAG, Embedding_Finetuning |
| Last Updated | 2026-02-11 19:00 GMT |
Overview
Embedding batch size configuration for balancing throughput, API rate limits, and memory usage when generating embeddings in LlamaIndex.
Description
LlamaIndex batches text inputs before sending them to embedding models. The `embed_batch_size` parameter (default: 10) controls how many texts are sent per API call or processed per batch locally. The `num_workers` parameter (default: None, meaning sequential) controls async concurrency. These parameters significantly affect indexing speed, API cost, and memory usage.
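The grouping behaviour described above can be illustrated with a minimal standard-library sketch. It is not LlamaIndex's own implementation: `iter_batches` is a hypothetical helper, and only the default of 10 is taken from this page.

```python
from typing import Iterator, List


def iter_batches(texts: List[str], embed_batch_size: int = 10) -> Iterator[List[str]]:
    """Yield consecutive slices holding at most embed_batch_size texts.

    Hypothetical helper for illustration; not part of LlamaIndex.
    """
    if embed_batch_size <= 0:
        raise ValueError("embed_batch_size must be positive")
    for start in range(0, len(texts), embed_batch_size):
        yield texts[start:start + embed_batch_size]


# 25 texts with the default batch size of 10 produce three batches,
# the last one only partially filled.
texts = [f"chunk {i}" for i in range(25)]
batches = list(iter_batches(texts, embed_batch_size=10))
batch_sizes = [len(batch) for batch in batches]
```

Each yielded batch would correspond to one API call (or one local forward pass), so the number of batches is the number of requests.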
Usage
Apply this heuristic when:
- Building a VectorStoreIndex from large document collections
- Running the IngestionPipeline with embedding transformations
- Observing slow indexing times or API rate limit errors
The Insight (Rule of Thumb)
- Action: Set `embed_batch_size` on the embedding model based on provider limits.
- Value: Default is 10. OpenAI supports up to 2048. Local models depend on GPU memory.
- Hard Cap: Maximum allowed is 2048 (enforced by validation: `le=2048`).
- Async Workers: Set `num_workers` to run async embedding batches in parallel. When jobs go through `run_jobs()`, the default concurrency is 4 workers.
- Trade-off: Larger batches mean fewer API calls but higher per-request latency and memory use. Batches that are too large may time out or trigger rate limits.
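The rule of thumb above can be sketched as a small selection helper plus the request-count arithmetic behind the trade-off. Both function names and the example provider limit of 96 are assumptions made for illustration; only the default of 10 and the cap of 2048 come from this page.

```python
HARD_CAP = 2048                 # upper bound enforced by validation (le=2048)
DEFAULT_EMBED_BATCH_SIZE = 10   # library default


def choose_embed_batch_size(provider_limit: int,
                            preferred: int = DEFAULT_EMBED_BATCH_SIZE) -> int:
    """Clamp a preferred batch size to [1, min(provider_limit, HARD_CAP)].

    Hypothetical helper; provider_limit is whatever the provider documents.
    """
    if provider_limit < 1:
        raise ValueError("provider_limit must be at least 1")
    upper = min(provider_limit, HARD_CAP)
    return max(1, min(preferred, upper))


def requests_needed(num_texts: int, batch_size: int) -> int:
    """Number of embedding requests, i.e. ceil(num_texts / batch_size)."""
    return -(-num_texts // batch_size)


# Request counts for 10,000 texts at three batch sizes.
calls_default = requests_needed(10_000, 10)
calls_medium = requests_needed(10_000, 100)
calls_max = requests_needed(10_000, HARD_CAP)

# A made-up provider limit of 96 wins over a larger preference.
clamped = choose_embed_batch_size(provider_limit=96, preferred=512)
```

Moving from 10 to 100 texts per batch cuts the request count by a factor of ten; whether that is a net win still depends on per-request latency, memory, and the provider's rate limits.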
Reasoning
The conservative default of 10 was chosen because:
- API Safety: OpenAI's embedding API accepts up to 2048 texts per request, but sending large batches increases the risk of timeouts and rate limiting. Starting conservative avoids unexpected failures.
- Memory: For local embedding models, each batch must fit in GPU memory. A batch of 10 short texts is safe for most GPU configurations.
- Async Pattern: When `num_workers` is set, LlamaIndex uses a semaphore-based pattern to limit concurrent API calls so that the embedding provider is not overwhelmed.
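The semaphore-based pattern mentioned above can be sketched with `asyncio` alone. This is an illustration of the general technique, not LlamaIndex's code: `fake_embed_batch` and `embed_all` are hypothetical names, and the "embedding" is a dummy one-element vector.

```python
import asyncio
from typing import List, Tuple


async def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    """Stand-in for a provider call; returns one dummy vector per text."""
    await asyncio.sleep(0.01)  # simulate network latency
    return [[float(len(text))] for text in batch]


async def embed_all(batches: List[List[str]],
                    num_workers: int = 4) -> Tuple[List[List[float]], int]:
    """Embed all batches with at most num_workers requests in flight.

    Returns the flattened embeddings and the peak observed concurrency.
    """
    semaphore = asyncio.Semaphore(num_workers)
    in_flight = 0
    peak_in_flight = 0

    async def worker(batch: List[str]) -> List[List[float]]:
        nonlocal in_flight, peak_in_flight
        async with semaphore:  # blocks once num_workers calls are running
            in_flight += 1
            peak_in_flight = max(peak_in_flight, in_flight)
            try:
                return await fake_embed_batch(batch)
            finally:
                in_flight -= 1

    # gather() returns results in submission order, so output order is stable.
    results = await asyncio.gather(*(worker(b) for b in batches))
    embeddings = [vec for batch_vecs in results for vec in batch_vecs]
    return embeddings, peak_in_flight


batches = [[f"text {i}-{j}" for j in range(3)] for i in range(10)]
embeddings, peak = asyncio.run(embed_all(batches, num_workers=4))
```

All ten batches are scheduled at once, but the semaphore keeps the number of simultaneous calls at or below `num_workers`, which is what protects the provider from a burst of requests.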
Code evidence from `base/embeddings/base.py:81-93`:
    embed_batch_size: int = Field(
        default=DEFAULT_EMBED_BATCH_SIZE,
        description="The batch size for embedding calls.",
        gt=0,
        le=2048,
    )
    num_workers: Optional[int] = Field(
        default=None,
        description="The number of workers to use for async embedding calls.",
    )
Default from `constants.py:8`:
    DEFAULT_EMBED_BATCH_SIZE = 10