Heuristic:Run llama Llama index Embedding Batch Size Tuning

From Leeroopedia
Knowledge Sources
Domains: Optimization, RAG, Embedding_Finetuning
Last Updated: 2026-02-11 19:00 GMT

Overview

This heuristic covers how to configure the embedding batch size in LlamaIndex to balance throughput, API rate limits, and memory usage when generating embeddings.

Description

LlamaIndex batches text inputs before sending them to embedding models. The `embed_batch_size` parameter (default: 10) controls how many texts are sent per API call, or processed per batch when the model runs locally. The `num_workers` parameter (default: None, meaning sequential) controls async concurrency. Together, these parameters significantly affect indexing speed, the number of API requests, and memory usage.
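As a rough illustration of what batching means here, the sketch below splits a list of texts into groups of at most `embed_batch_size` items. The helper `iter_batches` and the sample texts are hypothetical names written for this page; this is not LlamaIndex's internal code, only a minimal model of the behavior described above.

```python
from typing import Iterator, List

DEFAULT_EMBED_BATCH_SIZE = 10  # mirrors the default value cited on this page


def iter_batches(
    texts: List[str], embed_batch_size: int = DEFAULT_EMBED_BATCH_SIZE
) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each with at most `embed_batch_size` items.

    Hypothetical helper: each yielded slice stands for one embedding call.
    """
    if embed_batch_size <= 0:
        raise ValueError("embed_batch_size must be positive")
    for start in range(0, len(texts), embed_batch_size):
        yield texts[start:start + embed_batch_size]


# 25 sample texts at the default batch size give two full batches and one partial batch.
texts = [f"chunk {i}" for i in range(25)]
batches = list(iter_batches(texts))
batch_sizes = [len(batch) for batch in batches]
```

With the default of 10, the 25 sample texts translate into three calls; raising the batch size to 25 or more would collapse them into one.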

Usage

Apply this heuristic when:

  • Building a VectorStoreIndex from large document collections
  • Running the IngestionPipeline with embedding transformations
  • Observing slow indexing times or API rate limit errors

The Insight (Rule of Thumb)

  • Action: Set `embed_batch_size` on the embedding model based on provider limits.
  • Value: Default is 10. OpenAI supports up to 2048. Local models depend on GPU memory.
  • Hard Cap: Maximum allowed is 2048 (enforced by validation: `le=2048`).
  • Async Workers: Set `num_workers` to run embedding batches in parallel on async code paths. `num_workers` itself defaults to None; the `run_jobs()` helper uses 4 workers when no worker count is given.
  • Trade-off: Larger batches mean fewer API calls but higher per-request latency and memory use. Overly large batches may trigger timeouts or rate limits.
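The trade-off can be made concrete with simple arithmetic: the number of requests is the number of texts divided by the batch size, rounded up. The sketch below is an assumption-laden illustration, not library code. `effective_batch_size`, `request_count`, and the `provider_limit` argument are hypothetical names, and the corpus size of 10,000 texts is an arbitrary example value.

```python
import math

HARD_CAP = 2048  # upper bound from the `le=2048` validation cited on this page


def effective_batch_size(requested: int, provider_limit: int) -> int:
    """Clamp a requested batch size to the provider limit and the hard cap.

    Hypothetical helper for planning only.
    """
    if requested <= 0 or provider_limit <= 0:
        raise ValueError("batch size and provider limit must be positive")
    return min(requested, provider_limit, HARD_CAP)


def request_count(num_texts: int, batch_size: int) -> int:
    """Number of embedding calls needed for `num_texts` texts."""
    return math.ceil(num_texts / batch_size)


num_texts = 10_000  # arbitrary example corpus size
calls_default = request_count(num_texts, 10)
calls_large = request_count(num_texts, effective_batch_size(5000, provider_limit=2048))
```

Here the default batch size needs 1000 requests, while a batch size clamped to 2048 needs 5. The clamp exists only to keep the arithmetic in range: per the validation rule above, an out-of-range `embed_batch_size` is rejected rather than silently reduced.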

Reasoning

The conservative default of 10 was chosen because:

API Safety: OpenAI's embedding API accepts up to 2048 texts per request, but sending large batches increases the risk of timeouts and rate limiting. Starting with a conservative value avoids unexpected failures.

Memory: For local embedding models, each batch must fit in GPU memory. A batch of 10 short texts is safe for most GPU configurations.

Async Pattern: When `num_workers` is set, LlamaIndex uses a semaphore-based pattern to limit concurrent API calls, preventing overwhelming the embedding provider.
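The semaphore-based pattern can be sketched with plain `asyncio`. This is a minimal illustration of the general technique, not LlamaIndex's actual implementation: `fake_embed_batch` is a hypothetical stand-in for a real embedding call, and `embed_with_limit` is a hypothetical helper that also records the peak number of calls in flight so the limit can be observed.

```python
import asyncio
from typing import List, Tuple

Embedding = List[float]


async def fake_embed_batch(batch: List[str]) -> List[Embedding]:
    """Hypothetical stand-in for one embedding API call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return [[float(len(text))] for text in batch]


async def embed_with_limit(
    batches: List[List[str]], num_workers: int
) -> Tuple[List[List[Embedding]], int]:
    """Embed all batches, allowing at most `num_workers` calls in flight at once."""
    semaphore = asyncio.Semaphore(num_workers)
    in_flight = 0
    peak = 0

    async def run_one(batch: List[str]) -> List[Embedding]:
        nonlocal in_flight, peak
        async with semaphore:  # waits here once `num_workers` calls are active
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_embed_batch(batch)
            finally:
                in_flight -= 1

    # gather() returns results in the same order as the input batches.
    results = await asyncio.gather(*(run_one(batch) for batch in batches))
    return list(results), peak


batches = [[f"text {i}-{j}" for j in range(3)] for i in range(8)]
results, peak = asyncio.run(embed_with_limit(batches, num_workers=4))
```

All eight batches are scheduled up front, but the semaphore keeps the number of simultaneous calls at or below `num_workers`, which is what protects the provider from bursts.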

Code evidence from `base/embeddings/base.py:81-93`:

embed_batch_size: int = Field(
    default=DEFAULT_EMBED_BATCH_SIZE,
    description="The batch size for embedding calls.",
    gt=0,
    le=2048,
)
num_workers: Optional[int] = Field(
    default=None,
    description="The number of workers to use for async embedding calls.",
)

Default from `constants.py:8`:

DEFAULT_EMBED_BATCH_SIZE = 10
