Heuristic:Infiniflow Ragflow Embedding Batch Size Constraint
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Optimization |
| Last Updated | 2026-02-12 06:00 GMT |
Overview
RAGFlow sends embedding requests in batches of 16 texts for every provider, a cap attributed to OpenAI API limits, and truncates each input to a provider- or model-specific maximum ranging from 500 to 30,000 tokens.
Description
RAGFlow processes embeddings in batches of 16 texts at a time, a universal constraint driven by the OpenAI API's batch size limit. The same batch size is used for every embedding provider (OpenAI, DashScope, LocalAI, Cohere, HuggingFace, etc.), whether or not the provider supports larger batches. Each provider or model also has its own maximum input length, and longer inputs are truncated to it: 8,191 tokens for OpenAI, 2,048 for DashScope, 8,000 for BAAI/bge-m3, 500 for BAAI/bge-small-en-v1.5, and 30,000 for Qwen/Qwen3-Embedding-0.6B. Separately, the `EMBEDDING_BATCH_SIZE` setting (default 16) controls how many chunks are sent to the embedding model per API call during document ingestion.
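The batching and truncation described above can be sketched roughly as follows. This is a minimal illustration, not RAGFlow's implementation: `encode_in_batches`, `truncate_words`, and the `embed_fn` callback are hypothetical names, and the whitespace-based truncation is only a stand-in for a real tokenizer. The limits in `MAX_TOKENS` are the ones quoted in this section.

```python
from typing import Callable, Dict, List

BATCH_SIZE = 16  # fixed batch size described in this section

# Per-model truncation limits quoted in this section.
MAX_TOKENS: Dict[str, int] = {
    "Qwen/Qwen3-Embedding-0.6B": 30000,
    "BAAI/bge-m3": 8000,
    "BAAI/bge-small-en-v1.5": 500,
}


def truncate_words(text: str, max_tokens: int) -> str:
    """Stand-in truncation (assumption): keep the first max_tokens
    whitespace-separated words. A real tokenizer counts model tokens."""
    return " ".join(text.split()[:max_tokens])


def encode_in_batches(
    texts: List[str],
    model_name: str,
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = BATCH_SIZE,
) -> List[List[float]]:
    """Truncate each text, then call embed_fn once per batch of at most
    batch_size texts. embed_fn is a hypothetical provider call."""
    limit = MAX_TOKENS[model_name]
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = [truncate_words(t, limit) for t in texts[start:start + batch_size]]
        vectors.extend(embed_fn(batch))
    return vectors
```

With 40 input texts, this sketch would issue three calls to `embed_fn`, carrying 16, 16, and 8 texts respectively.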
Usage
Use this heuristic when tuning document-processing throughput. A batch size of 16 is a safe default; increasing it may cause API errors with some providers. Per-provider truncation keeps over-length inputs from being rejected by the provider, at the cost of dropping any text beyond the limit.
The Insight (Rule of Thumb)
- Action: Keep `EMBEDDING_BATCH_SIZE=16` (default). Do not exceed this for OpenAI-compatible APIs.
- Value: `batch_size=16` for API calls; truncation limits vary by provider or model: 8,191 (OpenAI), 2,048 (DashScope), 8,000 (BGE-M3), 500 (BGE-small), 30,000 (Qwen3).
- Trade-off: Smaller batch sizes increase API call count but reduce memory usage and risk of timeout. Larger batches are faster but may hit provider limits.
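To make the trade-off concrete, the number of API calls for a document is the chunk count divided by the batch size, rounded up. The helper below is a hypothetical illustration (`api_calls_needed` is not a RAGFlow function), and the batch size of 64 appears only for comparison; as noted above, values over 16 may be rejected by some providers.

```python
import math


def api_calls_needed(num_chunks: int, batch_size: int = 16) -> int:
    """Number of embedding requests needed to cover num_chunks chunks."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return math.ceil(num_chunks / batch_size)


# Compare call counts for a 1,000-chunk document.
for size in (4, 16, 64):
    print(size, api_calls_needed(1000, size))
# prints:
# 4 250
# 16 63
# 64 16
```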
Reasoning
RAGFlow treats 16 as the batch size limit of the OpenAI embedding API. Since RAGFlow supports switching between many embedding providers, using the most restrictive limit as the universal default ensures compatibility. The token truncation limits are provider-specific and set in code: built-in models look them up in the `MAX_TOKENS` dictionary, while the OpenAI path truncates to 8,191 tokens inline. For local models (via LocalAI), token counting may not work correctly, so RAGFlow falls back to reporting 1024 tokens as a conservative estimate.
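The token-count fallback mentioned above might look something like the sketch below. The response shape (a `usage.total_tokens` field, modeled on OpenAI-style JSON) and the `total_tokens` helper are assumptions made for illustration; RAGFlow's actual code may read usage differently.

```python
from typing import Any, Mapping

FALLBACK_TOKEN_COUNT = 1024  # conservative estimate mentioned in this section


def total_tokens(
    response: Mapping[str, Any], default: int = FALLBACK_TOKEN_COUNT
) -> int:
    """Return the token usage reported by the provider, or the default
    when the field is missing or malformed (assumed response shape)."""
    try:
        return int(response["usage"]["total_tokens"])
    except (KeyError, TypeError, ValueError):
        return default
```

A response without usage data, such as `{}` or `{"usage": None}`, yields 1024 here, while `{"usage": {"total_tokens": 37}}` yields 37.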
Code Evidence from `rag/llm/embedding_model.py:55-74`:
    class BuiltinEmbed(Base):
        MAX_TOKENS = {
            "Qwen/Qwen3-Embedding-0.6B": 30000,
            "BAAI/bge-m3": 8000,
            "BAAI/bge-small-en-v1.5": 500
        }

        def encode(self, texts: list):
            batch_size = 16
Global batch size setting from `common/settings.py:123`:
    EMBEDDING_BATCH_SIZE: int = 16
OpenAI truncation from `rag/llm/embedding_model.py:101-103`:
    batch_size = 16
    texts = [truncate(t, 8191) for t in texts]