Heuristic:Run llama Llama index Worker Count Configuration
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Infrastructure |
| Last Updated | 2026-02-11 19:00 GMT |
Overview
Worker count configuration for parallel ingestion pipeline execution and async job concurrency, with CPU-aware capping and safe subprocess spawning.
Description
LlamaIndex uses two distinct parallelism mechanisms: multiprocessing (for synchronous `IngestionPipeline.run()`) and asyncio semaphores (for async operations via `run_jobs()`). The ingestion pipeline caps `num_workers` at the system CPU count and uses the `spawn` multiprocessing context for safety. The async `run_jobs()` utility defaults to 4 concurrent workers via semaphore.
Usage
Apply this heuristic when:
- Processing large document collections through `IngestionPipeline` and parallelizing the work across processes
- Tuning async concurrency for batch evaluation or embedding generation
- Running on systems with limited CPU cores
The Insight (Rule of Thumb)
- Action (Ingestion): Set `num_workers` in `IngestionPipeline.run(num_workers=N)` for multiprocessing parallelism.
- Value: Keep `num_workers` at or below the CPU count. If it is higher, LlamaIndex warns and lowers it to the CPU count.
- Action (Async): Default async concurrency is 4 workers. Override via `num_workers` on embedding models or `workers` on `BatchEvalRunner`.
- Batch Eval Workers: Default is 2 (more conservative than the general async default of 4).
- Trade-off: More workers = faster processing but higher CPU/memory usage and risk of API rate limiting.
Reasoning
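CPU Capping: The ingestion pipeline compares `num_workers` against `multiprocessing.cpu_count()`. When the request is higher, it emits a warning and lowers `num_workers` to the CPU count. This prevents oversubscription, where more worker processes than cores compete for CPU time and the extra context switching slows processing down rather than speeding it up.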
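The capping rule can be sketched in isolation with the standard library. This is a minimal illustration, not LlamaIndex's code: the function name `cap_num_workers` and its `available` parameter are hypothetical, with `available` added only so the behavior can be exercised without depending on the host's core count.

```python
import multiprocessing
import warnings
from typing import Optional


def cap_num_workers(requested: int, available: Optional[int] = None) -> int:
    """Return a worker count that never exceeds the CPU count.

    `available` is a hypothetical override for testing; when omitted,
    the CPU count comes from multiprocessing.cpu_count().
    """
    if requested < 1:
        raise ValueError("requested must be at least 1")
    num_cpus = available if available is not None else multiprocessing.cpu_count()
    if requested > num_cpus:
        # Warn rather than fail, then fall back to the CPU count.
        warnings.warn(
            f"Requested {requested} workers but only {num_cpus} CPUs are "
            "available; capping to the CPU count."
        )
        return num_cpus
    return requested
```

Warning and capping, rather than raising, keeps a configuration written for a large machine usable on a smaller one.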
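Spawn Context: The code uses `multiprocessing.get_context("spawn")` rather than relying on the platform's default start method, which has historically been `fork` on Linux. This matters because forking a process that already runs multiple threads (common in programs that mix asyncio with thread pools) copies only the calling thread, so a lock held by another thread can stay locked forever in the child and cause a deadlock. On macOS, `fork` can also crash the subprocess, which is why Python has defaulted to `spawn` there since 3.8.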
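A minimal standard-library sketch of requesting the `spawn` start method explicitly follows. The names `SPAWN_CTX` and `count_chars` are hypothetical, and the built-in `len` stands in for a real transformation. With `spawn`, a custom worker function would need to be defined at module top level so child processes can import it.

```python
import multiprocessing

# Ask for the "spawn" start method explicitly instead of relying on the
# platform default. This context is independent of the global setting.
SPAWN_CTX = multiprocessing.get_context("spawn")


def count_chars(texts, num_workers=2):
    """Map the built-in len over texts in a pool of spawned workers."""
    with SPAWN_CTX.Pool(num_workers) as pool:
        return pool.map(len, texts)


if __name__ == "__main__":
    # The guard matters with spawn: child processes re-import the main
    # module, and pool creation must not run again during that import.
    print(count_chars(["alpha", "be", "gamma"]))
```

Using a context object instead of `multiprocessing.set_start_method()` leaves the start method of the rest of the program unchanged.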
Conservative Eval Workers: `BatchEvalRunner` defaults to only 2 workers because evaluation involves LLM API calls with rate limits. Each worker makes independent API requests, so too many concurrent workers can trigger rate limiting.
Code evidence from `ingestion/pipeline.py:542-551`:
if num_workers and num_workers > 1:
    num_cpus = multiprocessing.cpu_count()
    if num_workers > num_cpus:
        warnings.warn(
            "Specified num_workers exceed number of CPUs in the system. "
            "Setting `num_workers` down to the maximum CPU count."
        )
        num_workers = num_cpus
    with multiprocessing.get_context("spawn").Pool(num_workers) as p:
Async default from `async_utils.py:132`:
DEFAULT_NUM_WORKERS = 4
Semaphore pattern from `async_utils.py:158`:
semaphore = asyncio.Semaphore(workers)
Batch eval default from `evaluation/batch_runner.py:90`:
workers: int = 2,
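The semaphore pattern cited from `async_utils.py` can be illustrated with a small standalone sketch. It is not the `run_jobs()` implementation: `run_bounded`, `ConcurrencyProbe`, and `demo` are hypothetical names, and `asyncio.sleep` stands in for an API call. The probe records how many jobs are in flight at once, to show that the semaphore holds concurrency at or below `workers`.

```python
import asyncio


async def run_bounded(coros, workers=4):
    """Await all coroutines with at most `workers` running at once."""
    semaphore = asyncio.Semaphore(workers)

    async def guarded(coro):
        # Each job waits for a free slot before it starts running.
        async with semaphore:
            return await coro

    # gather returns results in the same order as the input coroutines.
    return await asyncio.gather(*(guarded(c) for c in coros))


class ConcurrencyProbe:
    """Hypothetical helper that records the peak number of jobs in flight."""

    def __init__(self):
        self.active = 0
        self.peak = 0

    async def job(self, value):
        self.active += 1
        self.peak = max(self.peak, self.active)
        await asyncio.sleep(0.01)  # stands in for a rate-limited API call
        self.active -= 1
        return value * 2


def demo(num_jobs=10, workers=2):
    """Run num_jobs probe jobs and return (results, peak concurrency)."""
    probe = ConcurrencyProbe()
    jobs = [probe.job(i) for i in range(num_jobs)]
    results = asyncio.run(run_bounded(jobs, workers=workers))
    return results, probe.peak
```

Lowering `workers` in a sketch like this trades throughput for fewer simultaneous requests, the same trade-off behind the conservative batch evaluation default.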