Heuristic:Run llama Llama index Worker Count Configuration

Knowledge Sources	LlamaIndex Core Multiprocessing analysis
Domains	Optimization, Infrastructure
Last Updated	2026-02-11 19:00 GMT

Overview

Worker count configuration for parallel ingestion pipeline execution and async job concurrency, with CPU-aware capping and safe subprocess spawning.

Description

LlamaIndex uses two distinct parallelism mechanisms: multiprocessing (for synchronous `IngestionPipeline.run()`) and asyncio semaphores (for async operations via `run_jobs()`). The ingestion pipeline caps `num_workers` at the system CPU count and uses the `spawn` multiprocessing context for safety. The async `run_jobs()` utility defaults to 4 concurrent workers via semaphore.

Usage

Apply this heuristic when:

Parallelizing the processing of large document collections through `IngestionPipeline`
Tuning async concurrency for batch evaluation or embedding generation
Running on systems with limited CPU cores

The Insight (Rule of Thumb)

Action (Ingestion): Set `num_workers` in `IngestionPipeline.run(num_workers=N)` for multiprocessing parallelism.
Value: `num_workers` should not exceed the CPU count. If it does, LlamaIndex emits a warning and lowers it to the CPU count.
Action (Async): Default async concurrency is 4 workers. Override via `num_workers` on embedding models or `workers` on `BatchEvalRunner`.
Batch Eval Workers: Default is 2 (more conservative than general async default of 4).
Trade-off: More workers can speed up processing, but they increase CPU and memory usage and, for API-backed steps, the risk of hitting rate limits.

Reasoning

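CPU Capping: The ingestion pipeline explicitly checks `multiprocessing.cpu_count()`, warns when `num_workers` exceeds it, and lowers `num_workers` to the CPU count. This prevents oversubscription, where more worker processes than cores compete for CPU time and the resulting context-switching overhead slows processing down rather than speeding it up.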
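
The capping rule can be illustrated in isolation. The following is a minimal sketch, not LlamaIndex code: `cap_num_workers` is a hypothetical helper name, and the warning text is made up for the example. It only reproduces the compare-warn-lower behavior described above, using the standard library.

```python
import multiprocessing
import warnings
from typing import Optional


def cap_num_workers(requested: Optional[int]) -> Optional[int]:
    """Return `requested`, lowered to the CPU count if it exceeds it.

    Hypothetical helper for illustration; not part of the LlamaIndex API.
    """
    if not requested or requested <= 1:
        # None, 0, or 1 means no multiprocessing, so there is nothing to cap.
        return requested

    num_cpus = multiprocessing.cpu_count()
    if requested > num_cpus:
        warnings.warn(
            f"Requested {requested} workers but only {num_cpus} CPUs are "
            "available; capping to the CPU count."
        )
        return num_cpus
    return requested
```

A caller could compute `n = cap_num_workers(16)` up front to know how many processes will actually be used, then pass `n` as `num_workers`. The pipeline applies its own cap regardless, so a helper like this is only useful for logging or capacity planning.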

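Spawn Context: The code uses `multiprocessing.get_context("spawn")` rather than relying on the platform default start method, which has historically been `fork` on Linux. This matters because `fork` copies only the calling thread into the child process, so a lock held by any other thread at fork time stays locked in the child. In multithreaded programs (common alongside async Python code) this can cause deadlocks, and forking is considered unsafe enough on macOS that `spawn` has been the default there since Python 3.8. Requesting `spawn` explicitly gives the same behavior on every platform, at the cost of slower worker startup and the requirement that everything sent to workers be picklable.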
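
To make the start-method choice concrete, here is a small standard-library sketch of a pool created from an explicit `spawn` context. The function names `spawn_context` and `sqrt_all` are hypothetical and chosen for the example; only `multiprocessing.get_context("spawn")` and `Pool` correspond to what the pipeline is described as using.

```python
import math
import multiprocessing
from typing import List, Sequence


def spawn_context():
    """Return an explicit "spawn" context instead of the platform default."""
    return multiprocessing.get_context("spawn")


def sqrt_all(values: Sequence[float], num_workers: int = 2) -> List[float]:
    """Compute square roots in a pool of freshly spawned worker processes."""
    # math.sqrt is importable by name, so it can be pickled and sent to
    # workers that start from a clean interpreter rather than a forked copy.
    with spawn_context().Pool(num_workers) as pool:
        return pool.map(math.sqrt, values)
```

When calling something like `sqrt_all([1.0, 4.0, 9.0])` from a script, place the call under an `if __name__ == "__main__":` guard. With `spawn`, each worker re-imports the main module, and unguarded top-level code would run again in every worker.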

Conservative Eval Workers: `BatchEvalRunner` defaults to only 2 workers because evaluation involves LLM API calls with rate limits. Each worker makes independent API requests, so too many concurrent workers can trigger rate limiting.

Code evidence from `ingestion/pipeline.py:542-551`:

if num_workers and num_workers > 1:
    num_cpus = multiprocessing.cpu_count()
    if num_workers > num_cpus:
        warnings.warn(
            "Specified num_workers exceed number of CPUs in the system. "
            "Setting `num_workers` down to the maximum CPU count."
        )
        num_workers = num_cpus

    with multiprocessing.get_context("spawn").Pool(num_workers) as p:

Async default from `async_utils.py:132`:

DEFAULT_NUM_WORKERS = 4

Semaphore pattern from `async_utils.py:158`:

semaphore = asyncio.Semaphore(workers)
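
The semaphore pattern above can be sketched as a self-contained example. This is an illustration of the general technique, not the body of `run_jobs()`: `run_limited` and `ConcurrencyProbe` are hypothetical names, and the probe exists only to make the concurrency limit observable.

```python
import asyncio
from typing import Any, Coroutine, List

DEFAULT_NUM_WORKERS = 4  # same default value as quoted above


async def run_limited(
    jobs: List[Coroutine[Any, Any, Any]],
    workers: int = DEFAULT_NUM_WORKERS,
) -> List[Any]:
    """Await all jobs, allowing at most `workers` to run at the same time."""
    semaphore = asyncio.Semaphore(workers)

    async def guarded(job: Coroutine[Any, Any, Any]) -> Any:
        # A job only starts executing once it holds a semaphore slot.
        async with semaphore:
            return await job

    # gather() returns results in the same order as the input jobs.
    return await asyncio.gather(*(guarded(job) for job in jobs))


class ConcurrencyProbe:
    """Records the peak number of jobs that were running simultaneously."""

    def __init__(self) -> None:
        self.active = 0
        self.peak = 0

    async def job(self, x: int) -> int:
        self.active += 1
        self.peak = max(self.peak, self.active)
        await asyncio.sleep(0.01)  # stands in for an API call
        self.active -= 1
        return x * x
```

Running `asyncio.run(run_limited([probe.job(i) for i in range(10)], workers=4))` with a `ConcurrencyProbe` instance named `probe` keeps `probe.peak` at or below 4. Lowering `workers` trades throughput for fewer simultaneous requests, which is the lever that matters when an upstream API enforces rate limits.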

Batch eval default from `evaluation/batch_runner.py:90`:

workers: int = 2,
