Heuristic:Run llama Llama index Worker Count Configuration

From Leeroopedia
Knowledge Sources
Domains: Optimization, Infrastructure
Last Updated: 2026-02-11 19:00 GMT

Overview

Worker count configuration for parallel ingestion pipeline execution and async job concurrency, with CPU-aware capping and safe subprocess spawning.

Description

LlamaIndex uses two distinct parallelism mechanisms: multiprocessing (for synchronous `IngestionPipeline.run()`) and asyncio semaphores (for async operations via `run_jobs()`). The ingestion pipeline caps `num_workers` at the system CPU count and uses the `spawn` multiprocessing context for safety. The async `run_jobs()` utility defaults to 4 concurrent workers via semaphore.

Usage

Apply this heuristic when:

  • Parallelizing the processing of large document collections through the `IngestionPipeline`
  • Tuning async concurrency for batch evaluation or embedding generation
  • Running on systems with limited CPU cores

The Insight (Rule of Thumb)

  • Action (Ingestion): Set `num_workers` in `IngestionPipeline.run(num_workers=N)` for multiprocessing parallelism.
  • Value: `N` should not exceed the CPU count. If it does, LlamaIndex emits a warning and caps it at the CPU count.
  • Action (Async): Async concurrency defaults to 4 workers. Override it via `num_workers` on embedding models or `workers` on `BatchEvalRunner`.
  • Batch Eval Workers: Default is 2 (more conservative than the general async default of 4).
  • Trade-off: More workers generally means faster processing, at the cost of higher CPU/memory usage and a greater risk of API rate limiting.

Reasoning

CPU Capping: The ingestion pipeline explicitly checks `multiprocessing.cpu_count()`. When `num_workers` exceeds it, the pipeline warns and lowers `num_workers` to the CPU count. This prevents oversubscription, where more worker processes than cores compete for CPU time and the resulting context-switching overhead can slow processing down.
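As a rough illustration of the capping rule described above, the following sketch reproduces the check in isolation. The helper name `cap_workers` is hypothetical and not part of LlamaIndex; only `multiprocessing.cpu_count()` and `warnings.warn()` are standard-library calls.

```python
import multiprocessing
import warnings


def cap_workers(requested: int) -> int:
    """Return `requested`, lowered to the CPU count if it is larger.

    Hypothetical helper: it mirrors the check in the ingestion pipeline
    but is not a LlamaIndex function.
    """
    num_cpus = multiprocessing.cpu_count()
    if requested > num_cpus:
        warnings.warn(
            f"Requested {requested} workers but only {num_cpus} CPUs are "
            "available; capping to the CPU count."
        )
        return num_cpus
    return requested


if __name__ == "__main__":
    # Emits a warning and prints the CPU count on any machine with
    # fewer than 10,000 cores.
    print(cap_workers(10_000))
```

Calling a helper like this before building a pool keeps the requested value and the effective value explicit in application code, rather than relying on the library's warning alone.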

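Spawn Context: The code uses `multiprocessing.get_context("spawn")` rather than relying on the platform default start method (historically `fork` on Linux). This matters because `fork` is unsafe in multithreaded processes (common alongside async Python code): a child forked while another thread holds a lock can deadlock. macOS is particularly affected, which is why Python itself has defaulted to `spawn` there since 3.8.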
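A minimal sketch of requesting the `spawn` start method explicitly, assuming a picklable top-level worker function. `spawn_map` and `square` are hypothetical names used only for illustration; `multiprocessing.get_context` and `Pool.map` are standard-library calls.

```python
import multiprocessing
from typing import Callable, Iterable, List


def square(x: int) -> int:
    # Defined at module top level so spawned children can import it.
    return x * x


def spawn_map(
    func: Callable[[int], int], items: Iterable[int], num_workers: int
) -> List[int]:
    """Map `func` over `items` in a pool created from the spawn context."""
    # Request spawn explicitly instead of relying on the platform default.
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(num_workers) as pool:
        return pool.map(func, list(items))
```

Under `spawn`, each child starts a fresh interpreter and re-imports the main module, so a call such as `spawn_map(square, range(8), num_workers=2)` belongs inside an `if __name__ == "__main__":` guard when run as a script.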

Conservative Eval Workers: `BatchEvalRunner` defaults to only 2 workers because evaluation involves LLM API calls with rate limits. Each worker makes independent API requests, so too many concurrent workers can trigger rate limiting.

Code evidence from `ingestion/pipeline.py:542-551`:

if num_workers and num_workers > 1:
    num_cpus = multiprocessing.cpu_count()
    if num_workers > num_cpus:
        warnings.warn(
            "Specified num_workers exceed number of CPUs in the system. "
            "Setting `num_workers` down to the maximum CPU count."
        )
        num_workers = num_cpus

    with multiprocessing.get_context("spawn").Pool(num_workers) as p:
        ...  # pool body omitted in this excerpt

Async default from `async_utils.py:132`:

DEFAULT_NUM_WORKERS = 4

Semaphore pattern from `async_utils.py:158`:

semaphore = asyncio.Semaphore(workers)
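
The semaphore pattern can be sketched in isolation as follows. This is not the LlamaIndex implementation of `run_jobs()`; `run_jobs_sketch` and `demo_peak_concurrency` are hypothetical names, and the fake job stands in for an API call. The demo records the peak number of jobs running at once to show that the semaphore bounds it.

```python
import asyncio
from typing import Any, Coroutine, List, Tuple

DEFAULT_NUM_WORKERS = 4  # same default value as quoted above


async def run_jobs_sketch(
    jobs: List[Coroutine[Any, Any, Any]], workers: int = DEFAULT_NUM_WORKERS
) -> List[Any]:
    """Await all jobs, allowing at most `workers` to run concurrently."""
    semaphore = asyncio.Semaphore(workers)

    async def guarded(job: Coroutine[Any, Any, Any]) -> Any:
        # Each job waits here until one of the `workers` slots is free.
        async with semaphore:
            return await job

    # gather() returns results in the same order as the input jobs.
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo_peak_concurrency(workers: int) -> Tuple[List[int], int]:
    """Run ten fake jobs and report the results and the peak concurrency."""
    active = 0
    peak = 0

    async def fake_call(i: int) -> int:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stand-in for a network request
        active -= 1
        return i * 2

    results = await run_jobs_sketch(
        [fake_call(i) for i in range(10)], workers=workers
    )
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(demo_peak_concurrency(workers=3))
    print(results, peak)
```

Because the limit is enforced by a semaphore rather than by processes, raising `workers` here adds no CPU parallelism; it only allows more I/O-bound calls to be in flight at once, which is where API rate limits become the constraint.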

Batch eval default from `evaluation/batch_runner.py:90`:

workers: int = 2,
