Heuristic:Bentoml BentoML Thread Env Vars Setting

Knowledge Sources	BentoML
Domains	Optimization, ML_Serving
Last Updated	2026-02-13 16:00 GMT

Overview

BentoML automatically sets 9 thread-control environment variables (OMP, MKL, OpenBLAS, etc.) for each CPU runner worker to prevent CPU over-subscription in multi-worker deployments. GPU workers do not receive these variables.

Description

When BentoML spawns runner workers, it automatically sets thread-limiting environment variables for common numerical computing libraries. For runners whose runnable class sets `SUPPORTS_CPU_MULTI_THREADING`, every variable is set to `math.ceil(cpus)`. For runners without that flag (single-threaded per worker), every variable is set to `"1"`. This prevents the common problem where each worker spawns its own full-size thread pool and over-subscribes the CPU: 8 workers each spawning 8 threads on an 8-core machine produce 64 threads competing for 8 cores.
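The arithmetic in the example above can be sketched in a few lines. The helper `oversubscription_ratio` is a hypothetical name used only for illustration; it is not part of BentoML.

```python
def oversubscription_ratio(workers: int, threads_per_worker: int, cores: int) -> float:
    """Return runnable threads per core; 1.0 means threads and cores match."""
    return (workers * threads_per_worker) / cores


# Library defaults: every worker sizes its thread pool to the whole machine.
print(oversubscription_ratio(workers=8, threads_per_worker=8, cores=8))  # 8.0

# One thread per worker: 8 threads on 8 cores.
print(oversubscription_ratio(workers=8, threads_per_worker=1, cores=8))  # 1.0
```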

Usage

Use this heuristic when debugging CPU utilization issues in BentoML deployments. Apply it when CPU usage is unexpectedly high, when performance degrades as you add workers, or when you need to understand why BentoML overrides your thread settings. It is also relevant when custom runner code reads the `BENTOML_NUM_THREAD` environment variable.
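As a minimal sketch of how custom runner code might consume `BENTOML_NUM_THREAD`, the helper below reads the variable and falls back to a default when it is unset or malformed. The function name `worker_thread_budget` and the fallback of 1 are assumptions for illustration, not BentoML APIs.

```python
import os


def worker_thread_budget(default: int = 1) -> int:
    """Read the per-worker thread budget from BENTOML_NUM_THREAD.

    Hypothetical helper: falls back to `default` when the variable is
    missing or is not a valid integer, and never returns less than 1.
    """
    raw = os.environ.get("BENTOML_NUM_THREAD")
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return max(1, value)
```

The returned value could then size a pool you create yourself, for example `concurrent.futures.ThreadPoolExecutor(max_workers=worker_thread_budget())`.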

The Insight (Rule of Thumb)

Action: BentoML auto-sets these 9 environment variables per worker:
- `BENTOML_NUM_THREAD` (custom runner code)
- `OMP_NUM_THREADS` (OpenMP)
- `OPENBLAS_NUM_THREADS` (OpenBLAS)
- `MKL_NUM_THREADS` (Intel MKL)
- `VECLIB_MAXIMUM_THREADS` (Apple Accelerate)
- `NUMEXPR_NUM_THREADS` (NumExpr)
- `RAYON_RS_NUM_CPUS` (HuggingFace fast tokenizer / Rust)
- `TF_NUM_INTEROP_THREADS` (TensorFlow inter-op)
- `TF_NUM_INTRAOP_THREADS` (TensorFlow intra-op)
Multi-threading runners: All set to `math.ceil(cpus)` (the full CPU allocation).
Non-multi-threading runners: All set to `"1"`.
GPU workers: Thread variables are NOT set for GPU workers; `CUDA_VISIBLE_DEVICES` is set instead.
Trade-off: Prevents CPU over-subscription, but caps parallelism within a single worker (non-multi-threading runners get one thread) and overrides thread settings you supply yourself.
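When debugging, it can help to print what a worker process actually sees. The snippet below is a diagnostic sketch, assuming you can run it inside the worker (for example from your runnable's initialization code). `thread_env_snapshot` is a hypothetical helper, and the variable names are copied from the list above.

```python
import os

# Names copied from the list above.
THREAD_ENVS = [
    "BENTOML_NUM_THREAD",
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
    "NUMEXPR_NUM_THREADS",
    "RAYON_RS_NUM_CPUS",
    "TF_NUM_INTEROP_THREADS",
    "TF_NUM_INTRAOP_THREADS",
]


def thread_env_snapshot(environ=None) -> dict:
    """Map each thread variable to its current value, or "<unset>"."""
    env = os.environ if environ is None else environ
    return {name: env.get(name, "<unset>") for name in THREAD_ENVS}


# Print one NAME=value line per variable for the current process.
for name, value in thread_env_snapshot().items():
    print(f"{name}={value}")
```

On a GPU worker, every entry would be expected to show `<unset>` unless something else in the environment set it, since the thread variables are only assigned for CPU workers.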

Reasoning

Most numerical computing libraries default to using all available CPU cores for their internal thread pools. When multiple BentoML runner workers run on the same machine, each would try to use all cores simultaneously, causing severe contention and degraded performance. By explicitly setting thread counts, BentoML ensures each worker only uses its fair share of CPU resources.

The choice to set these per-worker rather than globally allows different runners on the same machine to have different thread configurations based on their `SUPPORTS_CPU_MULTI_THREADING` flag.

Thread environment list from `strategy.py:47-59`:

THREAD_ENVS = [
    "BENTOML_NUM_THREAD",      # For custom Runner code
    "OMP_NUM_THREADS",          # openmp
    "OPENBLAS_NUM_THREADS",     # openblas
    "MKL_NUM_THREADS",          # mkl
    "VECLIB_MAXIMUM_THREADS",   # accelerate
    "NUMEXPR_NUM_THREADS",      # numexpr
    "RAYON_RS_NUM_CPUS",        # HuggingFace fast tokenizer
    "TF_NUM_INTEROP_THREADS",   # Tensorflow
    "TF_NUM_INTRAOP_THREADS",   # Tensorflow
]  # TODO(jiang): make it configurable?

Thread count assignment from `strategy.py:169-182`:

if runnable_class.SUPPORTS_CPU_MULTI_THREADING:
    thread_count = math.ceil(cpus)
    for thread_env in THREAD_ENVS:
        environ[thread_env] = str(thread_count)
else:
    for thread_env in THREAD_ENVS:
        environ[thread_env] = "1"
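To make the rounding behavior concrete, here is a self-contained sketch of the same branching logic. It is a simplified re-statement for illustration, not the code in `strategy.py`: the function name `thread_env_for_worker` and its parameters are hypothetical, and only two variable names are passed in to keep the output short.

```python
import math


def thread_env_for_worker(
    cpus: float, supports_multi_threading: bool, thread_envs: list[str]
) -> dict[str, str]:
    """Return the thread variables a CPU worker would receive.

    Fractional CPU allocations are rounded up, and values are strings
    because environment variables are always strings.
    """
    value = str(math.ceil(cpus)) if supports_multi_threading else "1"
    return {name: value for name in thread_envs}


names = ["OMP_NUM_THREADS", "MKL_NUM_THREADS"]

print(thread_env_for_worker(2.5, True, names))
# {'OMP_NUM_THREADS': '3', 'MKL_NUM_THREADS': '3'}

print(thread_env_for_worker(2.5, False, names))
# {'OMP_NUM_THREADS': '1', 'MKL_NUM_THREADS': '1'}
```

Note that rounding up means an allocation of 2.5 CPUs yields 3 threads, so a multi-threading worker may run slightly more threads than its fractional CPU share.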
