Heuristic: BentoML Worker Count Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Infrastructure |
| Last Updated | 2026-02-13 16:00 GMT |
Overview
Strategy for determining API server and runner worker counts based on CPU cores and GPU count, including the auto-scaling heuristic of `math.ceil(cpu_count)` for API workers.
Description
BentoML automatically determines worker counts through two mechanisms: (1) API server workers default to `math.ceil(CpuResource.from_system())` when not explicitly configured (config value `workers: ~`), and (2) runner workers are calculated by `DefaultStrategy.get_worker_count()` based on GPU count or CPU count multiplied by `workers_per_resource`. The CPU detection is cgroup-aware, reading from `/sys/fs/cgroup/cpu/cpu.cfs_quota_us` (cgroup v1) or `/sys/fs/cgroup/cpu.max` (cgroup v2) on Linux, enabling correct behavior inside containers with CPU limits.
Usage
Use this heuristic when sizing your BentoML deployment. Apply when choosing worker counts for production serving, especially in containerized environments where cgroup limits should be respected. Critical when deploying services that need to balance CPU-bound API processing against GPU-bound model inference.
The Insight (Rule of Thumb)
- API server workers: Default is `math.ceil(cpu_count())`. Override with `api_server.workers` in v1 config or `services.workers` in v2 config.
- GPU runner workers: `math.ceil(len(nvidia_gpus) * workers_per_resource)`. Default `workers_per_resource=1` means one worker per GPU.
- CPU runner workers (multi-threading): Returns `workers_per_resource` value directly (typically 1). The single worker uses all available CPU threads.
- CPU runner workers (no multi-threading): `math.ceil(cpus) * workers_per_resource`. One worker per CPU core.
- Container-aware: CPU detection reads cgroup v1/v2 quotas. If the container has a 4-CPU limit, `from_system()` returns 4.0, not the host CPU count.
- Trade-off: More API workers handle more concurrent HTTP connections but use more memory. GPU runners are typically 1-per-GPU unless using fractional GPU allocation.
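The override knobs above can be pinned in `bentoml_configuration.yaml`. A minimal v1-schema sketch (values are illustrative):

```yaml
# bentoml_configuration.yaml (v1 schema) — illustrative values
api_server:
  workers: ~   # null → auto: math.ceil(CpuResource.from_system())
  # workers: 4 # or pin an explicit count
```

Leaving `workers` null keeps the cgroup-aware auto-scaling behavior; pinning it trades adaptability for a predictable memory footprint.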
Reasoning
The worker count strategy is designed to maximize resource utilization without over-provisioning. For GPU-bound models, one worker per GPU ensures exclusive GPU access. For CPU-bound models that support multi-threading (like scikit-learn, XGBoost), a single worker with all threads is more efficient than multiple single-threaded workers. The cgroup-awareness is critical for Kubernetes deployments where CPU limits are enforced through cgroups.
API worker calculation from `containers.py:287-290`:
```python
api_server_workers = providers.Factory[int](
    lambda workers: workers or math.ceil(CpuResource.from_system()),
    api_server_config.workers,
)
```
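Note the `or` in the factory: any falsy configured value (`None` or `0`) falls back to the CPU-derived default. A standalone sketch of that logic, with `os.cpu_count()` standing in for BentoML's cgroup-aware `CpuResource.from_system()`:

```python
import math
import os

def api_server_workers(configured=None):
    # Mirrors the Factory above: a falsy config value (None or 0)
    # falls back to ceil of the detected CPU count.
    # os.cpu_count() is a stand-in here; the real code uses the
    # cgroup-aware CpuResource.from_system().
    return configured or math.ceil(os.cpu_count())
```

So `api_server_workers(8)` honors the explicit setting, while `api_server_workers(None)` auto-sizes from the CPU count.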
Runner worker strategy from `strategy.py:64-98`:
```python
@classmethod
def get_worker_count(cls, runnable_class, resource_request, workers_per_resource):
    # use nvidia gpu
    nvidia_gpus = get_resource(resource_request, "nvidia.com/gpu")
    if (
        nvidia_gpus is not None
        and len(nvidia_gpus) > 0
        and "nvidia.com/gpu" in runnable_class.SUPPORTED_RESOURCES
    ):
        return math.ceil(len(nvidia_gpus) * workers_per_resource)
    # use CPU
    cpus = get_resource(resource_request, "cpu")
    if cpus is not None and cpus > 0:
        if runnable_class.SUPPORTS_CPU_MULTI_THREADING:
            return workers_per_resource  # typically 1
        return math.ceil(cpus) * workers_per_resource
```
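To make the branching concrete, here is a self-contained re-implementation of the same decision tree with worked cases. It simplifies the excerpt: the `SUPPORTED_RESOURCES` check is folded into the GPU list being non-empty, and the fallback when no resources match is a placeholder, since that branch is not shown above.

```python
import math

def worker_count(nvidia_gpus, cpus, supports_cpu_multithreading, workers_per_resource=1):
    """Standalone sketch of the DefaultStrategy.get_worker_count() logic."""
    if nvidia_gpus:
        # GPU-bound: one worker per GPU (scaled by workers_per_resource)
        return math.ceil(len(nvidia_gpus) * workers_per_resource)
    if cpus is not None and cpus > 0:
        if supports_cpu_multithreading:
            # One worker that uses all available CPU threads
            return workers_per_resource
        # Single-threaded runnable: one worker per CPU core
        return math.ceil(cpus) * workers_per_resource
    # The excerpt above stops here; the real fallback behavior is not shown.
    raise RuntimeError("no CPU or GPU resources in the request")

worker_count(["0", "1"], None, False)  # 2 GPUs → 2 workers
worker_count(None, 4.0, True)          # multi-threaded → 1 worker
worker_count(None, 3.5, False)         # ceil(3.5) → 4 workers
```

Note how a fractional CPU quota (common under Kubernetes limits) still rounds up, so a 3.5-CPU limit yields 4 single-threaded workers.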
Cgroup-aware CPU detection from `resource.py:124-174`:
```python
def query_cgroup_cpu_count() -> float:
    cgroup_root = "/sys/fs/cgroup/"
    cfs_quota_us_file = os.path.join(cgroup_root, "cpu", "cpu.cfs_quota_us")
    cfs_period_us_file = os.path.join(cgroup_root, "cpu", "cpu.cfs_period_us")
    cpu_max_file = os.path.join(cgroup_root, "cpu.max")
    # ... reads quota from cgroup v1 or v2
    return float(min(limit_count, os_cpu_count))
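A minimal sketch of the cgroup v2 branch elided by the `# ...` comment, assuming the standard `cpu.max` format of `"<quota> <period>"` (or `"max <period>"` when unlimited); it skips the final min-with-host-count step shown above:

```python
import os

def cgroup_v2_cpu_limit(path="/sys/fs/cgroup/cpu.max"):
    """Parse the cgroup v2 CPU quota, or return None when no limit applies."""
    if not os.path.exists(path):
        return None  # not on cgroup v2 (or not on Linux)
    with open(path) as f:
        quota, period = f.read().split()
    if quota == "max":
        return None  # unlimited; caller falls back to os.cpu_count()
    # e.g. "200000 100000" → 2.0 CPUs
    return float(quota) / float(period)
```

This is why a container limited to 4 CPUs reports 4.0 regardless of how many cores the host has: the quota/period ratio, not the hardware, drives the worker count.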