Heuristic: BentoML Worker Count Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Infrastructure |
| Last Updated | 2026-02-13 16:00 GMT |
Overview
Strategy for determining API server and runner worker counts based on CPU cores and GPU count, including the auto-scaling heuristic of `math.ceil(cpu_count)` for API workers.
Description
BentoML automatically determines worker counts through two mechanisms: (1) API server workers default to `math.ceil(CpuResource.from_system())` when not explicitly configured (config value `workers: ~`), and (2) runner workers are calculated by `DefaultStrategy.get_worker_count()` based on GPU count or CPU count multiplied by `workers_per_resource`. The CPU detection is cgroup-aware, reading from `/sys/fs/cgroup/cpu/cpu.cfs_quota_us` (cgroup v1) or `/sys/fs/cgroup/cpu.max` (cgroup v2) on Linux, enabling correct behavior inside containers with CPU limits.
Usage
Use this heuristic when sizing your BentoML deployment. Apply when choosing worker counts for production serving, especially in containerized environments where cgroup limits should be respected. Critical when deploying services that need to balance CPU-bound API processing against GPU-bound model inference.
The Insight (Rule of Thumb)
- API server workers: Default is `math.ceil(cpu_count())`. Override with `api_server.workers` in v1 config or `services.workers` in v2 config.
- GPU runner workers: `math.ceil(len(nvidia_gpus) * workers_per_resource)`. Default `workers_per_resource=1` means one worker per GPU.
- CPU runner workers (multi-threading): Returns `workers_per_resource` value directly (typically 1). The single worker uses all available CPU threads.
- CPU runner workers (no multi-threading): `math.ceil(cpus) * workers_per_resource`. One worker per CPU core.
- Container-aware: CPU detection reads cgroup v1/v2 quotas. If the container has a 4-CPU limit, `from_system()` returns 4.0, not the host CPU count.
- Trade-off: More API workers handle more concurrent HTTP connections but use more memory. GPU runners are typically 1-per-GPU unless using fractional GPU allocation.
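The override knobs above can be pinned in `bentoml_configuration.yaml`. A minimal v1-schema sketch (values are illustrative):

```yaml
# bentoml_configuration.yaml (v1 schema) — illustrative values
api_server:
  workers: ~   # null → auto: math.ceil(CpuResource.from_system())
  # workers: 4 # or pin an explicit count
```

Leaving `workers` null keeps the cgroup-aware auto-scaling behavior; pinning it trades adaptability for a predictable memory footprint.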
Reasoning
The worker count strategy is designed to maximize resource utilization without over-provisioning. For GPU-bound models, one worker per GPU ensures exclusive GPU access. For CPU-bound models that support multi-threading (like scikit-learn, XGBoost), a single worker with all threads is more efficient than multiple single-threaded workers. The cgroup-awareness is critical for Kubernetes deployments where CPU limits are enforced through cgroups.
API worker calculation from `containers.py:287-290`:
```python
api_server_workers = providers.Factory[int](
    lambda workers: workers or math.ceil(CpuResource.from_system()),
    api_server_config.workers,
)
```
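Note the `or` in the factory: any falsy configured value (`None` or `0`) falls back to the CPU-derived default. A standalone sketch of that logic, with `os.cpu_count()` standing in for BentoML's cgroup-aware `CpuResource.from_system()`:

```python
import math
import os

def api_server_workers(configured=None):
    # Mirrors the Factory above: a falsy config value (None or 0)
    # falls back to ceil of the detected CPU count.
    # os.cpu_count() is a stand-in here; the real code uses the
    # cgroup-aware CpuResource.from_system().
    return configured or math.ceil(os.cpu_count())
```

So `api_server_workers(8)` honors the explicit setting, while `api_server_workers(None)` auto-sizes from the CPU count.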
Runner worker strategy from `strategy.py:64-98`:
```python
@classmethod
def get_worker_count(cls, runnable_class, resource_request, workers_per_resource):
    # use nvidia gpu
    nvidia_gpus = get_resource(resource_request, "nvidia.com/gpu")
    if (
        nvidia_gpus is not None
        and len(nvidia_gpus) > 0
        and "nvidia.com/gpu" in runnable_class.SUPPORTED_RESOURCES
    ):
        return math.ceil(len(nvidia_gpus) * workers_per_resource)
    # use CPU
    cpus = get_resource(resource_request, "cpu")
    if cpus is not None and cpus > 0:
        if runnable_class.SUPPORTS_CPU_MULTI_THREADING:
            return workers_per_resource  # typically 1
        return math.ceil(cpus) * workers_per_resource
```
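To make the branching concrete, here is a self-contained re-implementation of the same decision tree with worked cases. It simplifies the excerpt: the `SUPPORTED_RESOURCES` check is folded into the GPU list being non-empty, and the fallback when no resources match is a placeholder, since that branch is not shown above.

```python
import math

def worker_count(nvidia_gpus, cpus, supports_cpu_multithreading, workers_per_resource=1):
    """Standalone sketch of the DefaultStrategy.get_worker_count() logic."""
    if nvidia_gpus:
        # GPU-bound: one worker per GPU (scaled by workers_per_resource)
        return math.ceil(len(nvidia_gpus) * workers_per_resource)
    if cpus is not None and cpus > 0:
        if supports_cpu_multithreading:
            # One worker that uses all available CPU threads
            return workers_per_resource
        # Single-threaded runnable: one worker per CPU core
        return math.ceil(cpus) * workers_per_resource
    # The excerpt above stops here; the real fallback behavior is not shown.
    raise RuntimeError("no CPU or GPU resources in the request")

worker_count(["0", "1"], None, False)  # 2 GPUs → 2 workers
worker_count(None, 4.0, True)          # multi-threaded → 1 worker
worker_count(None, 3.5, False)         # ceil(3.5) → 4 workers
```

Note how a fractional CPU quota (common under Kubernetes limits) still rounds up, so a 3.5-CPU limit yields 4 single-threaded workers.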
Cgroup-aware CPU detection from `resource.py:124-174`:
```python
def query_cgroup_cpu_count() -> float:
    cgroup_root = "/sys/fs/cgroup/"
    cfs_quota_us_file = os.path.join(cgroup_root, "cpu", "cpu.cfs_quota_us")
    cfs_period_us_file = os.path.join(cgroup_root, "cpu", "cpu.cfs_period_us")
    cpu_max_file = os.path.join(cgroup_root, "cpu.max")
    # ... reads quota from cgroup v1 or v2
    return float(min(limit_count, os_cpu_count))
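A minimal sketch of the cgroup v2 branch elided by the `# ...` comment, assuming the standard `cpu.max` format of `"<quota> <period>"` (or `"max <period>"` when unlimited); it skips the final min-with-host-count step shown above:

```python
import os

def cgroup_v2_cpu_limit(path="/sys/fs/cgroup/cpu.max"):
    """Parse the cgroup v2 CPU quota, or return None when no limit applies."""
    if not os.path.exists(path):
        return None  # not on cgroup v2 (or not on Linux)
    with open(path) as f:
        quota, period = f.read().split()
    if quota == "max":
        return None  # unlimited; caller falls back to os.cpu_count()
    # e.g. "200000 100000" → 2.0 CPUs
    return float(quota) / float(period)
```

This is why a container limited to 4 CPUs reports 4.0 regardless of how many cores the host has: the quota/period ratio, not the hardware, drives the worker count.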