
Heuristic:PyTorch Serve Batch Size Tuning

From Leeroopedia
Knowledge Sources
Domains Optimization, Inference
Last Updated 2026-02-13 00:00 GMT

Overview

Guidance on tuning `batch_size` and `max_batch_delay` to balance throughput against latency, including the special `batchSize: 1` requirement for the vLLM and GPT-Fast handlers.

Description

TorchServe's batching system aggregates incoming requests into batches before passing them to the handler. The `batch_size` and `max_batch_delay` parameters control this behavior. Larger batches improve GPU utilization and throughput but increase per-request latency. For LLM handlers, TorchServe-level batching must be set to 1: vLLM implements its own continuous batching internally (bounded by `max_num_seqs`), while GPT-Fast currently supports only single-request inference. The GPU worker count formula `(Number of GPUs) / (Number of Unique Models)` prevents GPU contention.

Usage

Apply this heuristic when configuring TorchServe for production and optimizing for either throughput or latency. It is critical for any deployment that uses batch inference, vLLM serving, or multiple models sharing GPUs.

The Insight (Rule of Thumb)

  • batch_size: Start with 8 for standard models. Increase for throughput; decrease for latency.
  • max_batch_delay: 50-100ms is typical. This is the maximum time TorchServe waits to fill a batch before processing.
  • LLM handlers (vLLM, GPT-Fast): Must use `batchSize: 1`. These engines handle internal batching via `max_num_seqs` (vLLM default: 256).
  • GPU worker count: `number_of_gpu = (Hardware GPUs) / (Unique Models)` to avoid contention.
  • Trade-off: Higher batch_size = higher throughput but higher p99 latency. Must stay within latency SLA.
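As a sketch of how these knobs are set in practice: the TorchServe management API accepts `batch_size` and `max_batch_delay` as query parameters when registering a model. The helper below only builds that registration URL (the `.mar` archive name is a placeholder, not from this page):

```python
from urllib.parse import urlencode


def register_model_url(host, mar_url, batch_size=8, max_batch_delay_ms=50):
    """Build a TorchServe management-API model registration URL.

    batch_size and max_batch_delay are query parameters of POST /models;
    the archive name passed in is a placeholder for illustration.
    """
    params = {
        "url": mar_url,
        "batch_size": batch_size,
        "max_batch_delay": max_batch_delay_ms,
        "initial_workers": 1,
    }
    return f"{host}/models?{urlencode(params)}"


# POSTing this URL registers the model with batching enabled
url = register_model_url("http://localhost:8081", "resnet-18.mar")
```

Starting from `batch_size=8` and `max_batch_delay=50` matches the rule of thumb above; both can then be adjusted per the throughput/latency trade-off.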

Reasoning

GPU inference is most efficient when the hardware is fully utilized. Small batches leave GPU cores idle, while large batches amortize kernel launch overhead and improve memory access patterns. However, batching introduces queuing delay: a request arriving just after a batch is dispatched must wait up to `max_batch_delay` milliseconds.
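The trade-off can be made concrete with a toy cost model (the overhead numbers below are illustrative assumptions, not measurements): a fixed per-batch dispatch cost is amortized as the batch grows, so throughput rises, while the worst-case wait before dispatch is the full `max_batch_delay` window plus the batch's compute time.

```python
def batch_latency_ms(batch_size, fixed_overhead_ms=5.0, per_item_ms=2.0):
    # Toy model: one fixed kernel-launch/dispatch cost per batch,
    # plus a linear per-item compute cost.
    return fixed_overhead_ms + per_item_ms * batch_size


def throughput_rps(batch_size, **kw):
    # Requests completed per second when batches run back to back.
    return 1000.0 * batch_size / batch_latency_ms(batch_size, **kw)


def worst_case_latency_ms(batch_size, max_batch_delay_ms=50.0, **kw):
    # A request arriving just after dispatch waits out the full delay
    # window, then rides through an entire batch.
    return max_batch_delay_ms + batch_latency_ms(batch_size, **kw)
```

Under these assumed costs, batch size 8 yields roughly 381 requests/s versus roughly 143 at batch size 1, while worst-case latency grows from 57 ms to 71 ms: the throughput-vs-p99 trade-off in miniature.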

For LLM engines like vLLM, the engine implements its own continuous batching with PagedAttention, so passing multiple requests in a single TorchServe batch would conflict with vLLM's internal scheduling; GPT-Fast, by contrast, simply does not support batched inference yet. The assertions `len(requests) == 1` in both `vllm_handler.py` and GPT-Fast's handler enforce this separation of concerns.

The GPU worker formula ensures that if you have 4 GPUs and 2 models, each model gets 2 GPU workers, avoiding scenarios where one model monopolizes all GPUs while another starves.
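A minimal sketch of that allocation rule, using integer division since fractional workers are not possible:

```python
def gpu_workers_per_model(num_gpus: int, num_unique_models: int) -> int:
    """number_of_gpu = (hardware GPUs) // (unique models), floored."""
    if num_unique_models <= 0:
        raise ValueError("need at least one model")
    return num_gpus // num_unique_models


# 4 GPUs serving 2 unique models -> 2 GPU workers each, no contention
assert gpu_workers_per_model(4, 2) == 2
```

With more models than GPUs the floor drops to zero, which signals that the hardware cannot give every model a dedicated GPU worker.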

Code Evidence

vLLM batch_size=1 enforcement from `ts/torch_handler/vllm_handler.py:109`:

assert len(requests) == 1, "Expecting batch_size = 1"

GPT-Fast batch_size=1 enforcement from `examples/large_models/gpt_fast/handler.py:115-117`:

assert (
    len(requests) == 1
), "GPT fast is currently only supported with batch_size=1"

LLM launcher default config from `ts/llm_launcher.py:64-73`:

model_config = {
    "minWorkers": 1,
    "maxWorkers": 1,
    "batchSize": 1,
    "maxBatchDelay": 100,
    "responseTimeout": 1200,
    "startupTimeout": args.startup_timeout,
    "deviceType": "gpu",
    "asyncCommunication": True,
}

GPU worker formula from `docs/performance_guide.md:79`:

ValueToSet = (Number of Hardware GPUs) / (Number of Unique Models)
