
Heuristic:TorchServe LLM Timeout Configuration

From Leeroopedia
Domains LLMs, Configuration
Last Updated 2026-02-13 00:00 GMT

Overview

Timeout and async configuration for LLM serving: responseTimeout=1200s, startupTimeout=1200s, and asyncCommunication=true are essential defaults.

Description

Large language models have fundamentally different timing characteristics than traditional vision or classification models. Model loading can take 5-15 minutes for multi-billion parameter models, and inference for long sequences can take minutes. TorchServe's default timeouts are too short for LLM workloads. The LLM launcher hardcodes a 1200-second (20-minute) response timeout and enables async communication to support non-blocking streaming inference. The `maxBatchDelay` of 100ms is kept low for LLMs to minimize first-token latency.

Usage

Apply this heuristic when deploying any large language model via TorchServe, regardless of engine (vLLM or TensorRT-LLM). Failure to increase timeouts is one of the most common causes of LLM deployment failures.

The Insight (Rule of Thumb)

  • responseTimeout: Set to 1200 seconds (20 minutes) for LLM inference. TorchServe's default response timeout (120 seconds) is far too short for multi-minute generations.
  • startupTimeout: Set to 1200 seconds (20 minutes). Large models take time to download and load into GPU memory.
  • maxBatchDelay: Keep at 100ms for LLMs. Lower values reduce first-token latency.
  • asyncCommunication: Must be `true` for streaming token generation and non-blocking inference.
  • Workers: Use `minWorkers: 1, maxWorkers: 1` for LLMs. Each worker loads a full model copy.
  • Trade-off: Higher timeouts consume more server resources per connection. Set as high as needed but not arbitrarily higher.
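
The rules above can be folded into a quick sanity check. A minimal sketch (the `RECOMMENDED` table and `check_llm_config` helper are illustrative, not part of TorchServe):

```python
# Recommended LLM serving defaults from this heuristic (illustrative helper,
# not a TorchServe API).
RECOMMENDED = {
    "minWorkers": 1,
    "maxWorkers": 1,
    "maxBatchDelay": 100,        # ms; keep low for first-token latency
    "responseTimeout": 1200,     # seconds
    "startupTimeout": 1200,      # seconds
    "asyncCommunication": True,  # required for streaming
}

def check_llm_config(cfg: dict) -> list:
    """Return (key, found, recommended) tuples for every deviating setting."""
    return [(k, cfg.get(k), v) for k, v in RECOMMENDED.items() if cfg.get(k) != v]
```

For example, `check_llm_config({"responseTimeout": 120})` flags the short timeout along with every other missing key.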

Reasoning

LLMs differ from traditional models in three critical ways:

1. Startup time: A 7B parameter model is ~14GB in fp16. Downloading from HuggingFace Hub and loading into GPU memory takes minutes, not seconds. Without sufficient startup timeout, the worker process is killed before the model finishes loading.

2. Inference time: Generating 2048 tokens at 30 tokens/second takes ~68 seconds. For longer sequences or slower hardware, inference can approach minutes. The default TorchServe response timeout would terminate the request prematurely.
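
The numbers in points 1 and 2 are simple arithmetic. A sketch (all figures are rough assumptions; real load times depend on disk, network, and GPU):

```python
# Back-of-envelope timing for a 7B fp16 model (rough, assumed figures).
params = 7e9
bytes_per_param = 2                           # fp16
model_gb = params * bytes_per_param / 1e9     # ~14 GB of weights

disk_read_gb_per_s = 0.5                      # assumed slow shared storage
load_seconds = model_gb / disk_read_gb_per_s  # ~28 s just to read weights;
                                              # download + GPU init add minutes

tokens, tok_per_s = 2048, 30
gen_seconds = tokens / tok_per_s              # ~68 s for one long completion
```

Both estimates sit comfortably inside a 1200-second budget but well beyond short default timeouts.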

3. Streaming: LLM applications expect token-by-token streaming. The `asyncCommunication: true` flag enables TorchServe's intermediate response mechanism, allowing the handler to send tokens as they are generated rather than waiting for the full response.

The TensorRT-LLM engine additionally exposes `kv_cache_free_gpu_memory_fraction` (default 0.1), which allocates that fraction of the GPU memory still free after the model weights are loaded to the KV cache, leaving the remainder for activations and other buffers.
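
A rough sense of what that fraction buys (the GPU size, model shape, and per-token cache cost below are illustrative assumptions for a Llama-style 7B model):

```python
# How much KV cache does free_gpu_memory_fraction = 0.1 buy? (illustrative)
gpu_gb = 24.0
model_gb = 14.0                          # 7B fp16 weights
free_gb = gpu_gb - model_gb              # memory left after loading: 10 GB
kv_cache_gb = 0.1 * free_gb              # fraction given to KV cache: 1 GB

# Per-token KV cost for a Llama-7B-like shape: 2 tensors (K and V)
# x 32 layers x 4096 hidden dim x 2 bytes (fp16) = 0.5 MB per token.
bytes_per_token = 2 * 32 * 4096 * 2
max_cached_tokens = int(kv_cache_gb * 1e9 // bytes_per_token)  # ~1900 tokens
```

Under these assumptions the cache holds only about 1900 tokens, which is why the fraction must be tuned against the expected context lengths and batch sizes.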

Code Evidence

LLM launcher defaults from `ts/llm_launcher.py:63-73`:

model_config = {
    "minWorkers": 1,
    "maxWorkers": 1,
    "batchSize": 1,
    "maxBatchDelay": 100,
    "responseTimeout": 1200,
    "startupTimeout": args.startup_timeout,
    "deviceType": "gpu",
    "asyncCommunication": True,
}

TensorRT-LLM KV cache config from `ts/llm_launcher.py:116-120`:

"kv_cache_config": {
    "free_gpu_memory_fraction": getattr(
        args, "trt_llm_engine.kv_cache_free_gpu_memory_fraction"
    ),
},
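
The `getattr` call is needed because the dotted option name is stored on the argparse `Namespace` as a literal attribute, which plain `args.x` access cannot reach. A minimal sketch (the flag name here is hypothetical):

```python
import argparse

parser = argparse.ArgumentParser()
# A dotted dest is stored verbatim on the Namespace via setattr(), so it can
# only be read back with getattr(), not ordinary attribute access.
parser.add_argument(
    "--kv-cache-fraction",
    dest="trt_llm_engine.kv_cache_free_gpu_memory_fraction",
    type=float,
    default=0.1,
)
args = parser.parse_args([])
frac = getattr(args, "trt_llm_engine.kv_cache_free_gpu_memory_fraction")
```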

Streaming response from `ts/torch_handler/vllm_handler.py:150-156`:

if request.stream:
    async for response in g:
        if response != "data: [DONE]\n\n":
            send_intermediate_predict_response(
                [response], context.request_ids, "Result", 200, context
            )
    return [response]
