
Heuristic:TorchServe LLM Timeout Configuration

From Leeroopedia
Domains LLMs, Configuration
Last Updated 2026-02-13 00:00 GMT

Overview

Timeout and async configuration for LLM serving: responseTimeout=1200s, startupTimeout=1200s, and asyncCommunication=true are essential defaults.

Description

Large language models have fundamentally different timing characteristics than traditional vision or classification models. Model loading can take 5-15 minutes for multi-billion parameter models, and inference for long sequences can take minutes. TorchServe's default timeouts are too short for LLM workloads. The LLM launcher hardcodes a 1200-second (20-minute) response timeout and enables async communication to support non-blocking streaming inference. The `maxBatchDelay` of 100ms is kept low for LLMs to minimize first-token latency.

Usage

Apply this heuristic when deploying any large language model via TorchServe, regardless of engine (vLLM or TensorRT-LLM). Failure to increase timeouts is one of the most common causes of LLM deployment failures.

The Insight (Rule of Thumb)

  • responseTimeout: Set to 1200 seconds (20 minutes) for LLM inference. TorchServe's default response timeout (120 seconds) is far too short for multi-minute generations.
  • startupTimeout: Set to 1200 seconds (20 minutes). Large models take time to download and load into GPU memory.
  • maxBatchDelay: Keep at 100ms for LLMs. Lower values reduce first-token latency.
  • asyncCommunication: Must be `true` for streaming token generation and non-blocking inference.
  • Workers: Use `minWorkers: 1, maxWorkers: 1` for LLMs. Each worker loads a full model copy.
  • Trade-off: Higher timeouts consume more server resources per connection. Set as high as needed but not arbitrarily higher.
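
The rules above can be folded into a quick sanity check. A minimal sketch (the `RECOMMENDED` table and `check_llm_config` helper are illustrative, not part of TorchServe):

```python
# Recommended LLM serving defaults from this heuristic (illustrative helper,
# not a TorchServe API).
RECOMMENDED = {
    "minWorkers": 1,
    "maxWorkers": 1,
    "maxBatchDelay": 100,        # ms; keep low for first-token latency
    "responseTimeout": 1200,     # seconds
    "startupTimeout": 1200,      # seconds
    "asyncCommunication": True,  # required for streaming
}

def check_llm_config(cfg: dict) -> list:
    """Return (key, found, recommended) tuples for every deviating setting."""
    return [(k, cfg.get(k), v) for k, v in RECOMMENDED.items() if cfg.get(k) != v]
```

For example, `check_llm_config({"responseTimeout": 120})` flags the short timeout along with every other missing key.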

Reasoning

LLMs differ from traditional models in three critical ways:

1. Startup time: A 7B parameter model is ~14GB in fp16. Downloading from HuggingFace Hub and loading into GPU memory takes minutes, not seconds. Without sufficient startup timeout, the worker process is killed before the model finishes loading.

2. Inference time: Generating 2048 tokens at 30 tokens/second takes ~68 seconds. For longer sequences or slower hardware, inference can approach minutes. The default TorchServe response timeout would terminate the request prematurely.
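
The numbers in points 1 and 2 are simple arithmetic. A sketch (all figures are rough assumptions; real load times depend on disk, network, and GPU):

```python
# Back-of-envelope timing for a 7B fp16 model (rough, assumed figures).
params = 7e9
bytes_per_param = 2                           # fp16
model_gb = params * bytes_per_param / 1e9     # ~14 GB of weights

disk_read_gb_per_s = 0.5                      # assumed slow shared storage
load_seconds = model_gb / disk_read_gb_per_s  # ~28 s just to read weights;
                                              # download + GPU init add minutes

tokens, tok_per_s = 2048, 30
gen_seconds = tokens / tok_per_s              # ~68 s for one long completion
```

Both estimates sit comfortably inside a 1200-second budget but well beyond short default timeouts.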

3. Streaming: LLM applications expect token-by-token streaming. The `asyncCommunication: true` flag enables TorchServe's intermediate response mechanism, allowing the handler to send tokens as they are generated rather than waiting for the full response.

The TensorRT-LLM engine additionally exposes `kv_cache_free_gpu_memory_fraction` (default 0.1), which allocates that fraction of the GPU memory still free after the model weights are loaded to the KV cache, leaving the remainder for activations and other buffers.
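
A rough sense of what that fraction buys (the GPU size, model shape, and per-token cache cost below are illustrative assumptions for a Llama-style 7B model):

```python
# How much KV cache does free_gpu_memory_fraction = 0.1 buy? (illustrative)
gpu_gb = 24.0
model_gb = 14.0                          # 7B fp16 weights
free_gb = gpu_gb - model_gb              # memory left after loading: 10 GB
kv_cache_gb = 0.1 * free_gb              # fraction given to KV cache: 1 GB

# Per-token KV cost for a Llama-7B-like shape: 2 tensors (K and V)
# x 32 layers x 4096 hidden dim x 2 bytes (fp16) = 0.5 MB per token.
bytes_per_token = 2 * 32 * 4096 * 2
max_cached_tokens = int(kv_cache_gb * 1e9 // bytes_per_token)  # ~1900 tokens
```

Under these assumptions the cache holds only about 1900 tokens, which is why the fraction must be tuned against the expected context lengths and batch sizes.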

Code Evidence

LLM launcher defaults from `ts/llm_launcher.py:63-73`:

model_config = {
    "minWorkers": 1,
    "maxWorkers": 1,
    "batchSize": 1,
    "maxBatchDelay": 100,
    "responseTimeout": 1200,
    "startupTimeout": args.startup_timeout,
    "deviceType": "gpu",
    "asyncCommunication": True,
}

TensorRT-LLM KV cache config from `ts/llm_launcher.py:116-120`:

"kv_cache_config": {
    "free_gpu_memory_fraction": getattr(
        args, "trt_llm_engine.kv_cache_free_gpu_memory_fraction"
    ),
},
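
The `getattr` call is needed because the dotted option name is stored on the argparse `Namespace` as a literal attribute, which plain `args.x` access cannot reach. A minimal sketch (the flag name here is hypothetical):

```python
import argparse

parser = argparse.ArgumentParser()
# A dotted dest is stored verbatim on the Namespace via setattr(), so it can
# only be read back with getattr(), not ordinary attribute access.
parser.add_argument(
    "--kv-cache-fraction",
    dest="trt_llm_engine.kv_cache_free_gpu_memory_fraction",
    type=float,
    default=0.1,
)
args = parser.parse_args([])
frac = getattr(args, "trt_llm_engine.kv_cache_free_gpu_memory_fraction")
```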

Streaming response from `ts/torch_handler/vllm_handler.py:150-156`:

if request.stream:
    async for response in g:
        if response != "data: [DONE]\n\n":
            send_intermediate_predict_response(
                [response], context.request_ids, "Result", 200, context
            )
    return [response]
