Implementation: vLLM project, vllm serve CLI
| Knowledge Sources | |
|---|---|
| Domains | LLM Serving, CLI Tools, HTTP Services |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
A concrete CLI tool, provided by the vLLM library, for launching an OpenAI-compatible HTTP API server.
Description
The vllm serve command is the primary CLI entry point for deploying a vLLM inference server. It parses command-line arguments, constructs the engine configuration, and starts a FastAPI/uvicorn HTTP server that exposes OpenAI-compatible endpoints. The command is implemented by the ServeSubcommand class in vllm/entrypoints/cli/serve.py.
The serve command supports three operational modes:
- Single API server (default): One process handles both the HTTP frontend and the inference engine.
- Multi API server: Multiple API server processes share a set of engine cores, useful for data-parallel deployments.
- Headless mode: Engine cores run without an HTTP frontend, for use in disaggregated or multi-node deployments.
The server defaults to the Qwen/Qwen3-0.6B model if none is specified. All EngineArgs parameters are available as CLI flags (with hyphens replacing underscores).
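The underscore-to-hyphen mapping between EngineArgs field names and CLI flags can be sketched with a one-line helper (illustrative only; `to_cli_flag` is not part of vLLM):

```python
def to_cli_flag(field_name: str) -> str:
    """Map an EngineArgs field name to its CLI flag form.

    Illustrative helper (not part of vLLM): argparse-style flags
    replace underscores with hyphens and prepend a leading "--".
    """
    return "--" + field_name.replace("_", "-")


# e.g. the EngineArgs field tensor_parallel_size becomes:
print(to_cli_flag("tensor_parallel_size"))  # --tensor-parallel-size
```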
Usage
Use vllm serve to start a production or development inference server. The model is specified as a positional argument, and all engine and frontend parameters are available as optional flags. The server runs until interrupted (Ctrl+C or SIGTERM).
Code Reference
Source Location
- Repository: vllm
- File: vllm/entrypoints/cli/serve.py (lines 48-111)
- Related: vllm/entrypoints/openai/cli_args.py (argument definitions)
Signature
```shell
vllm serve [model_tag] [options]
```
Import
```python
# Not typically imported; invoked via CLI. For programmatic use:
from vllm.entrypoints.openai.api_server import run_server
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_tag | positional str | No | HuggingFace model ID or local path. Defaults to Qwen/Qwen3-0.6B if omitted. |
| --tensor-parallel-size | int | No | Number of GPUs for tensor parallelism. Default: 1. |
| --host | str | No | Hostname to bind to. Default: None (all interfaces). |
| --port | int | No | Port number to listen on. Default: 8000. |
| --api-key | str | No | API key(s) for authenticating client requests. Default: None (no auth). |
| --dtype | str | No | Model weight data type: "auto", "float16", "bfloat16", "float32". Default: "auto". |
| --quantization | str | No | Quantization method: "awq", "gptq", "fp8", etc. Default: None. |
| --max-model-len | int | No | Maximum total sequence length. Default: derived from model config. |
| --gpu-memory-utilization | float | No | Fraction of GPU memory for the engine (0.0-1.0). Default: 0.9. |
| --enable-lora | flag | No | Enable LoRA adapter support. |
| --chat-template | str | No | Path to a Jinja2 chat template file. |
| --config | str | No | Path to a YAML config file with CLI options. |
| --headless | flag | No | Run without an HTTP frontend (for multi-node setups). |
| --api-server-count | int | No | Number of API server worker processes. Default: data_parallel_size. |
Outputs
| Name | Type | Description |
|---|---|---|
| HTTP Server | Running process | A uvicorn HTTP server listening on the specified host:port. |
| /v1/chat/completions | HTTP endpoint | OpenAI-compatible chat completion endpoint. |
| /v1/completions | HTTP endpoint | OpenAI-compatible text completion endpoint. |
| /v1/models | HTTP endpoint | Lists available models. |
| /metrics | HTTP endpoint | Prometheus metrics in text format. |
| /health | HTTP endpoint | Health check returning 200 when ready. |
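Once the server is up, any HTTP client can call the endpoints above. The following stdlib sketch builds (but does not send) a chat-completion request; the base URL, model name, and API key are placeholder values, and `build_chat_request` is an illustrative helper, not part of vLLM:

```python
import json
import urllib.request
from typing import Optional


def build_chat_request(base_url: str, model: str, messages: list,
                       api_key: Optional[str] = None) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint.

    Sketch only: base_url, model, and api_key are caller-supplied
    placeholders, not defaults from vLLM itself.
    """
    body = json.dumps({"model": model, "messages": messages}).encode()
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Matches the Bearer-token scheme used by the OpenAI API.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=body,
        headers=headers,
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000",               # placeholder host:port
    "meta-llama/Llama-2-7b-chat-hf",       # placeholder model name
    [{"role": "user", "content": "Hello!"}],
    api_key="my-secret-key",               # only needed if --api-key was set
)
# Sending it (requires a running server): urllib.request.urlopen(req)
```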
Usage Examples
Basic Server Launch

```shell
# Serve a chat model on default port 8000
vllm serve meta-llama/Llama-2-7b-chat-hf
```

Multi-GPU with Authentication

```shell
# Serve a 70B model across 4 GPUs with API key protection
vllm serve meta-llama/Llama-2-70b-chat-hf \
    --tensor-parallel-size 4 \
    --dtype float16 \
    --api-key my-secret-key \
    --host 0.0.0.0 \
    --port 8080
```

Quantized Model with Custom Sequence Length

```shell
# Serve a quantized model with a capped sequence length
vllm serve TheBloke/Llama-2-13B-Chat-AWQ \
    --quantization awq \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.85
```

Using a YAML Config File

```shell
# Load all settings from a config file
vllm serve --config serve_config.yaml
```
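A config file passed via --config carries the same options as the CLI flags. The fragment below is a hypothetical serve_config.yaml; key names are assumed to mirror the flag names without the leading "--", so verify the accepted schema against the vLLM documentation for your version:

```yaml
# Hypothetical serve_config.yaml (keys assumed to mirror CLI flag names).
model: meta-llama/Llama-2-7b-chat-hf
host: 0.0.0.0
port: 8080
tensor-parallel-size: 4
gpu-memory-utilization: 0.85
max-model-len: 8192
```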