
Implementation: vLLM Serve CLI (vllm project)

From Leeroopedia


Knowledge Sources
Domains: LLM Serving, CLI Tools, HTTP Services
Last Updated: 2026-02-08 13:00 GMT

Overview

A concrete tool, provided by the vllm library, for launching an OpenAI-compatible HTTP API server.

Description

The vllm serve command is the primary CLI entry point for deploying a vLLM inference server. It parses command-line arguments, constructs the engine configuration, and starts a FastAPI/uvicorn HTTP server that exposes OpenAI-compatible endpoints. The command is implemented by the ServeSubcommand class in vllm/entrypoints/cli/serve.py.

The serve command supports three operational modes:

  • Single API server (default): One process handles both the HTTP frontend and the inference engine.
  • Multi API server: Multiple API server processes share a set of engine cores, useful for data-parallel deployments.
  • Headless mode: Engine cores run without an HTTP frontend, for use in disaggregated or multi-node deployments.

The server defaults to the Qwen/Qwen3-0.6B model if none is specified. All EngineArgs parameters are available as CLI flags (with hyphens replacing underscores).
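
The commands below sketch how each mode is selected on the command line. The --data-parallel-size flag is assumed from EngineArgs (it is not listed in the I/O contract below), and the additional coordination flags a real multi-node headless deployment needs are omitted.

# Single API server (default): one process owns both the HTTP frontend and the engine
vllm serve meta-llama/Llama-2-7b-chat-hf

# Multi API server (sketch): several frontend processes sharing data-parallel engine cores
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --data-parallel-size 2 \
    --api-server-count 2

# Headless mode (sketch): engine cores only, no HTTP frontend; in practice combined with
# data-parallel flags so the cores can serve a frontend running on another node
vllm serve meta-llama/Llama-2-7b-chat-hf --headless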

Usage

Use vllm serve to start a production or development inference server. The model is specified as a positional argument, and all engine and frontend parameters are available as optional flags. The server runs until interrupted (Ctrl+C or SIGTERM).
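
Once the server reports it is ready, any OpenAI-compatible client can talk to it. A minimal smoke test with curl against the default localhost:8000 binding (the model name is assumed to match whatever is being served):

# Send a chat completion request to the running server
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-2-7b-chat-hf", "messages": [{"role": "user", "content": "Hello!"}]}'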

Code Reference

Source Location

  • Repository: vllm
  • File: vllm/entrypoints/cli/serve.py (Lines 48-111)
  • Related: vllm/entrypoints/openai/cli_args.py (argument definitions)

Signature

vllm serve [model_tag] [options]

Import

# Not typically imported; invoked via CLI. For programmatic use:
from vllm.entrypoints.openai.api_server import run_server

I/O Contract

Inputs

Name | Type | Required | Description
model_tag (positional) | str | No | HuggingFace model ID or local path. Defaults to Qwen/Qwen3-0.6B if omitted.
--tensor-parallel-size | int | No | Number of GPUs for tensor parallelism. Default: 1.
--host | str | No | Hostname to bind to. Default: None (all interfaces).
--port | int | No | Port number to listen on. Default: 8000.
--api-key | str | No | API key(s) for authenticating client requests. Default: None (no auth).
--dtype | str | No | Model weight data type: "auto", "float16", "bfloat16", "float32". Default: "auto".
--quantization | str | No | Quantization method: "awq", "gptq", "fp8", etc. Default: None.
--max-model-len | int | No | Maximum total sequence length. Default: derived from model config.
--gpu-memory-utilization | float | No | Fraction of GPU memory for the engine (0.0-1.0). Default: 0.9.
--enable-lora | flag | No | Enable LoRA adapter support.
--chat-template | str | No | Path to a Jinja2 chat template file.
--config | str | No | Path to a YAML config file with CLI options.
--headless | flag | No | Run without an HTTP frontend (for multi-node setups).
--api-server-count | int | No | Number of API server worker processes. Default: data_parallel_size.

Outputs

Name | Type | Description
HTTP Server | Running process | A uvicorn HTTP server listening on the specified host:port.
/v1/chat/completions | HTTP endpoint | OpenAI-compatible chat completion endpoint.
/v1/completions | HTTP endpoint | OpenAI-compatible text completion endpoint.
/v1/models | HTTP endpoint | Lists available models.
/metrics | HTTP endpoint | Prometheus metrics in text format.
/health | HTTP endpoint | Health check returning 200 when ready.
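
The non-inference endpoints are useful for readiness probes and monitoring; for example, against the default localhost:8000 binding:

# Readiness probe: returns HTTP 200 once the engine is ready to accept requests
curl -i http://localhost:8000/health

# List the model(s) exposed by this server
curl http://localhost:8000/v1/models

# Scrape Prometheus metrics in text format
curl http://localhost:8000/metrics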

Usage Examples

Basic Server Launch

# Serve a chat model on default port 8000
vllm serve meta-llama/Llama-2-7b-chat-hf

Multi-GPU with Authentication

# Serve a 70B model across 4 GPUs with API key protection
vllm serve meta-llama/Llama-2-70b-chat-hf \
    --tensor-parallel-size 4 \
    --dtype float16 \
    --api-key my-secret-key \
    --host 0.0.0.0 \
    --port 8080
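
With --api-key set, clients must present the key as an OpenAI-style bearer token. A request against the server started above might look like this (the host placeholder is illustrative; key and port are taken from the example):

# Authenticated request against the server started above
curl http://<server-host>:8080/v1/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer my-secret-key" \
    -d '{"model": "meta-llama/Llama-2-70b-chat-hf", "prompt": "Hello", "max_tokens": 16}'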

Quantized Model with Custom Sequence Length

# Serve a quantized model with a capped sequence length
vllm serve TheBloke/Llama-2-13B-Chat-AWQ \
    --quantization awq \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.85

Using a YAML Config File

# Load all settings from a config file
vllm serve --config serve_config.yaml
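
A sketch of what serve_config.yaml might contain, assuming the keys mirror the CLI flag names without the leading dashes; the model key is likewise an assumption, since the model tag can also be passed positionally on the command line:

# Write a config file (sketch), then launch the server from it
cat > serve_config.yaml <<'EOF'
model: meta-llama/Llama-2-7b-chat-hf
host: 0.0.0.0
port: 8080
tensor-parallel-size: 2
gpu-memory-utilization: 0.85
EOF
vllm serve --config serve_config.yaml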

Related Pages

Implements Principle

Requires Environment
