
Implementation: vLLM Serve CLI (vllm project)

From Leeroopedia


Knowledge Sources
Domains: LLM Serving, CLI Tools, HTTP Services
Last Updated: 2026-02-08 13:00 GMT

Overview

A concrete tool, provided by the vllm library, for launching an OpenAI-compatible HTTP API server.

Description

The vllm serve command is the primary CLI entry point for deploying a vLLM inference server. It parses command-line arguments, constructs the engine configuration, and starts a FastAPI/uvicorn HTTP server that exposes OpenAI-compatible endpoints. The command is implemented by the ServeSubcommand class in vllm/entrypoints/cli/serve.py.

The serve command supports three operational modes:

  • Single API server (default): One process handles both the HTTP frontend and the inference engine.
  • Multi API server: Multiple API server processes share a set of engine cores, useful for data-parallel deployments.
  • Headless mode: Engine cores run without an HTTP frontend, for use in disaggregated or multi-node deployments.

The server defaults to the Qwen/Qwen3-0.6B model if none is specified. All EngineArgs parameters are available as CLI flags (with hyphens replacing underscores).
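
The commands below sketch how each mode is selected on the command line. The --data-parallel-size flag is assumed from EngineArgs (it is not listed in the I/O contract below), and the additional coordination flags a real multi-node headless deployment needs are omitted.

# Single API server (default): one process owns both the HTTP frontend and the engine
vllm serve meta-llama/Llama-2-7b-chat-hf

# Multi API server (sketch): several frontend processes sharing data-parallel engine cores
vllm serve meta-llama/Llama-2-7b-chat-hf \
    --data-parallel-size 2 \
    --api-server-count 2

# Headless mode (sketch): engine cores only, no HTTP frontend; in practice combined with
# data-parallel flags so the cores can serve a frontend running on another node
vllm serve meta-llama/Llama-2-7b-chat-hf --headless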

Usage

Use vllm serve to start a production or development inference server. The model is specified as a positional argument, and all engine and frontend parameters are available as optional flags. The server runs until interrupted (Ctrl+C or SIGTERM).
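
Once the server reports it is ready, any OpenAI-compatible client can talk to it. A minimal smoke test with curl against the default localhost:8000 binding (the model name is assumed to match whatever is being served):

# Send a chat completion request to the running server
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-2-7b-chat-hf", "messages": [{"role": "user", "content": "Hello!"}]}'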

Code Reference

Source Location

  • Repository: vllm
  • File: vllm/entrypoints/cli/serve.py (Lines 48-111)
  • Related: vllm/entrypoints/openai/cli_args.py (argument definitions)

Signature

vllm serve [model_tag] [options]

Import

# Not typically imported; invoked via CLI. For programmatic use:
from vllm.entrypoints.openai.api_server import run_server

I/O Contract

Inputs

Name | Type | Required | Description
model_tag (positional) | str | No | HuggingFace model ID or local path. Defaults to Qwen/Qwen3-0.6B if omitted.
--tensor-parallel-size | int | No | Number of GPUs for tensor parallelism. Default: 1.
--host | str | No | Hostname to bind to. Default: None (all interfaces).
--port | int | No | Port number to listen on. Default: 8000.
--api-key | str | No | API key(s) for authenticating client requests. Default: None (no auth).
--dtype | str | No | Model weight data type: "auto", "float16", "bfloat16", "float32". Default: "auto".
--quantization | str | No | Quantization method: "awq", "gptq", "fp8", etc. Default: None.
--max-model-len | int | No | Maximum total sequence length. Default: derived from model config.
--gpu-memory-utilization | float | No | Fraction of GPU memory for the engine (0.0-1.0). Default: 0.9.
--enable-lora | flag | No | Enable LoRA adapter support.
--chat-template | str | No | Path to a Jinja2 chat template file.
--config | str | No | Path to a YAML config file with CLI options.
--headless | flag | No | Run without an HTTP frontend (for multi-node setups).
--api-server-count | int | No | Number of API server worker processes. Default: data_parallel_size.

Outputs

Name | Type | Description
HTTP Server | Running process | A uvicorn HTTP server listening on the specified host:port.
/v1/chat/completions | HTTP endpoint | OpenAI-compatible chat completion endpoint.
/v1/completions | HTTP endpoint | OpenAI-compatible text completion endpoint.
/v1/models | HTTP endpoint | Lists available models.
/metrics | HTTP endpoint | Prometheus metrics in text format.
/health | HTTP endpoint | Health check returning 200 when ready.
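
The non-inference endpoints are useful for readiness probes and monitoring; for example, against the default localhost:8000 binding:

# Readiness probe: returns HTTP 200 once the engine is ready to accept requests
curl -i http://localhost:8000/health

# List the model(s) exposed by this server
curl http://localhost:8000/v1/models

# Scrape Prometheus metrics in text format
curl http://localhost:8000/metrics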

Usage Examples

Basic Server Launch

# Serve a chat model on default port 8000
vllm serve meta-llama/Llama-2-7b-chat-hf

Multi-GPU with Authentication

# Serve a 70B model across 4 GPUs with API key protection
vllm serve meta-llama/Llama-2-70b-chat-hf \
    --tensor-parallel-size 4 \
    --dtype float16 \
    --api-key my-secret-key \
    --host 0.0.0.0 \
    --port 8080
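
With --api-key set, clients must present the key as an OpenAI-style bearer token. A request against the server started above might look like this (the host placeholder is illustrative; key and port are taken from the example):

# Authenticated request against the server started above
curl http://<server-host>:8080/v1/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer my-secret-key" \
    -d '{"model": "meta-llama/Llama-2-70b-chat-hf", "prompt": "Hello", "max_tokens": 16}'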

Quantized Model with Custom Sequence Length

# Serve a quantized model with a capped sequence length
vllm serve TheBloke/Llama-2-13B-Chat-AWQ \
    --quantization awq \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.85

Using a YAML Config File

# Load all settings from a config file
vllm serve --config serve_config.yaml
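
A sketch of what serve_config.yaml might contain, assuming the keys mirror the CLI flag names without the leading dashes; the model key is likewise an assumption, since the model tag can also be passed positionally on the command line:

# Write a config file (sketch), then launch the server from it
cat > serve_config.yaml <<'EOF'
model: meta-llama/Llama-2-7b-chat-hf
host: 0.0.0.0
port: 8080
tensor-parallel-size: 2
gpu-memory-utilization: 0.85
EOF
vllm serve --config serve_config.yaml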

Related Pages

Implements Principle

Requires Environment
