
Environment:Huggingface Open r1 vLLM Server

From Leeroopedia


Knowledge Sources
Domains Infrastructure, Inference
Last Updated 2026-02-08 00:00 GMT

Overview

A vLLM inference server deployment providing an OpenAI-compatible API for synthetic data generation and pass rate filtering pipelines.

Description

This environment defines the vLLM server deployment required by the Distilabel pipeline builder and the async reasoning generation script. The server exposes an OpenAI-compatible /v1/chat/completions endpoint that clients connect to for batch generation. Open-R1 uses two distinct patterns for connecting to vLLM: the Distilabel OpenAILLM wrapper (which sends requests to http://localhost:8000/v1 by default) and the raw aiohttp client in generate_reasoning.py (which defaults to localhost:39876). vLLM is also used in the pass rate filtering pipeline, where it runs as an in-process vllm.LLM engine rather than as a standalone server.
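The in-process pattern mentioned above can be sketched as follows. This is a minimal illustration, not Open-R1's actual filtering code: the prompts are placeholders, the pass_rate helper is ours, and actually running the engine requires a CUDA-capable GPU.

```python
# Sketch of the in-process vllm.LLM pattern used by pass rate filtering.
# Illustrative only; requires a CUDA-capable GPU to actually run.

def pass_rate(verdicts):
    """Fraction of sampled completions judged correct for one problem."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

if __name__ == "__main__":
    from vllm import LLM, SamplingParams

    # Load the model directly into this process: no HTTP server involved.
    llm = LLM(model="deepseek-ai/DeepSeek-R1", tensor_parallel_size=8)
    params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096, n=8)
    outputs = llm.generate(["Solve: 2 + 2 = ?"], params)
    # Each RequestOutput carries n completions; grade them with a verifier,
    # then keep problems whose pass_rate falls inside the desired band.
```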

Usage

Use this environment when running Synthetic Data Generation (Distilabel pipeline) or High-Concurrency Inference (async reasoning generation). It is also indirectly required by the Pass Rate Filtering pipeline which uses vLLM's in-process LLM engine. The vLLM server should be started separately before running the generation scripts.

System Requirements

  • OS: Linux (vLLM officially supports Linux only)
  • Hardware: NVIDIA GPU(s) with enough VRAM to serve the target model; use multiple GPUs for tensor parallelism
  • CUDA: 12.4 (must match the PyTorch/vLLM build)
  • Network: a free TCP port (defaults: 8000 for the Distilabel pipeline, 39876 for generate_reasoning.py)

Dependencies

System Packages

  • cuda-toolkit = 12.4

Python Packages

  • vllm == 0.8.5.post1
  • torch == 2.6.0
  • transformers == 4.52.3
  • distilabel[vllm,ray,openai] >= 1.5.2 (for Distilabel pipeline)
  • aiohttp (for async generation script)
  • uvloop (for high-performance async event loop in generate_reasoning.py)
  • tqdm (for progress tracking)
  • aiofiles >= 24.1.0 (for async file I/O)

Credentials

The following environment variables may be required:

  • HF_TOKEN: HuggingFace API token for downloading gated models to serve via vLLM.

Note: The Distilabel pipeline sets api_key="something" as a placeholder since the vLLM server is local and does not require authentication. The generate_reasoning.py script uses Authorization: Bearer EMPTY.

Quick Install

# Install vLLM (in the same environment as Open-R1)
uv pip install vllm==0.8.5.post1

# Start vLLM server for Distilabel pipeline usage
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1 \
    --port 8000 \
    --tensor-parallel-size 8

# Or start vLLM server for generate_reasoning.py usage
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1 \
    --port 39876 \
    --tensor-parallel-size 8
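Before kicking off either generation workflow, it can help to confirm the server is actually accepting requests on the expected port. The helper below is a hypothetical convenience (not part of Open-R1) that probes the OpenAI-compatible /v1/models endpoint using only the standard library:

```python
import json
import urllib.error
import urllib.request

def server_ready(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the vLLM server answers on /v1/models."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200 and "data" in json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return False

# e.g. server_ready("http://localhost:8000") before the Distilabel pipeline,
# or server_ready("http://localhost:39876") before generate_reasoning.py.
```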

Code Evidence

Default vLLM URL in Distilabel pipeline from src/open_r1/generate.py:25:

def build_distilabel_pipeline(
    model: str,
    base_url: str = "http://localhost:8000/v1",
    ...
    timeout: int = 900,
    retries: int = 0,
) -> Pipeline:

Dummy API key for local vLLM from src/open_r1/generate.py:49:

llm=OpenAILLM(
    base_url=base_url,
    api_key="something",
    model=model,
    timeout=timeout,
    max_retries=retries,
    generation_kwargs=generation_kwargs,
),

Async client connecting to vLLM from scripts/generate_reasoning.py:26-36:

async with session.post(
    f"http://{args.api_addr}/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": args.max_tokens,
        "temperature": args.temperature,
        "top_p": args.top_p,
    },
    headers={"Authorization": "Bearer EMPTY"},
) as response:
    return await response.json(content_type=None)

High-concurrency client configuration from scripts/generate_reasoning.py:143-145:

async with aiohttp.ClientSession(
    timeout=aiohttp.ClientTimeout(total=60 * 60),
    connector=aiohttp.TCPConnector(limit=args.max_concurrent, ttl_dns_cache=300, keepalive_timeout=60 * 60),
) as session:

Common Errors

  • Connection refused on port 8000/39876: the vLLM server is not running. Start it before running the generation scripts.
  • API error (will retry): the server is overloaded or temporarily unavailable. The script has a built-in retry budget of 10 with a 10 s backoff; check server health.
  • Model OOM during serving: the model is too large for the available VRAM. Increase --tensor-parallel-size or serve a smaller model.
  • Timeout after 900s: generation is taking too long. Increase the timeout parameter or reduce max_new_tokens.
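The retry behavior described for "API error (will retry)" can be sketched as a simple budgeted loop. This is an illustrative reconstruction, not the script's actual code; the names and structure are ours.

```python
import time

RETRY_BUDGET = 10      # total retries before giving up
RETRY_BACKOFF_S = 10   # fixed sleep between attempts, in seconds

def call_with_retry(request_fn, budget=RETRY_BUDGET, backoff_s=RETRY_BACKOFF_S):
    """Retry a flaky request until it succeeds or the budget is exhausted."""
    while True:
        try:
            return request_fn()
        except Exception:
            if budget <= 0:
                raise
            budget -= 1
            time.sleep(backoff_s)
```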

Compatibility Notes

  • Port configuration: Distilabel pipeline defaults to port 8000, while generate_reasoning.py defaults to port 39876. Ensure the correct port is used for each workflow.
  • Ray backend: The Distilabel pipeline uses Ray for distributed processing (Pipeline().ray()). Ensure Ray is properly configured if using multiple client replicas.
  • uvloop: The generate_reasoning.py script activates uvloop as the asyncio event loop (uvloop.install()) for faster async I/O.
  • Concurrency limit: The default max_concurrent is 1000 simultaneous requests; adjust based on server capacity.
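One common way to enforce a client-side cap like max_concurrent, alongside the TCPConnector limit shown in Code Evidence, is an asyncio semaphore. A minimal sketch (our names, not the script's):

```python
import asyncio

async def bounded_gather(coros, limit=1000):
    """Run coroutines concurrently, keeping at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```

With limit=1000 this mirrors the default described above; lower it if the server starts refusing or timing out requests.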
