
Environment:Huggingface Open r1 vLLM Server

From Leeroopedia


Knowledge Sources
Domains Infrastructure, Inference
Last Updated 2026-02-08 00:00 GMT

Overview

A vLLM inference server deployment providing an OpenAI-compatible API for synthetic data generation and pass rate filtering pipelines.

Description

This environment defines the vLLM server deployment required by the Distilabel pipeline builder and the async reasoning generation script. The server exposes an OpenAI-compatible /v1/chat/completions endpoint that clients connect to for batch generation. Open-R1 uses two distinct patterns for connecting to vLLM: the Distilabel OpenAILLM wrapper (which sends requests to http://localhost:8000/v1 by default) and the raw aiohttp client in generate_reasoning.py (which defaults to localhost:39876). vLLM is also used in the pass rate filtering pipeline, where it runs as an in-process vllm.LLM engine rather than as a standalone server.
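The in-process pattern mentioned above can be sketched as follows. This is a minimal illustration, not Open-R1's actual filtering code: the prompts are placeholders, the pass_rate helper is ours, and actually running the engine requires a CUDA-capable GPU.

```python
# Sketch of the in-process vllm.LLM pattern used by pass rate filtering.
# Illustrative only; requires a CUDA-capable GPU to actually run.

def pass_rate(verdicts):
    """Fraction of sampled completions judged correct for one problem."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

if __name__ == "__main__":
    from vllm import LLM, SamplingParams

    # Load the model directly into this process: no HTTP server involved.
    llm = LLM(model="deepseek-ai/DeepSeek-R1", tensor_parallel_size=8)
    params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=4096, n=8)
    outputs = llm.generate(["Solve: 2 + 2 = ?"], params)
    # Each RequestOutput carries n completions; grade them with a verifier,
    # then keep problems whose pass_rate falls inside the desired band.
```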

Usage

Use this environment when running Synthetic Data Generation (Distilabel pipeline) or High-Concurrency Inference (async reasoning generation). It is also indirectly required by the Pass Rate Filtering pipeline which uses vLLM's in-process LLM engine. The vLLM server should be started separately before running the generation scripts.

System Requirements

  • OS: Linux (vLLM officially supports Linux only)
  • Hardware: NVIDIA GPU(s) with enough VRAM to serve the target model; use multiple GPUs for tensor parallelism
  • CUDA: 12.4 (must match the PyTorch/vLLM build)
  • Network: a free TCP port (defaults: 8000 for the Distilabel pipeline, 39876 for generate_reasoning.py)

Dependencies

System Packages

  • cuda-toolkit = 12.4

Python Packages

  • vllm == 0.8.5.post1
  • torch == 2.6.0
  • transformers == 4.52.3
  • distilabel[vllm,ray,openai] >= 1.5.2 (for Distilabel pipeline)
  • aiohttp (for async generation script)
  • uvloop (for high-performance async event loop in generate_reasoning.py)
  • tqdm (for progress tracking)
  • aiofiles >= 24.1.0 (for async file I/O)

Credentials

The following environment variables may be required:

  • HF_TOKEN: HuggingFace API token for downloading gated models to serve via vLLM.

Note: The Distilabel pipeline sets api_key="something" as a placeholder since the vLLM server is local and does not require authentication. The generate_reasoning.py script uses Authorization: Bearer EMPTY.

Quick Install

# Install vLLM (in the same environment as Open-R1)
uv pip install vllm==0.8.5.post1

# Start vLLM server for Distilabel pipeline usage
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1 \
    --port 8000 \
    --tensor-parallel-size 8

# Or start vLLM server for generate_reasoning.py usage
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1 \
    --port 39876 \
    --tensor-parallel-size 8
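Before kicking off either generation workflow, it can help to confirm the server is actually accepting requests on the expected port. The helper below is a hypothetical convenience (not part of Open-R1) that probes the OpenAI-compatible /v1/models endpoint using only the standard library:

```python
import json
import urllib.error
import urllib.request

def server_ready(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the vLLM server answers on /v1/models."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200 and "data" in json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return False

# e.g. server_ready("http://localhost:8000") before the Distilabel pipeline,
# or server_ready("http://localhost:39876") before generate_reasoning.py.
```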

Code Evidence

Default vLLM URL in Distilabel pipeline from src/open_r1/generate.py:25:

def build_distilabel_pipeline(
    model: str,
    base_url: str = "http://localhost:8000/v1",
    ...
    timeout: int = 900,
    retries: int = 0,
) -> Pipeline:

Dummy API key for local vLLM from src/open_r1/generate.py:49:

llm=OpenAILLM(
    base_url=base_url,
    api_key="something",
    model=model,
    timeout=timeout,
    max_retries=retries,
    generation_kwargs=generation_kwargs,
),

Async client connecting to vLLM from scripts/generate_reasoning.py:26-36:

async with session.post(
    f"http://{args.api_addr}/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": args.max_tokens,
        "temperature": args.temperature,
        "top_p": args.top_p,
    },
    headers={"Authorization": "Bearer EMPTY"},
) as response:
    return await response.json(content_type=None)

High-concurrency client configuration from scripts/generate_reasoning.py:143-145:

async with aiohttp.ClientSession(
    timeout=aiohttp.ClientTimeout(total=60 * 60),
    connector=aiohttp.TCPConnector(limit=args.max_concurrent, ttl_dns_cache=300, keepalive_timeout=60 * 60),
) as session:

Common Errors

  • Connection refused on port 8000/39876: the vLLM server is not running. Start it before running the generation scripts.
  • API error (will retry): the server is overloaded or temporarily unavailable. The script has a built-in retry budget of 10 with a 10 s backoff; check server health.
  • Model OOM during serving: the model is too large for the available VRAM. Increase --tensor-parallel-size or serve a smaller model.
  • Timeout after 900s: generation is taking too long. Increase the timeout parameter or reduce max_new_tokens.
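The retry behavior described for "API error (will retry)" can be sketched as a simple budgeted loop. This is an illustrative reconstruction, not the script's actual code; the names and structure are ours.

```python
import time

RETRY_BUDGET = 10      # total retries before giving up
RETRY_BACKOFF_S = 10   # fixed sleep between attempts, in seconds

def call_with_retry(request_fn, budget=RETRY_BUDGET, backoff_s=RETRY_BACKOFF_S):
    """Retry a flaky request until it succeeds or the budget is exhausted."""
    while True:
        try:
            return request_fn()
        except Exception:
            if budget <= 0:
                raise
            budget -= 1
            time.sleep(backoff_s)
```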

Compatibility Notes

  • Port configuration: Distilabel pipeline defaults to port 8000, while generate_reasoning.py defaults to port 39876. Ensure the correct port is used for each workflow.
  • Ray backend: The Distilabel pipeline uses Ray for distributed processing (Pipeline().ray()). Ensure Ray is properly configured if using multiple client replicas.
  • uvloop: The generate_reasoning.py script activates uvloop as the asyncio event loop (uvloop.install()) for faster async I/O.
  • Concurrency limit: The default max_concurrent is 1000 simultaneous requests; adjust based on server capacity.
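One common way to enforce a client-side cap like max_concurrent, alongside the TCPConnector limit shown in Code Evidence, is an asyncio semaphore. A minimal sketch (our names, not the script's):

```python
import asyncio

async def bounded_gather(coros, limit=1000):
    """Run coroutines concurrently, keeping at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def run(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros))
```

With limit=1000 this mirrors the default described above; lower it if the server starts refusing or timing out requests.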
