Implementation: Hugging Face Open-R1 Build Distilabel Pipeline
Overview
A concrete tool from Open-R1 for constructing a Distilabel-based reasoning-generation pipeline backed by a vLLM server.
Description
The build_distilabel_pipeline function creates a Distilabel Pipeline configured with a Ray backend for distributed execution. It uses OpenAILLM (pointed at a vLLM server) as the LLM provider and TextGeneration as the processing step. The pipeline supports configurable prompt templates (Jinja2 syntax), multiple client replicas for throughput, and batched processing.
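To make the prompt-template behavior concrete, here is a minimal standalone sketch (not open-r1 code) of how a Jinja2 template like the default {{ instruction }} renders a dataset row into the prompt sent to the model; the template text and column name below are illustrative:

```python
from jinja2 import Template

# Hypothetical template; the default in build_distilabel_pipeline is
# "{{ instruction }}", which passes the prompt column through unchanged.
template = Template("Solve the following problem:\n{{ problem }}")

# Each dataset row supplies the template variables via keyword arguments.
prompt = template.render(problem="What is 2 + 2?")
print(prompt)
```

Any column referenced in the template must exist in the source dataset (or be mapped via prompt_column), since Jinja2 renders missing variables as empty strings by default.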
Usage
Import when you need to generate synthetic reasoning traces from a teacher model served via vLLM.
Code Reference
Source: Repository: open-r1, File: src/open_r1/generate.py, Lines: L23-63
Signature:
def build_distilabel_pipeline(
    model: str,
    base_url: str = "http://localhost:8000/v1",
    prompt_column: Optional[str] = None,
    prompt_template: str = "{{ instruction }}",
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    max_new_tokens: int = 8192,
    num_generations: int = 1,
    input_batch_size: int = 64,
    client_replicas: int = 1,
    timeout: int = 900,
    retries: int = 0,
) -> Pipeline:
Import:
from open_r1.generate import build_distilabel_pipeline
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | vLLM model name |
| base_url | str | No | vLLM server URL (default: http://localhost:8000/v1) |
| prompt_column | str | No | Column name in the source dataset to use as prompt |
| prompt_template | str | No | Jinja2 template for formatting prompts (default: {{ instruction }}) |
| temperature | float | No | Sampling temperature for generation |
| top_p | float | No | Nucleus sampling parameter |
| max_new_tokens | int | No | Maximum number of new tokens to generate (default: 8192) |
| num_generations | int | No | Number of completions per prompt (default: 1) |
| input_batch_size | int | No | Batch size for processing inputs (default: 64) |
| client_replicas | int | No | Number of parallel client replicas for throughput (default: 1) |
| timeout | int | No | Request timeout in seconds (default: 900) |
| retries | int | No | Number of retries on failure (default: 0) |
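To illustrate how the throughput parameters interact, the following standalone sketch (not open-r1 code; function names are hypothetical) shows prompts being chunked into groups of input_batch_size and distributed round-robin across client_replicas workers:

```python
def split_into_batches(prompts, input_batch_size=64):
    """Chunk prompts into consecutive batches, mirroring input_batch_size."""
    return [prompts[i:i + input_batch_size]
            for i in range(0, len(prompts), input_batch_size)]

def assign_to_replicas(batches, client_replicas=1):
    """Distribute batches round-robin across replicas, mirroring client_replicas."""
    assignments = {r: [] for r in range(client_replicas)}
    for idx, batch in enumerate(batches):
        assignments[idx % client_replicas].append(batch)
    return assignments

# Ten prompts, batch size 4 -> batches of 4, 4, and 2 prompts.
batches = split_into_batches(list(range(10)), input_batch_size=4)
replica_work = assign_to_replicas(batches, client_replicas=2)
```

With a single vLLM server, raising client_replicas increases the number of concurrent requests in flight rather than the number of servers, so throughput gains depend on the server's capacity.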
Outputs
| Return Type | Description |
|---|---|
| Pipeline | Configured Distilabel Pipeline with Ray backend, ready for execution via pipeline.run() |
Usage Examples
from datasets import load_dataset
from open_r1.generate import build_distilabel_pipeline
# Build the pipeline with a DeepSeek-R1 model served via vLLM
pipeline = build_distilabel_pipeline(
    model="deepseek-ai/DeepSeek-R1",
    base_url="http://localhost:8000/v1",
    prompt_column="problem",
    prompt_template="{{ problem }}",
    temperature=0.6,
    top_p=0.95,
    max_new_tokens=8192,
    num_generations=1,
    input_batch_size=64,
    client_replicas=4,
    timeout=900,
)
# Load the source dataset
dataset = load_dataset("AI-MO/NuminaMath-TIR", split="train")
# Run the pipeline to generate reasoning traces
distiset = pipeline.run(dataset=dataset)
# Push the generated dataset to the Hub
distiset.push_to_hub("my-org/numina-deepseek-r1-traces")