Implementation: Hugging Face Open-R1 Build Distilabel Pipeline
Overview
A concrete tool from Open-R1 for constructing a Distilabel-based reasoning-generation pipeline backed by a vLLM server.
Description
The build_distilabel_pipeline function creates a Distilabel Pipeline configured with a Ray backend for distributed execution. It uses OpenAILLM (pointed at a vLLM server) as the LLM provider and TextGeneration as the processing step. The pipeline supports configurable prompt templates (Jinja2 syntax), multiple client replicas for throughput, and batched processing.
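To make the prompt-template behavior concrete, here is a minimal standalone sketch (not open-r1 code) of how a Jinja2 template like the default {{ instruction }} renders a dataset row into the prompt sent to the model; the template text and column name below are illustrative:

```python
from jinja2 import Template

# Hypothetical template; the default in build_distilabel_pipeline is
# "{{ instruction }}", which passes the prompt column through unchanged.
template = Template("Solve the following problem:\n{{ problem }}")

# Each dataset row supplies the template variables via keyword arguments.
prompt = template.render(problem="What is 2 + 2?")
print(prompt)
```

Any column referenced in the template must exist in the source dataset (or be mapped via prompt_column), since Jinja2 renders missing variables as empty strings by default.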
Usage
Import when you need to generate synthetic reasoning traces from a teacher model served via vLLM.
Code Reference
Source: Repository: open-r1, File: src/open_r1/generate.py, Lines: L23-63
Signature:
def build_distilabel_pipeline(
    model: str,
    base_url: str = "http://localhost:8000/v1",
    prompt_column: Optional[str] = None,
    prompt_template: str = "{{ instruction }}",
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    max_new_tokens: int = 8192,
    num_generations: int = 1,
    input_batch_size: int = 64,
    client_replicas: int = 1,
    timeout: int = 900,
    retries: int = 0,
) -> Pipeline:
Import:
from open_r1.generate import build_distilabel_pipeline
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | str | Yes | vLLM model name |
| base_url | str | No | vLLM server URL (default: http://localhost:8000/v1) |
| prompt_column | str | No | Column name in the source dataset to use as prompt |
| prompt_template | str | No | Jinja2 template for formatting prompts (default: {{ instruction }}) |
| temperature | float | No | Sampling temperature for generation |
| top_p | float | No | Nucleus sampling parameter |
| max_new_tokens | int | No | Maximum number of new tokens to generate (default: 8192) |
| num_generations | int | No | Number of completions per prompt (default: 1) |
| input_batch_size | int | No | Batch size for processing inputs (default: 64) |
| client_replicas | int | No | Number of parallel client replicas for throughput (default: 1) |
| timeout | int | No | Request timeout in seconds (default: 900) |
| retries | int | No | Number of retries on failure (default: 0) |
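To illustrate how the throughput parameters interact, the following standalone sketch (not open-r1 code; function names are hypothetical) shows prompts being chunked into groups of input_batch_size and distributed round-robin across client_replicas workers:

```python
def split_into_batches(prompts, input_batch_size=64):
    """Chunk prompts into consecutive batches, mirroring input_batch_size."""
    return [prompts[i:i + input_batch_size]
            for i in range(0, len(prompts), input_batch_size)]

def assign_to_replicas(batches, client_replicas=1):
    """Distribute batches round-robin across replicas, mirroring client_replicas."""
    assignments = {r: [] for r in range(client_replicas)}
    for idx, batch in enumerate(batches):
        assignments[idx % client_replicas].append(batch)
    return assignments

# Ten prompts, batch size 4 -> batches of 4, 4, and 2 prompts.
batches = split_into_batches(list(range(10)), input_batch_size=4)
replica_work = assign_to_replicas(batches, client_replicas=2)
```

With a single vLLM server, raising client_replicas increases the number of concurrent requests in flight rather than the number of servers, so throughput gains depend on the server's capacity.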
Outputs
| Return Type | Description |
|---|---|
| Pipeline | Configured Distilabel Pipeline with Ray backend, ready for execution via pipeline.run() |
Usage Examples
from datasets import load_dataset
from open_r1.generate import build_distilabel_pipeline
# Build the pipeline with a DeepSeek-R1 model served via vLLM
pipeline = build_distilabel_pipeline(
    model="deepseek-ai/DeepSeek-R1",
    base_url="http://localhost:8000/v1",
    prompt_column="problem",
    prompt_template="{{ problem }}",
    temperature=0.6,
    top_p=0.95,
    max_new_tokens=8192,
    num_generations=1,
    input_batch_size=64,
    client_replicas=4,
    timeout=900,
)
# Load the source dataset
dataset = load_dataset("AI-MO/NuminaMath-TIR", split="train")
# Run the pipeline to generate reasoning traces
distiset = pipeline.run(dataset=dataset)
# Push the generated dataset to the Hub
distiset.push_to_hub("my-org/numina-deepseek-r1-traces")