Workflow: HuggingFace Open-R1 Reasoning Data Generation
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Data_Engineering, Reasoning |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
End-to-end process for generating synthetic reasoning trace datasets from teacher models using vLLM-powered inference pipelines at scale.
Description
This workflow produces high-quality reasoning trace datasets by running inference on a teacher model (e.g., DeepSeek-R1 or its distilled variants) across large problem sets. The generated data captures step-by-step reasoning (think/answer format) that can be used for subsequent SFT distillation training. Two pipeline approaches are supported: a Distilabel-based pipeline for structured data generation with Ray parallelism, and a high-concurrency async script for direct vLLM API interaction with resumable processing.
Goal: A dataset of reasoning traces on the HuggingFace Hub, with multiple generations per problem for downstream quality filtering.
Scope: From a source problem dataset and a teacher model to a published reasoning trace dataset.
Strategy: Uses vLLM for high-throughput inference, either through Distilabel's pipeline abstraction (with Ray for multi-node) or a custom async client with concurrent request management.
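The think/answer trace format mentioned above can be parsed with a small helper. This is an illustrative sketch: DeepSeek-R1-style outputs wrap reasoning in `<think>...</think>` tags, but the exact delimiters depend on the teacher model and chat template.

```python
"""Split a generated trace into its reasoning and answer parts.

Assumes DeepSeek-R1-style <think>...</think> tags; adjust the
pattern if your teacher model uses different delimiters.
"""
import re

THINK_RE = re.compile(r"<think>(.*?)</think>\s*(.*)", re.DOTALL)


def split_trace(text: str) -> tuple[str, str]:
    """Return (reasoning, answer). If no tags are found, treat the
    whole text as the answer with empty reasoning."""
    m = THINK_RE.search(text)
    if not m:
        return "", text.strip()
    return m.group(1).strip(), m.group(2).strip()
```

Separating the two parts this way is also useful later, when filtering traces by answer correctness while keeping the reasoning intact.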
Usage
Execute this workflow when you need to generate training data for the SFT distillation workflow. This is the data creation step that precedes model training. Use the Distilabel pipeline for structured, reproducible runs with smaller models (single GPU). Use the async generation script for large-scale runs with DeepSeek-R1 (multi-node vLLM serving) or when you need resumable processing for very large datasets.
Execution Steps
Step 1: Infrastructure_Setup
Deploy the vLLM serving infrastructure. For small distilled models, a single GPU suffices with vLLM running in-process. For the full DeepSeek-R1 model, deploy a multi-node vLLM server using Slurm across multiple GPU nodes (e.g., two nodes with 8×H100 GPUs each). Install Distilabel with vLLM support, and optionally the Ray and OpenAI client extras.
Key considerations:
- Model size determines infrastructure needs (a 7B distilled model fits on 1 GPU; the 671B DeepSeek-R1 needs 16+ GPUs)
- vLLM server exposes an OpenAI-compatible API for client interaction
- For multi-node setups, Ray dashboard access can be configured via SSH tunnel
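Before launching a large run, it is worth probing the server's OpenAI-compatible API to confirm the deployment is healthy. A minimal sketch, assuming a `vllm serve` instance on a hypothetical host and the default port 8000:

```python
"""Probe a vLLM OpenAI-compatible server before a large generation run.

Host and port are illustrative; point them at your own deployment.
"""
import json
import urllib.request


def base_url(host: str, port: int = 8000) -> str:
    """Root of the OpenAI-compatible API exposed by `vllm serve`."""
    return f"http://{host}:{port}/v1"


def served_models(host: str, port: int = 8000, timeout: float = 10.0) -> list[str]:
    """Return the model IDs the server reports, or raise if unreachable."""
    with urllib.request.urlopen(f"{base_url(host, port)}/models",
                                timeout=timeout) as resp:
        payload = json.load(resp)
    return [m["id"] for m in payload.get("data", [])]
```

Calling `served_models("your-head-node")` should list the deployed teacher model; a connection error at this stage is far cheaper to debug than a stalled generation job.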
Step 2: Source_Dataset_Selection
Select the input problem dataset from the HuggingFace Hub. The dataset should contain problems with a designated prompt column (e.g., "problem" for NuminaMath). Identify the appropriate column mapping and configure the prompt template that wraps each problem with instructions for the model.
Key considerations:
- The prompt template guides the model's reasoning format (e.g., "put your final answer within \boxed{}")
- Different datasets use different column names for the problem text
- Dataset split and config must be specified for Hub datasets with multiple configurations
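The column mapping and prompt template described above amount to a small wrapping function. A sketch, assuming the boxed-answer instruction quoted earlier and a dataset-specific prompt column (names are illustrative):

```python
"""Wrap a source problem row into the prompt sent to the teacher model.

The template follows the boxed-answer instruction described above;
the column name varies by dataset (e.g., "problem" for NuminaMath).
"""

PROMPT_TEMPLATE = (
    "You will be given a problem. Please reason step by step, "
    "and put your final answer within \\boxed{{}}:\n{problem}"
)


def build_prompt(row: dict, prompt_column: str = "problem") -> str:
    """Fill the template with the problem text from the mapped column."""
    return PROMPT_TEMPLATE.format(problem=row[prompt_column])
```

Swapping datasets then only requires changing `prompt_column`, not the pipeline itself.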
Step 3: Generation_Pipeline_Configuration
Configure the generation pipeline with model-specific parameters: temperature (typically 0.6), top-p sampling (0.95), maximum new tokens (8192-32768), and number of generations per problem (1-4 for diversity). Choose between the Distilabel pipeline (structured, with input batch size and client replicas) or the async script (with concurrency limits and retry budgets).
Key considerations:
- Temperature and top-p control reasoning diversity across generations
- Multiple generations per problem enable downstream quality filtering
- The async script supports up to 1000 concurrent requests with automatic retry
- Distilabel provides group_generations mode for organized multi-generation output
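The knobs listed in this step can be bundled into one config object that both pipeline variants consume. A sketch with the defaults named above (the class and method names are illustrative, not part of either pipeline's API):

```python
"""Generation parameters for the teacher model, with the defaults
described in this step. Names here are illustrative."""
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationConfig:
    temperature: float = 0.6       # typical for R1-style reasoning
    top_p: float = 0.95
    max_new_tokens: int = 8192     # raise toward 32768 for long traces
    num_generations: int = 2       # 1-4 per problem, for diversity

    def as_sampling_kwargs(self) -> dict:
        """Shape expected by OpenAI-style chat.completions requests."""
        return {
            "temperature": self.temperature,
            "top_p": self.top_p,
            "max_tokens": self.max_new_tokens,
            "n": self.num_generations,
        }
```

Centralizing the parameters this way keeps a Distilabel run and an async-script run comparable, since both read from the same config.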
Step 4: Batch_Generation_Execution
Run the generation pipeline across the full dataset. The Distilabel pipeline processes the dataset in batches with Ray-based parallelism, while the async script sends concurrent requests to the vLLM API with progress tracking. Both approaches support resumable processing: Distilabel via its caching mechanism, and the async script via UUID-based deduplication of already-processed examples.
Key considerations:
- Generation is the most time-consuming step and benefits from batched processing
- The async script writes results incrementally to JSONL for crash recovery
- Request timeouts should be generous (600-900 seconds) for long reasoning traces
- Progress is tracked via tqdm with active task count monitoring
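The async script's core loop (bounded concurrency, retries, UUID-based skipping of finished rows, incremental JSONL writes) can be sketched as follows. This is a minimal stand-in, not the actual Open-R1 script: `call_model` represents the real vLLM API request, and the retry budget and concurrency limit are placeholders.

```python
"""Minimal async generation loop: bounded concurrency, per-request
retries, UUID-based resume, and append-only JSONL output for crash
recovery. `call_model` stands in for the real vLLM API request."""
import asyncio
import json
import uuid
from pathlib import Path


def row_uuid(problem: str) -> str:
    """Deterministic ID per problem so reruns can skip finished rows."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, problem))


def load_done_ids(out_path: Path) -> set[str]:
    """IDs already written to the output file from a previous run."""
    if not out_path.exists():
        return set()
    return {json.loads(line)["id"]
            for line in out_path.read_text().splitlines() if line}


async def generate_all(rows, call_model, out_path: Path,
                       max_concurrency: int = 8, max_retries: int = 3):
    done = load_done_ids(out_path)
    sem = asyncio.Semaphore(max_concurrency)   # bound in-flight requests
    write_lock = asyncio.Lock()                # one writer at a time

    async def worker(row):
        rid = row_uuid(row["problem"])
        if rid in done:                        # resumable: skip finished
            return
        async with sem:
            for attempt in range(max_retries):
                try:
                    completion = await call_model(row["problem"])
                    break
                except Exception:
                    if attempt == max_retries - 1:
                        return                 # retry budget exhausted
        async with write_lock:                 # flush per row
            with out_path.open("a") as f:
                f.write(json.dumps({"id": rid, **row,
                                    "generation": completion}) + "\n")

    await asyncio.gather(*(worker(r) for r in rows))
```

The real script adds tqdm progress tracking and much higher concurrency (up to 1000 requests), but the resume semantics are the same: rerunning against an existing output file only processes rows whose IDs are missing.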
Step 5: Dataset_Publishing
Push the generated dataset to the HuggingFace Hub. The Distilabel pipeline produces a Distiset object that can be pushed directly. The async script's JSONL output needs to be loaded and formatted before publishing. The dataset includes all original fields plus the generated reasoning traces, finish reasons, and API metadata.
Key considerations:
- Generated datasets can be made public or private on the Hub
- Multiple generation runs can be combined into a single dataset
- The output preserves original dataset fields alongside generated content
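For the async-script path, loading and formatting the JSONL output before publishing might look like the sketch below. The `datasets` calls and repository name are assumptions to adapt to your run; the loading helpers are plain Python.

```python
"""Load the async script's JSONL output and shape it for the Hub.

The commented `datasets` usage and repo name are illustrative.
"""
import json
from pathlib import Path


def load_jsonl(path: Path) -> list[dict]:
    """Parse one record per line, skipping blank lines."""
    return [json.loads(line)
            for line in path.read_text().splitlines() if line.strip()]


def to_hub_rows(records: list[dict]) -> list[dict]:
    """Keep all original fields; ensure a generation column exists."""
    return [{**r, "generation": r.get("generation", "")} for r in records]


# Hypothetical publishing step (requires the `datasets` library):
# from datasets import Dataset
# rows = to_hub_rows(load_jsonl(Path("generations.jsonl")))
# Dataset.from_list(rows).push_to_hub("your-org/reasoning-traces",
#                                     private=True)
```

Because original fields are carried through unchanged, outputs from multiple generation runs can be concatenated before the final push.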