
Workflow: Hugging Face Open-R1 Reasoning Data Generation

From Leeroopedia


Knowledge Sources
Domains LLMs, Data_Engineering, Reasoning
Last Updated 2026-02-08 00:00 GMT

Overview

End-to-end process for generating synthetic reasoning trace datasets from teacher models using vLLM-powered inference pipelines at scale.

Description

This workflow produces high-quality reasoning trace datasets by running inference on a teacher model (e.g., DeepSeek-R1 or its distilled variants) across large problem sets. The generated data captures step-by-step reasoning (think/answer format) that can be used for subsequent SFT distillation training. Two pipeline approaches are supported: a Distilabel-based pipeline for structured data generation with Ray parallelism, and a high-concurrency async script for direct vLLM API interaction with resumable processing.

Goal: A dataset of reasoning traces on the HuggingFace Hub, with multiple generations per problem for downstream quality filtering.

Scope: From a source problem dataset and a teacher model to a published reasoning trace dataset.

Strategy: Uses vLLM for high-throughput inference, either through Distilabel's pipeline abstraction (with Ray for multi-node) or a custom async client with concurrent request management.

Usage

Execute this workflow when you need to generate training data for the SFT distillation workflow. This is the data creation step that precedes model training. Use the Distilabel pipeline for structured, reproducible runs with smaller models (single GPU). Use the async generation script for large-scale runs with DeepSeek-R1 (multi-node vLLM serving) or when you need resumable processing for very large datasets.

Execution Steps

Step 1: Infrastructure_Setup

Deploy the vLLM serving infrastructure. For small distilled models, a single GPU suffices with vLLM running in-process. For the full DeepSeek-R1 model, deploy a multi-node vLLM server using Slurm across multiple GPU nodes (e.g., 2 nodes of 8xH100s). Install Distilabel with vLLM and optionally Ray/OpenAI client support.

Key considerations:

  • Model size determines infrastructure needs (7B on 1 GPU, 671B on 16+ GPUs)
  • vLLM server exposes an OpenAI-compatible API for client interaction
  • For multi-node setups, Ray dashboard access can be configured via SSH tunnel
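Once the server is up, clients talk to it through the OpenAI-compatible API. The stdlib-only sketch below shows the shape of that interaction; the host, port, and model name are placeholders for your deployment, not values from this workflow.

```python
import json
from urllib.request import Request, urlopen

def vllm_base_url(host: str = "localhost", port: int = 8000) -> str:
    """Base URL of the OpenAI-compatible API exposed by a vLLM server."""
    return f"http://{host}:{port}/v1"

def chat_request(base_url: str, model: str, prompt: str) -> Request:
    """Build a POST request against the /chat/completions route."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return Request(base_url + "/chat/completions", data=payload,
                   headers={"Content-Type": "application/json"})

# Against a live server you would then do something like:
# with urlopen(chat_request(vllm_base_url(), "deepseek-r1", "1+1?")) as resp:
#     print(json.load(resp))
```

The same endpoint serves both the Distilabel pipeline and the async script, which is what lets the two approaches share one serving deployment.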

Step 2: Source_Dataset_Selection

Select the input problem dataset from the HuggingFace Hub. The dataset should contain problems with a designated prompt column (e.g., "problem" for NuminaMath). Identify the appropriate column mapping and configure the prompt template that wraps each problem with instructions for the model.

Key considerations:

  • The prompt template guides the model's reasoning format (e.g., "put your final answer within \boxed{}")
  • Different datasets use different column names for the problem text
  • Dataset split and config must be specified for Hub datasets with multiple configurations
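The column mapping and prompt template are plain string handling. A minimal sketch, assuming a NuminaMath-style "problem" column; the exact template wording is dataset- and model-dependent:

```python
# Hypothetical template; the boxed-answer instruction mirrors the one
# described above, but the exact wording varies by dataset and model.
PROMPT_TEMPLATE = (
    "{problem}\n\n"
    "Please reason step by step, and put your final answer within \\boxed{{}}."
)

def build_prompt(row: dict, prompt_column: str = "problem") -> str:
    """Wrap one source problem with the generation instructions."""
    return PROMPT_TEMPLATE.format(problem=row[prompt_column])
```

Swapping datasets then reduces to changing `prompt_column` and, if needed, the template text.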

Step 3: Generation_Pipeline_Configuration

Configure the generation pipeline with model-specific parameters: temperature (typically 0.6), top-p sampling (0.95), maximum new tokens (8192-32768), and number of generations per problem (1-4 for diversity). Choose between the Distilabel pipeline (structured, with input batch size and client replicas) or the async script (with concurrency limits and retry budgets).

Key considerations:

  • Temperature and top-p control reasoning diversity across generations
  • Multiple generations per problem enable downstream quality filtering
  • The async script supports up to 1000 concurrent requests with automatic retry
  • Distilabel provides group_generations mode for organized multi-generation output
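The parameters above can be collected in one place so both pipeline variants read the same configuration. A sketch with the defaults discussed in this step; the field names are illustrative, not the exact knobs of either script:

```python
from dataclasses import dataclass

@dataclass
class GenerationConfig:
    # Defaults mirror the values discussed above; tune per model.
    temperature: float = 0.6
    top_p: float = 0.95
    max_new_tokens: int = 16384      # typical range: 8192-32768
    num_generations: int = 2         # samples per problem, for filtering
    max_concurrency: int = 1000      # async-script path only

    def sampling_params(self) -> dict:
        """Parameters in the shape the OpenAI-compatible API expects."""
        return {
            "temperature": self.temperature,
            "top_p": self.top_p,
            "max_tokens": self.max_new_tokens,
            "n": self.num_generations,
        }
```

Keeping the sampling parameters in one dataclass makes runs reproducible: the whole configuration can be logged alongside the output dataset.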

Step 4: Batch_Generation_Execution

Run the generation pipeline across the full dataset. The Distilabel pipeline processes the dataset in batches with Ray-based parallelism, while the async script sends concurrent requests to the vLLM API with progress tracking. Both approaches support resumable processing: Distilabel via its caching mechanism, and the async script via UUID-based deduplication of already-processed examples.

Key considerations:

  • Generation is the most time-consuming step and benefits from batched processing
  • The async script writes results incrementally to JSONL for crash recovery
  • Request timeouts should be generous (600-900 seconds) for long reasoning traces
  • Progress is tracked via tqdm with active task count monitoring
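The resumable-processing pattern used by the async script can be sketched with stdlib `asyncio`: a deterministic per-example ID, a scan of the existing JSONL to skip finished rows, a semaphore for the concurrency cap, and an incremental append per result. This is an illustrative skeleton, not the script itself; the `generate` coroutine stands in for the real vLLM API call.

```python
import asyncio
import hashlib
import json
from pathlib import Path

def example_uid(problem: str) -> str:
    # Deterministic ID so a rerun can skip already-processed rows.
    return hashlib.sha256(problem.encode()).hexdigest()[:16]

def load_done_uids(out_path: Path) -> set:
    """Scan the JSONL written so far to find completed examples."""
    if not out_path.exists():
        return set()
    return {json.loads(line)["uid"]
            for line in out_path.read_text().splitlines() if line.strip()}

async def run_generation(problems, generate, out_path: Path,
                         max_concurrency: int = 8):
    done = load_done_uids(out_path)
    sem = asyncio.Semaphore(max_concurrency)

    async def worker(problem: str):
        uid = example_uid(problem)
        if uid in done:
            return
        done.add(uid)
        async with sem:
            trace = await generate(problem)  # the vLLM API call goes here
        # Append immediately so a crash loses at most the in-flight requests.
        with out_path.open("a") as f:
            f.write(json.dumps({"uid": uid, "problem": problem,
                                "generation": trace}) + "\n")

    await asyncio.gather(*(worker(p) for p in problems))
```

Rerunning `run_generation` on the same output file processes only the examples that are not yet in the JSONL, which is the crash-recovery behavior described above.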

Step 5: Dataset_Publishing

Push the generated dataset to the HuggingFace Hub. The Distilabel pipeline produces a Distiset object that can be pushed directly. The async script's JSONL output needs to be loaded and formatted before publishing. The dataset includes all original fields plus the generated reasoning traces, finish reasons, and API metadata.

Key considerations:

  • Generated datasets can be made public or private on the Hub
  • Multiple generation runs can be combined into a single dataset
  • The output preserves original dataset fields alongside generated content
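Loading the async script's JSONL and combining multiple runs is straightforward; a sketch, assuming each record carries the deterministic `uid` used for deduplication (the Hub repo name is a placeholder):

```python
import json
from pathlib import Path

def load_jsonl(path: Path) -> list:
    """Read the incrementally written JSONL back into records."""
    return [json.loads(line)
            for line in path.read_text().splitlines() if line.strip()]

def merge_runs(*paths: Path) -> list:
    """Combine several generation runs, keeping the first copy of each uid."""
    seen, rows = set(), []
    for path in paths:
        for record in load_jsonl(path):
            if record["uid"] not in seen:
                seen.add(record["uid"])
                rows.append(record)
    return rows

# With the `datasets` library available, publishing is then roughly:
# from datasets import Dataset
# Dataset.from_list(merge_runs(Path("run1.jsonl"), Path("run2.jsonl"))) \
#        .push_to_hub("your-org/your-dataset", private=True)
```

Because the merge keys on `uid`, overlapping runs (e.g. a restarted job plus its original) collapse to one row per problem before publishing.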

Execution Diagram

GitHub URL

Workflow Repository