Principle: Hugging Face Open R1 Synthetic Data Generation
Overview
A data synthesis methodology that uses large teacher models to generate reasoning traces at scale via inference pipelines, producing training data for distillation into smaller student models.
Description
Synthetic data generation for LLM training involves running a powerful teacher model (e.g., DeepSeek-R1) on a set of prompts to produce reasoning traces (chain-of-thought completions). These traces capture the teacher's reasoning process, which can then be used to train smaller models via SFT (distillation).
Key design considerations:
- Inference infrastructure -- vLLM for efficient batched generation
- Prompt templating -- mapping dataset columns to generation inputs
- Generation parameters -- temperature, top_p, max_new_tokens
- Parallelization -- multiple client replicas for throughput
- Quality control -- decontamination against evaluation benchmarks
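The templating and parameter choices above can be sketched in plain Python. This is an illustrative sketch, not the Open R1 codebase: the names `format_prompt`, `PROMPT_TEMPLATE`, and `GEN_PARAMS` and the parameter values are assumptions chosen for demonstration.

```python
# Hypothetical sketch of prompt templating and generation parameters.
# Parameter values are illustrative, not Open R1's actual settings.
GEN_PARAMS = {
    "temperature": 0.6,      # sampling temperature
    "top_p": 0.95,           # nucleus-sampling cutoff
    "max_new_tokens": 8192,  # reasoning traces can be long
}

# Template whose placeholders correspond to dataset column names.
PROMPT_TEMPLATE = (
    "Solve the following problem step by step.\n\n"
    "Problem: {problem}\n"
)

def format_prompt(row: dict, template: str) -> str:
    """Fill the template from one dataset row, keyed by column name."""
    return template.format(**row)

row = {"problem": "What is 12 * 13?"}
print(format_prompt(row, PROMPT_TEMPLATE))
```

Because the template is keyed by column names, the same pipeline can be pointed at a different source dataset by swapping the template rather than rewriting generation code.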
Usage
Use when creating training datasets for reasoning model distillation, particularly when large teacher models are available but direct training is too expensive.
Theoretical Basis
The distillation data pipeline follows this flow:
source dataset -> prompt template -> teacher model inference -> reasoning traces -> output dataset
pipeline = create_pipeline(model, prompt_template, generation_params)
for batch in source_dataset:
    formatted = apply_template(batch, prompt_template)
    completions = model.generate(formatted, temperature, max_tokens)
    store(completions)
output_dataset = collect_all_completions()
decontaminate(output_dataset, eval_benchmarks)
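The loop above can be made concrete as a minimal, self-contained sketch. A stub function stands in for the vLLM-served teacher; `run_pipeline` and `fake_teacher` are hypothetical names, and real traces would come from actual model inference.

```python
from typing import Callable, Dict, Iterable, List

def run_pipeline(
    rows: Iterable[Dict[str, str]],
    template: str,
    generate: Callable[[List[str]], List[str]],
    batch_size: int = 2,
) -> List[str]:
    """Minimal distillation-data loop: template -> batched generation -> collect."""
    outputs: List[str] = []
    batch: List[str] = []
    for row in rows:
        batch.append(template.format(**row))
        if len(batch) == batch_size:
            outputs.extend(generate(batch))
            batch = []
    if batch:  # flush the final partial batch
        outputs.extend(generate(batch))
    return outputs

# Stub standing in for a teacher model served via vLLM.
def fake_teacher(prompts: List[str]) -> List[str]:
    return [f"<think>...</think> Answer to: {p}" for p in prompts]

rows = [{"problem": f"question {i}"} for i in range(5)]
traces = run_pipeline(rows, "Problem: {problem}", fake_teacher)
print(len(traces))  # one reasoning trace per source prompt
```

In a real run, `generate` would wrap batched inference against the teacher, and multiple client replicas would call it in parallel for throughput.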
The pipeline begins by selecting a source dataset of prompts (e.g., math problems, coding tasks). Each prompt is formatted through a prompt template that maps dataset columns to the expected input format. The teacher model (served via vLLM) performs inference on batches of formatted prompts, producing reasoning traces that include step-by-step chain-of-thought. All completions are collected into an output dataset, which is then decontaminated by removing any samples that overlap with evaluation benchmarks to prevent data leakage.
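The decontamination step can be illustrated with a generic n-gram overlap check: any sample whose prompt shares an n-gram with an evaluation benchmark is dropped. This is a common approach to decontamination in general, not necessarily Open R1's exact routine, and the function names are illustrative.

```python
from typing import Dict, List, Set

def ngrams(text: str, n: int = 8) -> Set[str]:
    """All word-level n-grams of a lowercased text."""
    toks = text.lower().split()
    return {" ".join(toks[i : i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(
    samples: List[Dict[str, str]],
    benchmark_prompts: List[str],
    n: int = 8,
) -> List[Dict[str, str]]:
    """Drop samples whose prompt shares any n-gram with a benchmark prompt."""
    bench: Set[str] = set()
    for p in benchmark_prompts:
        bench |= ngrams(p, n)
    return [s for s in samples if not (ngrams(s["prompt"], n) & bench)]

samples = [
    {"prompt": "what is the capital of france today"},
    {"prompt": "compute two plus two"},
]
benchmarks = ["what is the capital of france"]
clean = decontaminate(samples, benchmarks, n=3)  # n=3 for this tiny demo
print(len(clean))  # the overlapping sample is removed
```

The n-gram length trades precision for recall: short n-grams remove more borderline overlaps, while longer ones only catch near-verbatim benchmark leakage.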