
Principle:Huggingface Open r1 Synthetic Data Generation

From Leeroopedia



Overview

A data synthesis methodology that uses large teacher models to generate reasoning traces at scale via inference pipelines, producing training data for distillation into smaller student models.

Description

Synthetic data generation for LLM training involves running a powerful teacher model (e.g., DeepSeek-R1) on a set of prompts to produce reasoning traces (chain-of-thought completions). These traces capture the teacher's reasoning process, which can then be used to train smaller models via SFT (distillation).
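To make the distillation step concrete, the sketch below shows how a single teacher reasoning trace might be packed into a prompt/completion pair for SFT. The field names and formatting are illustrative assumptions, not a fixed schema from the open-r1 pipeline.

```python
# Sketch: turning one teacher reasoning trace into an SFT training record.
# The "prompt"/"completion" field names are assumptions for illustration.

def make_sft_record(problem: str, reasoning_trace: str, answer: str) -> dict:
    """Pack a teacher completion into a prompt/completion pair for distillation."""
    return {
        "prompt": problem,
        # The student is trained to reproduce the full chain of thought,
        # not just the final answer.
        "completion": f"{reasoning_trace}\n\nFinal answer: {answer}",
    }

record = make_sft_record(
    "What is 17 * 3?",
    "17 * 3 = 17 * 2 + 17 = 34 + 17 = 51.",
    "51",
)
```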

Key design considerations:

  • Inference infrastructure -- vLLM for efficient batched generation
  • Prompt templating -- mapping dataset columns to generation inputs
  • Generation parameters -- temperature, top_p, max_new_tokens
  • Parallelization -- multiple client replicas for throughput
  • Quality control -- decontamination against evaluation benchmarks
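The templating and generation-parameter considerations above can be sketched as follows. The template text and parameter values are illustrative assumptions; in a real pipeline these would be passed to a vLLM-served teacher model rather than used standalone.

```python
# Sketch of prompt templating and generation parameters. Template wording
# and parameter values are assumptions, not the pipeline's actual settings.

PROMPT_TEMPLATE = "Solve the following problem step by step.\n\nProblem: {problem}"

GENERATION_PARAMS = {
    "temperature": 0.7,      # illustrative sampling settings
    "top_p": 0.95,
    "max_new_tokens": 4096,  # reasoning traces can be long
}

def apply_template(row: dict) -> str:
    """Map dataset columns into the teacher model's expected input format."""
    return PROMPT_TEMPLATE.format(**row)

batch = [{"problem": "Prove that 2 + 2 = 4."}]
prompts = [apply_template(row) for row in batch]
```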

Usage

Use when creating training datasets for reasoning-model distillation, particularly when a large teacher model is available for inference but serving it directly, or training a comparably capable model from scratch, is too expensive.

Theoretical Basis

The distillation data pipeline follows this flow:

source dataset -> prompt template -> teacher model inference -> reasoning traces -> output dataset

pipeline = create_pipeline(teacher_model, prompt_template, generation_params)
for batch in source_dataset:
    formatted = apply_template(batch, prompt_template)
    completions = pipeline.generate(formatted, generation_params)
    store(completions)
output_dataset = collect_all_completions()
decontaminate(output_dataset, eval_benchmarks)

The pipeline begins by selecting a source dataset of prompts (e.g., math problems, coding tasks). Each prompt is formatted through a prompt template that maps dataset columns to the expected input format. The teacher model (served via vLLM) performs inference on batches of formatted prompts, producing reasoning traces that include step-by-step chain-of-thought. All completions are collected into an output dataset, which is then decontaminated by removing any samples that overlap with evaluation benchmarks to prevent data leakage.
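The decontamination step can be sketched as an n-gram overlap filter: any generated sample whose prompt shares a long n-gram with a benchmark item is dropped. The 8-gram threshold and field names are illustrative assumptions, not the pipeline's actual settings.

```python
# Minimal decontamination sketch: remove generated samples whose prompt
# shares a long n-gram with any evaluation benchmark item. The 8-gram
# threshold is an illustrative choice, not a canonical value.

def ngrams(text: str, n: int = 8) -> set:
    """All whitespace-tokenized n-grams of a text, lowercased."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(samples: list[dict],
                  benchmark_prompts: list[str],
                  n: int = 8) -> list[dict]:
    """Keep only samples with no n-gram overlap against the benchmarks."""
    bench = set()
    for p in benchmark_prompts:
        bench |= ngrams(p, n)
    return [s for s in samples if not (ngrams(s["prompt"], n) & bench)]
```

In practice the open-r1 repository describes decontaminating against evaluation sets to prevent leakage; exact matching strategies (n-gram size, normalization) vary between pipelines.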
