
Workflow:HuggingFace Open R1 SFT Distillation

From Leeroopedia


Knowledge Sources
Domains: LLMs, Fine_Tuning, Reasoning
Last Updated: 2026-02-08 00:00 GMT

Overview

End-to-end process for supervised fine-tuning (SFT) of language models on reasoning trace datasets to reproduce DeepSeek-R1 distilled model capabilities.

Description

This workflow implements the first step of the Open R1 reproduction plan: distilling reasoning capabilities from a teacher model (DeepSeek-R1) into a smaller student model via supervised fine-tuning. The process takes a base model (e.g., Qwen2.5-Math-7B) and trains it on a curated dataset of reasoning traces (e.g., Mixture-of-Thoughts) that contain step-by-step solutions with think/answer structure. The training uses DeepSpeed ZeRO-3 for distributed training across multiple GPUs, with configurable chat templates, learning rate schedules, and gradient checkpointing for memory efficiency.

Goal: A fine-tuned model that can generate structured reasoning traces matching or exceeding DeepSeek-R1-Distill performance on math, coding, and science benchmarks.

Scope: From a base model and reasoning dataset to a saved, Hub-published model with evaluation metrics.

Strategy: Uses HuggingFace TRL's SFTTrainer with Accelerate-based distributed training, applying chat template formatting to align the model with the expected input/output structure.
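
For concreteness, a single training example in the conversational format described above might look like the following sketch. The exact tag names and column layout are dataset-specific; this is an illustration, not a verbatim Mixture-of-Thoughts record.

```python
# Illustrative sample (not a verbatim dataset record): one conversation
# with the reasoning trace wrapped in <think> tags before the final answer.
sample = {
    "messages": [
        {"role": "user", "content": "What is 17 * 24?"},
        {
            "role": "assistant",
            "content": (
                "<think>\n"
                "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.\n"
                "</think>\n"
                "The answer is 408."
            ),
        },
    ]
}
```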

Usage

Execute this workflow when you have a curated reasoning trace dataset (instruction-tuning style with think/answer structure) and need to create a model that can reason step by step. This is appropriate when you have access to high-quality reasoning data from a teacher model and want to transfer those capabilities to a smaller or different base model. The typical hardware requirement is a single node of 8× H100 GPUs (80 GB each).

Execution Steps

Step 1: Environment_Setup

Prepare the Python environment with the required dependencies. This involves creating a virtual environment, installing vLLM and FlashAttention for efficient inference, and installing the open-r1 package with development dependencies. Log into HuggingFace Hub and Weights & Biases for model publishing and experiment tracking.

Key considerations:

  • Requires CUDA 12.4 for compatibility with vLLM binaries
  • PyTorch v2.6.0 must be used (installed with vLLM)
  • Git LFS is needed for pushing models to the Hub
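
The installation itself is shell-driven (virtual environment plus vLLM, FlashAttention, and the open-r1 package), so only the two authentication calls are sketched here in Python; both APIs are standard, but interactive token prompts are an assumption about your setup:

```python
# Minimal sketch of the authentication step; the package installs
# (vLLM, flash-attn, open-r1) happen beforehand via shell commands.
from huggingface_hub import login
import wandb

login()        # prompts for (or reads) a HuggingFace token; needed for Hub pushes
wandb.login()  # enables Weights & Biases experiment tracking
```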

Step 2: Configuration_Preparation

Select or create a YAML configuration file specifying the base model, dataset, training hyperparameters, and infrastructure settings. The configuration defines the model to fine-tune, the dataset to train on, the chat template for formatting, and the Accelerate config for distributed training (DDP, FSDP, or DeepSpeed ZeRO-2/ZeRO-3).

Key considerations:

  • Chat template must match the target model family (ChatML for most, custom for Qwen/Llama)
  • EOS token must be aligned with the chat template (e.g., <|im_end|> for Qwen models)
  • Accelerate config determines the parallelism strategy (ZeRO-3 recommended for 7B+ models)
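
The fields a YAML recipe sets map onto TRL's SFTConfig; the sketch below mirrors that mapping. Every value is illustrative rather than the official Open R1 recipe, and exact field names can vary across TRL versions:

```python
# Sketch of a training configuration in Python, mirroring what a YAML
# recipe would specify. Values are illustrative, not the official recipe.
from trl import SFTConfig

cfg = SFTConfig(
    output_dir="data/Qwen2.5-Math-7B-SFT",        # hypothetical path
    learning_rate=4.0e-5,
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,                # global batch = 2 * 8 * num_gpus
    gradient_checkpointing=True,
    bf16=True,
    logging_steps=5,
    save_strategy="steps",
    save_steps=100,
    push_to_hub=True,
    hub_model_id="your-org/Qwen2.5-Math-7B-SFT",  # hypothetical repo
)
```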

Step 3: Dataset_Loading

Load the reasoning trace dataset from the HuggingFace Hub. The system supports both single datasets and weighted mixtures of multiple datasets. Column selection ensures only relevant fields (e.g., messages column with conversation turns) are kept. The dataset is expected to contain pre-formatted chat messages with reasoning traces.

Key considerations:

  • Dataset must have a messages column with conversation-format entries
  • Dataset mixtures allow blending multiple sources with controlled proportions
  • A test split can be automatically carved out from the mixture
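
A minimal loading sketch follows; the dataset name matches the Mixture-of-Thoughts release referenced above, but the config name ("all"), the 10% test carve-out, and the seed are illustrative assumptions:

```python
# Sketch of dataset loading with column selection and a test split.
from datasets import load_dataset

ds = load_dataset("open-r1/Mixture-of-Thoughts", "all", split="train")
ds = ds.select_columns(["messages"])              # keep only the chat turns
splits = ds.train_test_split(test_size=0.1, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
```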

Step 4: Model_and_Tokenizer_Loading

Load the base model and tokenizer from the HuggingFace Hub. The model is loaded with flash-attention enabled and appropriate dtype settings (bfloat16). If the base model lacks a chat template, ChatML format is automatically applied. The tokenizer's chat template can be overridden via the configuration.

Key considerations:

  • Flash-attention-2 is recommended for training efficiency
  • Models without a chat template get ChatML applied automatically
  • Gradient checkpointing is enabled to reduce memory usage
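
A sketch of this step, assuming Qwen2.5-Math-7B as the base model; the ChatML fallback template is abbreviated to its core loop rather than the full template open-r1 applies:

```python
# Sketch of model/tokenizer loading with the settings described above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Math-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # matches bf16 training
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
    use_cache=False,                          # incompatible with gradient checkpointing
)

# Fall back to a (simplified) ChatML template if the base model lacks one.
if tokenizer.chat_template is None:
    tokenizer.chat_template = (
        "{% for m in messages %}"
        "<|im_start|>{{ m['role'] }}\n{{ m['content'] }}<|im_end|>\n"
        "{% endfor %}"
    )
```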

Step 5: SFT_Training

Launch the SFTTrainer to fine-tune the model on the formatted dataset. Training uses the Accelerate launcher with the specified distributed training configuration. The trainer handles gradient accumulation, mixed-precision training, logging to Weights & Biases, and periodic checkpoint saving. Training can resume from the last checkpoint if interrupted.

Key considerations:

  • Global batch size should remain constant when scaling GPUs (adjust per-device batch size or gradient accumulation)
  • The Liger kernel can be enabled for additional training efficiency
  • Callbacks can push per-checkpoint revisions to the Hub for evaluation
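
A sketch of the launch, reusing the cfg, model, tokenizer, and datasets from the previous steps. Note that the keyword for the tokenizer argument differs across TRL versions, and the script must be run under `accelerate launch` for the distributed configuration to take effect:

```python
# Sketch of the training launch; run under `accelerate launch` so the
# distributed configuration (e.g., DeepSpeed ZeRO-3) is applied.
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=cfg,                    # the SFTConfig from Step 2
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    processing_class=tokenizer,  # `tokenizer=` on older TRL versions
)
trainer.train()                  # pass resume_from_checkpoint=True to resume a run
```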

Step 6: Model_Saving_and_Publishing

Save the trained model and tokenizer to the output directory. The generation config is aligned with the tokenizer's EOS token to prevent unbounded generation during inference. A model card is automatically created, and the model is pushed to the HuggingFace Hub if configured to do so.

Key considerations:

  • The KV cache is re-enabled after training for fast inference
  • Model card includes training metadata and dataset references
  • Hub publishing creates a model repository under the configured Hub model ID
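
The same operations in sketch form, reusing names from the earlier steps; the explicit use_cache and EOS alignment lines mirror the description above rather than any single open-r1 code path:

```python
# Sketch of the save-and-publish step.
model.config.use_cache = True                 # re-enable the KV cache for inference
model.generation_config.eos_token_id = tokenizer.eos_token_id  # stop generation at EOS

trainer.save_model(cfg.output_dir)            # weights + config
tokenizer.save_pretrained(cfg.output_dir)

if cfg.push_to_hub:
    trainer.push_to_hub()                     # creates/updates the repo at hub_model_id
```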

Step 7: Evaluation

Optionally evaluate the trained model on standard benchmarks (AIME 2024, MATH-500, GPQA Diamond, LiveCodeBench) using LightEval with the vLLM backend. Evaluation uses sampling with configurable temperature and generates multiple responses per query to estimate pass@1 accuracy.

Key considerations:

  • Evaluation is triggered by benchmark callbacks or run separately via scripts
  • Large models (30B+) require tensor parallelism for evaluation
  • Benchmark results are logged and can be compared against DeepSeek baselines
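
LightEval itself is launched from the command line, but the sampling-based metric is easy to state: with n responses per query and c of them correct, pass@1 is simply c/n, the k = 1 case of the standard unbiased pass@k estimator (Chen et al., 2021). A small illustrative helper, not LightEval's implementation:

```python
# Unbiased pass@k estimator; pass@1 reduces to c/n per query.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n samples per query, c correct among them; estimate pass@k."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g., 32 sampled solutions per problem (counts below are hypothetical)
correct_counts = [12, 0, 32, 5]
print(sum(pass_at_k(32, c, k=1) for c in correct_counts) / len(correct_counts))
```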

Execution Diagram

GitHub URL

Workflow Repository