Workflow: AllenAI open-instruct Tulu 3 Full Post-Training
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Post_Training, Fine_Tuning, Preference_Optimization, RLVR |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
End-to-end three-stage post-training pipeline (SFT, DPO, RLVR) for reproducing Tulu 3 instruction-following models from base pretrained LLMs.
Description
This workflow documents the complete Tulu 3 post-training recipe that transforms a base pretrained language model into a high-quality instruction-following model. The pipeline consists of three sequential stages: (1) Supervised Fine-Tuning on a curated instruction mixture, (2) Direct Preference Optimization on human preference data, and (3) Reinforcement Learning with Verifiable Rewards on tasks with ground-truth answers. Each stage produces a checkpoint that becomes the starting model for the next stage. The pipeline has been validated for Llama 3.1 (8B, 70B, 405B) and OLMo 2 (7B, 13B, 32B) model families.
This is the master workflow that orchestrates the individual SFT, DPO, and GRPO workflows into a complete production pipeline.
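The checkpoint chaining described above can be sketched as a toy driver. `run_stage` and `post_train` are hypothetical names for illustration only; in the real pipeline each stage is a separate multi-node job (finetune.py, the DPO trainer, grpo_fast.py), not a Python function call.

```python
# Illustrative sketch of three-stage checkpoint chaining (hypothetical API;
# the actual stages are launched as independent cluster jobs).

def run_stage(stage_name: str, start_checkpoint: str) -> str:
    """Placeholder for launching one training stage; returns its output checkpoint tag."""
    return f"{start_checkpoint}+{stage_name}"

def post_train(base_model: str) -> str:
    checkpoint = base_model
    # Each stage starts from the previous stage's output checkpoint.
    for stage in ("sft", "dpo", "rlvr"):
        checkpoint = run_stage(stage, checkpoint)
    return checkpoint

print(post_train("llama-3.1-8b"))  # llama-3.1-8b+sft+dpo+rlvr
```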
Usage
Execute this workflow when you want to reproduce a complete Tulu 3 model from a base pretrained checkpoint, or when building a custom post-training pipeline that follows the same three-stage pattern. You need access to the Tulu 3 data mixtures (SFT mixture, preference mixture, RLVR prompts) and sufficient GPU compute for multi-node training.
Execution Steps
Step 1: Infrastructure_Preparation
Set up the compute environment for all three training stages. This includes installing dependencies, building Docker images, and ensuring access to the required datasets on HuggingFace. For AI2 Beaker runs, the build_image_and_launch.sh script automates Docker build and job submission.
Key considerations:
- All three stages share the same codebase and Docker image
- The repository must be at a clean git commit before building the Docker image
- GPU requirements vary by stage: SFT typically uses 8 nodes, DPO uses 4 nodes, RLVR uses 1-2 nodes
- The mason.py CLI handles Beaker experiment submission with cluster, priority, and budget settings
Step 2: SFT_Training
Run supervised fine-tuning on the base pretrained model using the Tulu 3 SFT data mixture. This stage uses finetune.py with Accelerate and DeepSpeed to train the model on instruction-following conversations for 2 epochs. The SFT mixture includes diverse datasets covering coding, math, general knowledge, safety, and multilingual tasks.
Key considerations:
- Uses the allenai/tulu-3-sft-mixture dataset
- Training typically runs for 2 epochs with linear LR scheduling
- Effective batch size must be preserved when scaling to different GPU counts
- The output SFT checkpoint feeds into both the DPO stage and the reward model training
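The effective-batch-size constraint above is simple arithmetic: effective batch size = per-device batch size × number of GPUs × gradient-accumulation steps, so when the GPU count changes, accumulation must change inversely. The target of 128 below is an illustrative value, not a confirmed hyperparameter for every model size.

```python
def grad_accum_steps(effective_batch_size: int,
                     per_device_batch_size: int,
                     num_gpus: int) -> int:
    """Gradient-accumulation steps needed to preserve a target effective batch size."""
    per_step_total = per_device_batch_size * num_gpus
    if effective_batch_size % per_step_total != 0:
        raise ValueError("effective batch size must be divisible by per-step total")
    return effective_batch_size // per_step_total

# Scaling from 8 nodes (64 GPUs) to 4 nodes (32 GPUs) doubles accumulation,
# keeping the effective batch size (assumed 128 here) unchanged.
print(grad_accum_steps(128, 1, 64))  # 2
print(grad_accum_steps(128, 1, 32))  # 4
```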
Step 3: Reward_Model_Training
Train a reward model from the SFT checkpoint on preference data. This reward model scores response quality and is used during the RLVR stage. The model adds a scalar value head to the SFT architecture and is trained with pairwise ranking loss.
Key considerations:
- Uses the same preference mixture as DPO training
- Produces a separate reward model checkpoint (e.g., Llama-3.1-Tulu-3-8B-RM)
- For GRPO with verifiable rewards, the reward model multiplier is typically set to 0.0
- The reward model is optional if using purely verifiable rewards in GRPO
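The pairwise ranking loss mentioned above is, in its standard Bradley-Terry form, the negative log-sigmoid of the reward margin between chosen and rejected responses. A minimal numeric sketch (the scalar rewards here are toy values):

```python
import math

def pairwise_ranking_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the reward model scores the chosen response higher.
print(round(pairwise_ranking_loss(2.0, 0.0), 4))  # 0.1269
print(round(pairwise_ranking_loss(0.0, 0.0), 4))  # 0.6931 (= log 2, uninformative model)
```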
Step 4: DPO_Training
Run Direct Preference Optimization on the SFT checkpoint using the Tulu 3 preference mixture. This stage aligns the model with human preferences by training on chosen/rejected response pairs using the dpo_norm loss function (length-normalized DPO).
Key considerations:
- Takes the SFT checkpoint as both the training model and reference model
- Uses the allenai/llama-3.1-tulu-3-8b-preference-mixture dataset
- DPO typically trains for 1 epoch with a lower learning rate than SFT
- The DPO checkpoint feeds into the RLVR stage as the starting policy
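A minimal sketch of the length-normalized DPO objective, assuming `dpo_norm` replaces the usual summed sequence log-probabilities with per-token means so that response length does not dominate the implicit reward; the `beta` value and toy log-probabilities are illustrative, not the recipe's actual hyperparameters.

```python
import math

def dpo_norm_loss(policy_logps_chosen, policy_logps_rejected,
                  ref_logps_chosen, ref_logps_rejected,
                  beta: float = 0.1) -> float:
    """Length-normalized DPO: per-token mean log-probs replace sequence sums."""
    mean = lambda xs: sum(xs) / len(xs)
    chosen_ratio = mean(policy_logps_chosen) - mean(ref_logps_chosen)
    rejected_ratio = mean(policy_logps_rejected) - mean(ref_logps_rejected)
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy example: a 3-token chosen response vs. a 5-token rejected response;
# the per-token mean makes the two comparable despite the length difference.
loss = dpo_norm_loss([-0.2, -0.1, -0.3], [-0.6, -0.7, -0.5, -0.6, -0.8],
                     [-0.3, -0.2, -0.4], [-0.5, -0.6, -0.4, -0.5, -0.7])
print(round(loss, 4))
```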
Step 5: RLVR_Training
Run reinforcement learning with verifiable rewards on the DPO checkpoint. This stage uses GRPO (grpo_fast.py) with tasks that have ground-truth verifiable answers, including math problems and instruction-following constraints. The model generates multiple responses per prompt, scores them with verifiable reward functions, and updates the policy using the group-relative advantage.
Key considerations:
- Takes the DPO checkpoint as the starting policy and reference model
- Uses the allenai/RLVR-GSM-MATH-IF-Mixed-Constraints dataset for math and instruction following
- vLLM inference workers generate responses asynchronously alongside training
- This is the most computationally intensive stage but produces the final model
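The group-relative advantage can be sketched as within-group standardization: each response's reward is compared against the mean and standard deviation of its own prompt's sample group. Whether grpo_fast.py uses the population or sample standard deviation, and the exact epsilon, are assumptions here.

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: standardize each reward within its prompt's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; sample std is a variant
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, scored by a verifiable reward
# (1.0 = ground-truth answer matched, 0.0 = not). Correct responses get
# positive advantage, incorrect ones negative, summing to zero.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```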
Step 6: Evaluation_and_Release
Evaluate the final model across multiple benchmarks (MMLU, GSM8K, MATH, BBH, IFEval, AlpacaEval) and release the checkpoint to HuggingFace Hub. Evaluation can be performed using the OLMES evaluation framework or the built-in evaluation scripts via Beaker.
Key considerations:
- Evaluation is typically run on Beaker using submit_eval_jobs.py
- The OLMES framework provides standardized benchmark evaluation
- Results should be compared against reference Tulu 3 checkpoints for validation
- The final model is uploaded to HuggingFace with appropriate metadata