Workflow:Allenai Open instruct Tulu3 Full Post Training

From Leeroopedia


Knowledge Sources
Domains LLMs, Post_Training, Fine_Tuning, Preference_Optimization, RLVR
Last Updated 2026-02-07 00:00 GMT

Overview

End-to-end three-stage post-training pipeline (SFT, DPO, RLVR) for reproducing Tulu 3 instruction-following models from base pretrained LLMs.

Description

This workflow documents the complete Tulu 3 post-training recipe that transforms a base pretrained language model into a high-quality instruction-following model. The pipeline consists of three sequential stages: (1) Supervised Fine-Tuning on a curated instruction mixture, (2) Direct Preference Optimization on human preference data, and (3) Reinforcement Learning with Verifiable Rewards on tasks with ground-truth answers. Each stage produces a checkpoint that becomes the starting model for the next stage. The pipeline has been validated on the Llama 3.1 (8B, 70B, 405B) and OLMo 2 (7B, 13B, 32B) model families.

This is the master workflow that orchestrates the individual SFT, DPO, and GRPO workflows into a complete production pipeline.

Usage

Execute this workflow when you want to reproduce a complete Tulu 3 model from a base pretrained checkpoint, or when building a custom post-training pipeline that follows the same three-stage pattern. You need access to the Tulu 3 data mixtures (SFT mixture, preference mixture, RLVR prompts) and sufficient GPU compute for multi-node training.

Execution Steps

Step 1: Infrastructure_Preparation

Set up the compute environment for all three training stages. This includes installing dependencies, building Docker images, and ensuring access to the required datasets on HuggingFace. For AI2 Beaker runs, the build_image_and_launch.sh script automates Docker build and job submission.

Key considerations:

  • All three stages share the same codebase and Docker image
  • The repository must be at a clean git commit before building the Docker image
  • GPU requirements vary by stage: SFT typically uses 8 nodes, DPO uses 4 nodes, RLVR uses 1-2 nodes
  • The mason.py CLI handles Beaker experiment submission with cluster, priority, and budget settings
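Because the Docker image must be built from a clean git commit, it helps to fail fast on a dirty working tree before kicking off a build. A minimal sketch (the helper names here are ours, not functions in open-instruct):

```python
import subprocess


def working_tree_is_clean(porcelain_output: str) -> bool:
    """A clean tree produces empty `git status --porcelain` output."""
    return porcelain_output.strip() == ""


def assert_clean_repo() -> None:
    # `git status --porcelain` prints one line per modified or untracked file.
    out = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    ).stdout
    if not working_tree_is_clean(out):
        raise RuntimeError("Commit or stash changes before building the Docker image.")
```

Running `assert_clean_repo()` at the top of a launch script mirrors the check that build_image_and_launch.sh relies on.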

Step 2: SFT_Training

Run supervised fine-tuning on the base pretrained model using the Tulu 3 SFT data mixture. This stage uses finetune.py with Accelerate and DeepSpeed to train the model on instruction-following conversations for 2 epochs. The SFT mixture includes diverse datasets covering coding, math, general knowledge, safety, and multilingual tasks.

Key considerations:

  • Uses the allenai/tulu-3-sft-mixture dataset
  • Training typically runs for 2 epochs with linear LR scheduling
  • Effective batch size must be preserved when scaling to different GPU counts
  • The output SFT checkpoint feeds into both the DPO stage and the reward model training
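The effective-batch-size rule can be made concrete: effective batch size = per-device batch × number of GPUs × gradient-accumulation steps, so when the GPU count changes, gradient accumulation must change inversely. A small sketch (the batch numbers in the example are illustrative, not the published recipe values):

```python
def per_device_grad_accum(effective_bs: int, per_device_bs: int, num_gpus: int) -> int:
    """Gradient-accumulation steps needed to keep the effective batch size fixed.

    effective_bs = per_device_bs * num_gpus * grad_accum
    """
    total_per_step = per_device_bs * num_gpus
    if effective_bs % total_per_step:
        raise ValueError("effective batch size must be divisible by per_device * num_gpus")
    return effective_bs // total_per_step


# Example: per-device batch 1 with effective batch 128 on 64 GPUs (8 nodes x 8)
# needs grad_accum = 2; dropping to 32 GPUs doubles it to 4.
```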

Step 3: Reward_Model_Training

Train a reward model from the SFT checkpoint on preference data. This reward model scores response quality and is used during the RLVR stage. The model adds a scalar value head to the SFT architecture and is trained with pairwise ranking loss.

Key considerations:

  • Uses the same preference mixture as DPO training
  • Produces a separate reward model checkpoint (e.g., Llama-3.1-Tulu-3-8B-RM)
  • For GRPO with verifiable rewards, the reward model multiplier is typically set to 0.0
  • The reward model is optional if using purely verifiable rewards in GRPO
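The pairwise ranking loss is the standard Bradley-Terry objective: the chosen response's scalar reward should exceed the rejected one's. A plain-Python sketch, with scalar floats standing in for the value-head outputs:

```python
import math


def pairwise_ranking_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(x)) = log(1 + exp(-x)), written stably with log1p
    return math.log1p(math.exp(-margin))
```

When the two rewards are equal the loss is log 2; it shrinks toward zero as the chosen response is scored increasingly higher than the rejected one.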

Step 4: DPO_Training

Run Direct Preference Optimization on the SFT checkpoint using the Tulu 3 preference mixture. This stage aligns the model with human preferences by training on chosen/rejected response pairs using the dpo_norm loss function (length-normalized DPO).

Key considerations:

  • Takes the SFT checkpoint as both the training model and reference model
  • Uses the allenai/llama-3.1-tulu-3-8b-preference-mixture dataset
  • DPO typically trains for 1 epoch with a lower learning rate than SFT
  • The DPO checkpoint feeds into the RLVR stage as the starting policy
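The dpo_norm variant divides each sequence log-probability by its response length before applying the usual DPO sigmoid loss, which removes the length bias of vanilla DPO. A scalar sketch (the beta value below is illustrative; check the repo's configured default):

```python
import math


def dpo_norm_loss(
    policy_chosen_logp: float, policy_rejected_logp: float,
    ref_chosen_logp: float, ref_rejected_logp: float,
    len_chosen: int, len_rejected: int,
    beta: float = 5.0,
) -> float:
    """Length-normalized DPO: per-token log-probs feed the standard DPO loss."""
    pi_margin = policy_chosen_logp / len_chosen - policy_rejected_logp / len_rejected
    ref_margin = ref_chosen_logp / len_chosen - ref_rejected_logp / len_rejected
    logits = beta * (pi_margin - ref_margin)
    return math.log1p(math.exp(-logits))  # -log(sigmoid(logits))
```

At initialization the policy equals the reference (the SFT checkpoint serves as both), so the margins cancel and the loss starts at log 2.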

Step 5: RLVR_Training

Run reinforcement learning with verifiable rewards on the DPO checkpoint. This stage uses GRPO (grpo_fast.py) with tasks that have ground-truth verifiable answers, including math problems and instruction-following constraints. The model generates multiple responses per prompt, scores them with verifiable reward functions, and updates the policy using the group-relative advantage.

Key considerations:

  • Takes the DPO checkpoint as the starting policy and reference model
  • Uses RLVR-GSM-MATH-IF-Mixed-Constraints for math and instruction following
  • vLLM inference workers generate responses asynchronously alongside training
  • This is the most computationally intensive stage but produces the final model
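The group-relative advantage normalizes each response's reward against the other responses sampled for the same prompt, so no learned value function is needed. A minimal sketch (grpo_fast.py's exact normalization may differ, e.g. mean-only variants without the std division):

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: standardize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With binary verifiable rewards (e.g. 1.0 if the math answer matches the ground truth, else 0.0), correct responses in a mixed group get positive advantage and incorrect ones negative, and the advantages sum to zero.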

Step 6: Evaluation_and_Release

Evaluate the final model across multiple benchmarks (MMLU, GSM8K, MATH, BBH, IFEval, AlpacaEval) and release the checkpoint to HuggingFace Hub. Evaluation can be performed using the OLMES evaluation framework or the built-in evaluation scripts via Beaker.

Key considerations:

  • Evaluation is typically run on Beaker using submit_eval_jobs.py
  • The OLMES framework provides standardized benchmark evaluation
  • Results should be compared against reference Tulu 3 checkpoints for validation
  • The final model is uploaded to HuggingFace with appropriate metadata
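The comparison against reference Tulu 3 checkpoints can be automated as a simple regression gate before release. A sketch (the benchmark name, scores, and tolerance below are placeholders, not published Tulu 3 numbers):

```python
def failing_benchmarks(scores: dict[str, float], reference: dict[str, float],
                       tol: float = 1.0) -> list[str]:
    """Return benchmarks scoring more than `tol` points below the reference run."""
    return [name for name, ref_score in reference.items()
            if scores.get(name, 0.0) < ref_score - tol]
```

An empty return list means every tracked benchmark is within tolerance of the reference checkpoint and the upload can proceed.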

Execution Diagram

GitHub URL

Workflow Repository