Workflow:OpenRLHF OpenRLHF SFT Training
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, SFT |
| Last Updated | 2026-02-07 10:00 GMT |
Overview
End-to-end process for supervised fine-tuning (SFT) of large language models on instruction-response datasets using DeepSpeed distributed training.
Description
This workflow covers the standard procedure for adapting a pretrained base model to follow instructions. It loads a pretrained model (optionally with LoRA adapters for parameter efficiency), tokenizes an instruction-response dataset with configurable chat templates, and trains using DeepSpeed ZeRO parallelism. The process supports sample packing for efficient GPU utilization, gradient checkpointing for memory savings, and flash attention for training speed. The output is a fine-tuned model checkpoint ready for downstream tasks such as reward model training, DPO alignment, or RL-based training.
Usage
Execute this workflow when you have an instruction-response dataset (e.g., OpenOrca, Alpaca-style) and need to adapt a base language model to follow instructions. This is typically the first stage in the RLHF pipeline, producing an SFT model that serves as the starting point for reward modeling and policy optimization. It also supports LoRA/QLoRA for training large models on limited GPU memory.
Execution Steps
Step 1: Configure distributed strategy
Initialize the DeepSpeed training strategy with the desired parallelism configuration (ZeRO stage 2 or 3), precision settings (bf16/fp16), and gradient accumulation parameters. This sets up the distributed process group and communication backend.
Key considerations:
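- ZeRO-2 (sharded optimizer states and gradients) is typical for SFT; ZeRO-3 additionally shards the model parameters and suits models too large to replicate on each GPU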
- bf16 mixed precision is recommended for modern GPUs
- Gradient accumulation steps are auto-calculated from global and micro batch sizes
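The auto-calculation mentioned in the last point can be sketched as below. This is a minimal illustration, assuming the global batch is split evenly across data-parallel workers. The function names `gradient_accumulation_steps` and `build_ds_config` are hypothetical and are not taken from OpenRLHF. The dictionary keys are standard DeepSpeed configuration keys, but the exact config OpenRLHF generates may differ.

```python
def gradient_accumulation_steps(global_batch_size: int,
                                micro_batch_size: int,
                                world_size: int) -> int:
    """Micro-batches each worker accumulates before one optimizer step.

    Assumption: global_batch_size == micro_batch_size * world_size * steps.
    """
    samples_per_pass = micro_batch_size * world_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            "global batch size must be a multiple of micro_batch_size * world_size"
        )
    return global_batch_size // samples_per_pass


def build_ds_config(global_batch_size: int, micro_batch_size: int,
                    world_size: int, zero_stage: int = 2,
                    bf16: bool = True) -> dict:
    """Hypothetical helper that assembles a DeepSpeed-style config dict."""
    steps = gradient_accumulation_steps(
        global_batch_size, micro_batch_size, world_size
    )
    return {
        "train_batch_size": global_batch_size,
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": steps,
        "zero_optimization": {"stage": zero_stage},
        # Enable exactly one of the two mixed-precision modes
        "bf16": {"enabled": bf16},
        "fp16": {"enabled": not bf16},
    }


# Example: 128 samples per optimizer step, 4 per GPU per pass, 8 GPUs
config = build_ds_config(128, 4, 8)
print(config["gradient_accumulation_steps"])
```

With these example numbers each GPU processes 4 micro-batches of 4 samples before the optimizer steps, so 8 GPUs together cover the 128-sample global batch.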
Step 2: Load pretrained model
Load the base language model from a HuggingFace checkpoint. Optionally configure LoRA adapters with specified rank and alpha for parameter-efficient fine-tuning. Enable flash attention and gradient checkpointing as needed.
Key considerations:
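- LoRA dramatically reduces the number of trainable parameters by freezing the base weights and training only small low-rank adapter matrices (e.g., rank-64 adapters on Mixtral 8x7B)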
- 4-bit quantization (QLoRA) can be enabled for extreme memory savings
- Flash attention requires a compatible GPU architecture (FlashAttention-2 targets NVIDIA Ampere or newer)
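To make the parameter savings from LoRA concrete, the arithmetic for a single linear layer can be sketched as follows. The 4096 x 4096 layer size is a hypothetical example, not a figure from this workflow, and the helper names are illustrative only.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    LoRA freezes the d_out x d_in weight and learns two low-rank factors:
    A (rank x d_in) and B (d_out x rank).
    """
    return rank * d_in + d_out * rank


def dense_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen dense weight (bias ignored)."""
    return d_in * d_out


# Hypothetical 4096 x 4096 projection with a rank-64 adapter
d = 4096
lora = lora_trainable_params(d, d, rank=64)
full = dense_params(d, d)
print(f"LoRA: {lora:,} trainable vs {full:,} frozen (ratio {lora / full:.4f})")
```

For a square layer the ratio works out to 2 * rank / d, so lower ranks or wider layers give proportionally larger savings.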
Step 3: Prepare tokenizer and dataset
Load the tokenizer with the appropriate chat template settings. If multiple datasets are specified, blend them with weighted mixing. Create the SFT dataset, which tokenizes instruction-response pairs, applies the chat template, and builds loss masks so that loss is computed only on response tokens.
Key considerations:
- Chat templates must match the model family (Llama, Mistral, etc.)
- Sample packing concatenates multiple short examples into one sequence for efficiency
- Loss masking ensures the model is only trained to predict response tokens, not prompts
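Loss masking and sample packing, as described above, can be illustrated with plain Python lists. This is a simplified sketch under stated assumptions: the greedy packing strategy and the names `build_loss_mask` and `pack_examples` are hypothetical, and a real implementation also has to keep packed examples from attending to each other.

```python
from typing import List, Tuple

Example = Tuple[List[int], List[int]]  # (token_ids, loss_mask)


def build_loss_mask(prompt_len: int, response_len: int) -> List[int]:
    """1 marks a response token that contributes to the loss, 0 a prompt token."""
    return [0] * prompt_len + [1] * response_len


def pack_examples(examples: List[Example], max_len: int) -> List[Example]:
    """Greedily concatenate examples into sequences of at most max_len tokens."""
    packed: List[Example] = []
    cur_ids: List[int] = []
    cur_mask: List[int] = []
    for ids, mask in examples:
        if len(ids) > max_len:
            raise ValueError("example longer than max_len; truncate it first")
        # Start a new packed sequence when the next example would not fit
        if cur_ids and len(cur_ids) + len(ids) > max_len:
            packed.append((cur_ids, cur_mask))
            cur_ids, cur_mask = [], []
        cur_ids = cur_ids + ids
        cur_mask = cur_mask + mask
    if cur_ids:
        packed.append((cur_ids, cur_mask))
    return packed


# Toy token ids with (prompt, response) lengths of (3, 2), (2, 2) and (4, 3)
examples = [
    (list(range(5)), build_loss_mask(3, 2)),
    (list(range(10, 14)), build_loss_mask(2, 2)),
    (list(range(20, 27)), build_loss_mask(4, 3)),
]
packed = pack_examples(examples, max_len=10)
for ids, mask in packed:
    print(len(ids), mask)
```

The first two examples fit into one sequence of 9 tokens, while the third starts a new sequence. The mask travels with the tokens, so prompt positions stay excluded from the loss after packing.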
Step 4: Setup optimizer and scheduler
Configure the optimizer (typically AdamW) with a learning rate and weight decay. Set up the learning rate scheduler (cosine annealing is the default) with warmup steps.
Key considerations:
- Typical learning rates range from 2e-6 to 5e-6 for SFT
- A cosine scheduler with a warmup ratio of 0.03 (the first 3% of training steps) is standard
- Clipping the gradient norm to a maximum value helps prevent training instability
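The warmup-then-cosine schedule and gradient norm clipping mentioned above can be sketched in plain Python. This is an illustrative approximation, not OpenRLHF's scheduler: the function names, the linear warmup shape, and the `min_lr` floor of zero are all assumptions.

```python
import math
from typing import List


def lr_at_step(step: int, total_steps: int, base_lr: float,
               warmup_ratio: float = 0.03, min_lr: float = 0.0) -> float:
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads: List[float], max_norm: float) -> List[float]:
    """Scale all gradients down together if their L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads]


# 1000 training steps with a peak learning rate of 5e-6
for step in (0, 29, 30, 500, 1000):
    print(step, lr_at_step(step, total_steps=1000, base_lr=5e-6))

print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
```

With 1000 steps and a ratio of 0.03, the first 30 steps ramp the rate up linearly, after which it decays along a half cosine. Clipping rescales the whole gradient vector rather than individual components, which preserves its direction.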
Step 5: Train the model
Execute the SFT training loop. For each epoch, iterate through batches computing cross-entropy loss on response tokens. DeepSpeed handles gradient synchronization, mixed precision, and optimizer steps across distributed workers. Periodic evaluation on held-out data tracks loss convergence.
Key considerations:
- Monitor training loss for convergence
- Evaluate on held-out set to detect overfitting
- Save intermediate checkpoints for recovery
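The loss computed in the training loop, cross-entropy restricted to response tokens, can be written out for a single sequence as below. This is a dependency-free sketch for illustration. In practice the computation runs on batched tensors, and `masked_cross_entropy` is a hypothetical name.

```python
import math
from typing import List


def masked_cross_entropy(logits: List[List[float]],
                         targets: List[int],
                         loss_mask: List[int]) -> float:
    """Mean negative log-likelihood over positions where loss_mask is 1."""
    total, count = 0.0, 0
    for row, target, keep in zip(logits, targets, loss_mask):
        if not keep:
            continue  # prompt token: excluded from the loss
        # Numerically stable log-sum-exp for the softmax normalizer
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
        count += 1
    if count == 0:
        raise ValueError("loss_mask selects no tokens")
    return total / count


# Three positions over a two-token vocabulary; the middle one is a prompt token
logits = [[0.0, 0.0], [5.0, -5.0], [0.0, 0.0]]
targets = [0, 1, 1]
loss_mask = [1, 0, 1]
print(masked_cross_entropy(logits, targets, loss_mask))
```

The middle position would contribute a large loss if it were counted, but the mask removes it. The result is the average over the two unmasked positions only.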
Step 6: Save model checkpoint
Save the trained model weights and tokenizer to the specified output path. For LoRA training, only the adapter weights are saved; they can later be merged with the base model.
Key considerations:
- Full model saves include all weights; LoRA saves only adapter deltas
- Use the LoRA combiner utility to merge adapters with the base model post-training
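Merging an adapter into the base model follows the standard LoRA formulation, in which the merged weight is W + (alpha / rank) * B A. The sketch below shows this for one layer using nested lists. It illustrates the math only and is not the LoRA combiner utility itself; `merge_lora` is a hypothetical name.

```python
from typing import List

Matrix = List[List[float]]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix,
               alpha: float) -> Matrix:
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    Shapes: weight is d_out x d_in, lora_a is rank x d_in,
    lora_b is d_out x rank.
    """
    rank = len(lora_a)
    scale = alpha / rank
    d_out, d_in = len(weight), len(weight[0])
    merged: Matrix = []
    for i in range(d_out):
        row = []
        for j in range(d_in):
            # Entry (i, j) of the low-rank update B @ A
            delta = sum(lora_b[i][k] * lora_a[k][j] for k in range(rank))
            row.append(weight[i][j] + scale * delta)
        merged.append(row)
    return merged


# Toy 2 x 2 base weight with a rank-1 adapter and alpha = 2
base = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]    # rank x d_in
b = [[1.0], [0.0]]  # d_out x rank
merged = merge_lora(base, a, b, alpha=2.0)
print(merged)
```

After merging, the layer behaves like an ordinary dense layer, so inference needs no adapter-specific code path.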