Workflow:OpenRLHF OpenRLHF SFT Training

Knowledge Sources	OpenRLHF Hugging Face Transformers DeepSpeed
Domains	LLMs, Fine_Tuning, SFT
Last Updated	2026-02-07 10:00 GMT

Overview

End-to-end process for supervised fine-tuning (SFT) of large language models on instruction-response datasets using DeepSpeed distributed training.

Description

This workflow covers the standard procedure for adapting a pretrained base model to follow instructions. It loads a pretrained model (optionally with LoRA adapters for parameter efficiency), tokenizes an instruction-response dataset with configurable chat templates, and trains using DeepSpeed ZeRO parallelism. The process supports sample packing for efficient GPU utilization, gradient checkpointing for memory savings, and flash attention for training speed. The output is a fine-tuned model checkpoint ready for downstream tasks such as reward model training, DPO alignment, or RL-based training.

Usage

Execute this workflow when you have an instruction-response dataset (e.g., OpenOrca, Alpaca-style) and need to adapt a base language model to follow instructions. This is typically the first stage in the RLHF pipeline, producing an SFT model that serves as the starting point for reward modeling and policy optimization. It also supports LoRA/QLoRA for training large models on limited GPU memory.

Execution Steps

Step 1: Configure distributed strategy

Initialize the DeepSpeed training strategy with the desired parallelism configuration (ZeRO stage 2 or 3), precision settings (bf16/fp16), and gradient accumulation parameters. This sets up the distributed process group and communication backend.

Key considerations:

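ZeRO-2 (shards optimizer states and gradients) is typical for SFT; ZeRO-3 (also shards model parameters) is for models too large to replicate on each GPU
bf16 mixed precision is recommended on GPUs with native bf16 support (e.g., NVIDIA Ampere or newer)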
Gradient accumulation steps are auto-calculated from global and micro batch sizes
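
The batch-size relationship above can be sketched as follows. This is a minimal illustration, not OpenRLHF's actual strategy code: the helper names `gradient_accumulation_steps` and `build_ds_config` are hypothetical, and the sketch assumes plain data parallelism where every GPU is one data-parallel rank. The dictionary keys are standard DeepSpeed configuration fields.

```python
def gradient_accumulation_steps(train_batch_size: int,
                                micro_batch_size: int,
                                world_size: int) -> int:
    """Derive accumulation steps so that micro * world * accum == global batch.

    Hypothetical helper; assumes one data-parallel rank per GPU.
    """
    samples_per_forward = micro_batch_size * world_size
    if train_batch_size % samples_per_forward != 0:
        raise ValueError(
            "train_batch_size must be divisible by micro_batch_size * world_size"
        )
    return train_batch_size // samples_per_forward


def build_ds_config(train_batch_size: int,
                    micro_batch_size: int,
                    world_size: int,
                    zero_stage: int = 2,
                    bf16: bool = True) -> dict:
    """Assemble a minimal DeepSpeed-style config dict (illustrative subset)."""
    accum = gradient_accumulation_steps(train_batch_size, micro_batch_size, world_size)
    return {
        "train_batch_size": train_batch_size,
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": accum,
        "zero_optimization": {"stage": zero_stage},
        # Enable exactly one of the two mixed-precision modes.
        "bf16": {"enabled": bf16},
        "fp16": {"enabled": not bf16},
    }


if __name__ == "__main__":
    # Global batch 128, micro batch 4 per GPU, 8 GPUs.
    config = build_ds_config(128, 4, 8)
    print(config["gradient_accumulation_steps"])  # 4
```

Raising an error on a non-divisible batch size is a design choice of this sketch; it makes a misconfigured global batch visible instead of silently rounding it.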

Step 2: Load pretrained model

Load the base language model from a HuggingFace checkpoint. Optionally configure LoRA adapters with specified rank and alpha for parameter-efficient fine-tuning. Enable flash attention and gradient checkpointing as needed.

Key considerations:

LoRA freezes the base weights and trains only small low-rank adapter matrices, which cuts the trainable parameter count dramatically (e.g., rank 64 on Mixtral 8x7B)
4-bit quantization (QLoRA) can be enabled for extreme memory savings
Flash attention requires a compatible GPU architecture
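
To make the parameter savings concrete, the arithmetic below compares a full weight matrix with its LoRA adapter. It is a sketch under stated assumptions: the 4096 x 4096 projection size is a hypothetical example, not a figure from this workflow, and the helper names are invented for illustration. The count itself follows from the LoRA construction, where a `d_out x d_in` weight receives two factors of shapes `rank x d_in` and `d_out x rank`.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters in one dense weight matrix of shape (d_out, d_in)."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_in = d_out = 4096   # hypothetical projection size
    rank = 64

    full = full_param_count(d_in, d_out)
    lora = lora_param_count(d_in, d_out, rank)
    print(full)                   # 16777216
    print(lora)                   # 524288
    print(f"{lora / full:.3%}")   # 3.125%
```

The fraction shrinks further in practice when adapters are attached to only some of a model's weight matrices.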

Step 3: Prepare tokenizer and dataset

Load the tokenizer with the appropriate chat template settings. If multiple datasets are specified, blend them with weighted mixing. Create the SFT dataset, which tokenizes instruction-response pairs, applies the chat template, and builds loss masks so that loss is computed only on the response tokens.

Key considerations:

Chat templates must match the model family (Llama, Mistral, etc.)
Sample packing concatenates multiple short examples into one sequence for efficiency
Loss masking ensures the model is only trained to predict response tokens, not prompts
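
Loss masking and sample packing can be sketched in a few lines. This is a generic illustration, not OpenRLHF's dataset code: `build_loss_mask` and `pack_sequences` are hypothetical helpers, and the first-fit packing strategy is an assumption chosen for simplicity. A real implementation must also keep packed examples from attending to each other, which this sketch does not cover.

```python
from typing import List


def build_loss_mask(prompt_len: int, total_len: int) -> List[int]:
    """Return 0 for prompt tokens and 1 for response tokens."""
    if not 0 <= prompt_len <= total_len:
        raise ValueError("prompt_len must be between 0 and total_len")
    return [0] * prompt_len + [1] * (total_len - prompt_len)


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Greedy first-fit packing.

    Returns one list of example indices per packed sequence, such that the
    summed token length of each list does not exceed max_len.
    """
    packs: List[List[int]] = []
    free_space: List[int] = []
    for idx, length in enumerate(lengths):
        if length > max_len:
            raise ValueError(f"example {idx} is longer than max_len")
        for pack_id, free in enumerate(free_space):
            if length <= free:
                packs[pack_id].append(idx)
                free_space[pack_id] -= length
                break
        else:
            # No existing pack has room, so open a new one.
            packs.append([idx])
            free_space.append(max_len - length)
    return packs


if __name__ == "__main__":
    # 3 prompt tokens followed by 2 response tokens.
    print(build_loss_mask(3, 5))              # [0, 0, 0, 1, 1]
    # Four examples packed into sequences of at most 8 tokens.
    print(pack_sequences([5, 3, 4, 2], 8))    # [[0, 1], [2, 3]]
```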

Step 4: Setup optimizer and scheduler

Configure the optimizer (typically AdamW) with a learning rate and weight decay. Set up the learning rate scheduler (cosine annealing is the default) with warmup steps.

Key considerations:

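Typical learning rates range from 2e-6 to 5e-6 for full-parameter SFT; LoRA runs often use higher values
A cosine scheduler with a warmup ratio of 0.03 is a common default
Clipping the gradient norm to a maximum value limits the effect of occasional large gradients and reduces training instability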
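
The warmup-then-cosine schedule can be written as a pure function of the step index. This is a minimal sketch, assuming linear warmup from zero and decay to a configurable floor; the function name `lr_at_step` and the `min_lr_ratio` parameter are hypothetical, and the scheduler actually used in a training run may differ in detail.

```python
import math


def lr_at_step(step: int,
               total_steps: int,
               base_lr: float = 5e-6,
               warmup_ratio: float = 0.03,
               min_lr_ratio: float = 0.0) -> float:
    """Learning rate with linear warmup followed by cosine decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return base_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)


if __name__ == "__main__":
    total = 1000
    for step in (0, 15, 30, 500, 1000):
        print(step, lr_at_step(step, total))
```

With these defaults the rate starts at zero, peaks at `base_lr` when warmup ends, and decays toward `base_lr * min_lr_ratio` at the final step.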

Step 5: Train the model

Execute the SFT training loop. For each epoch, iterate through batches computing cross-entropy loss on response tokens. DeepSpeed handles gradient synchronization, mixed precision, and optimizer steps across distributed workers. Periodic evaluation on held-out data tracks loss convergence.

Key considerations:

Monitor training loss for convergence
Evaluate on held-out set to detect overfitting
Save intermediate checkpoints for recovery
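
The per-batch objective is cross-entropy averaged over response tokens only. The sketch below implements it in plain Python to keep it dependency-free; a real training loop would use tensor operations instead. The function name `masked_cross_entropy` is hypothetical, and the sketch assumes the logits are already aligned with their targets (in a causal language model, the logits at position t are scored against the token at position t + 1).

```python
import math
from typing import List


def masked_cross_entropy(logits: List[List[float]],
                         targets: List[int],
                         loss_mask: List[int]) -> float:
    """Mean negative log-likelihood over positions where loss_mask is 1."""
    total = 0.0
    counted = 0
    for row, target, keep in zip(logits, targets, loss_mask):
        if not keep:
            continue  # prompt token: contributes nothing to the loss
        # Numerically stable log-sum-exp over the vocabulary.
        row_max = max(row)
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[target]
        counted += 1
    return total / max(1, counted)


if __name__ == "__main__":
    # Two positions over a 2-token vocabulary; the first is a prompt token.
    logits = [[10.0, -10.0], [0.0, 0.0]]
    targets = [1, 0]
    loss_mask = [0, 1]
    # Only the second position counts; uniform logits give a loss of ln(2).
    print(masked_cross_entropy(logits, targets, loss_mask))
```

Dividing by the number of unmasked tokens rather than the sequence length keeps the loss scale comparable between examples with long and short prompts.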

Step 6: Save model checkpoint

Save the trained model weights and tokenizer to the specified output path. For LoRA training, only the adapter weights are saved; they can later be merged with the base model.

Key considerations:

Full model saves include all weights; LoRA saves only adapter deltas
Use the LoRA combiner utility to merge adapters with the base model post-training
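
Merging an adapter amounts to adding the scaled low-rank product to the frozen weight: W_merged = W + (alpha / rank) * B A. The sketch below shows that arithmetic on nested lists. It illustrates the standard LoRA merge formula only and is not the combiner utility itself; `matmul` and `merge_lora` are hypothetical helpers.

```python
from typing import List

Matrix = List[List[float]]


def matmul(left: Matrix, right: Matrix) -> Matrix:
    """Plain matrix product of left (m x k) and right (k x n)."""
    inner = len(right)
    if any(len(row) != inner for row in left):
        raise ValueError("inner dimensions do not match")
    cols = len(right[0])
    return [
        [sum(row[k] * right[k][j] for k in range(inner)) for j in range(cols)]
        for row in left
    ]


def merge_lora(weight: Matrix, lora_a: Matrix, lora_b: Matrix,
               alpha: float, rank: int) -> Matrix:
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    weight is (d_out x d_in), lora_a is (rank x d_in), lora_b is (d_out x rank).
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


if __name__ == "__main__":
    weight = [[1.0, 0.0], [0.0, 1.0]]
    lora_a = [[1.0, 2.0]]        # rank 1, d_in 2
    lora_b = [[1.0], [0.0]]      # d_out 2, rank 1
    print(merge_lora(weight, lora_a, lora_b, alpha=2.0, rank=1))
    # [[3.0, 4.0], [0.0, 1.0]]
```

After merging, the model has the same shape as the base model and can be loaded without any adapter machinery.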
