Workflow: Alibaba ROLL Supervised Fine-Tuning Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, Distributed_Training |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
End-to-end process for supervised fine-tuning of Large Language Models on instruction-response datasets using distributed training with advanced parallelism strategies.
Description
This workflow implements the SFT (Supervised Fine-Tuning) pipeline in the ROLL framework. It trains a base or pre-trained LLM on instruction-following data by computing cross-entropy loss on response tokens while masking prompt tokens. The pipeline leverages ROLL's distributed infrastructure to support advanced parallelism strategies including tensor parallelism, pipeline parallelism, context parallelism, and sequence packing for efficient training on large models across multiple GPUs.
Usage
Execute this workflow when you have a base LLM (e.g., Qwen2.5-7B) and an instruction-response dataset (e.g., code instructions, chat data), and you want to fine-tune the model to follow instructions before applying further alignment techniques such as RLVR or DPO.
Execution Steps
Step 1: Environment Setup and Configuration
Prepare the compute environment and define the Hydra YAML configuration specifying the base model path, dataset location, training hyperparameters, and distributed strategy. Configure parallelism dimensions (tensor parallel, pipeline parallel, context parallel, sequence parallel) based on model size and available GPUs.
Key considerations:
- The Megatron-Core backend supports full 5D parallelism (data, tensor, pipeline, context, and sequence) for large models
- Sequence packing can be enabled to concatenate short sequences and reduce padding waste
- Configure prompt_key, query_key, and response_key to match your dataset's field names
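A minimal Hydra-style configuration for this step might look like the sketch below. The key names are illustrative assumptions, not ROLL's exact schema; only `prompt_key`, `query_key`, `response_key`, and the parallelism dimensions named above come from this workflow, so check the framework's reference configs for the authoritative field names.

```yaml
# Illustrative SFT config sketch -- field names are assumptions,
# not ROLL's actual schema.
model_path: Qwen/Qwen2.5-7B          # base model to fine-tune
dataset_path: data/code_instructions.jsonl
prompt_key: instruction               # dataset field holding the prompt
query_key: input                      # optional extra context field
response_key: output                  # field holding the target response
sequence_length: 4096
sequence_packing: true                # pack short examples to cut padding
strategy:
  tensor_parallel_size: 4
  pipeline_parallel_size: 2
  context_parallel_size: 1
  sequence_parallel: true
training:
  learning_rate: 2.0e-5
  epochs: 3
  gradient_accumulation_steps: 8
```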
Step 2: Dataset Preparation
Prepare the instruction-response dataset in a supported format (JSON, JSONL). Each example should contain an instruction (prompt), optional input context, and the target response. The pipeline tokenizes data using the model's chat template and creates labels that mask prompt tokens with IGNORE_INDEX so loss is computed only on response tokens.
What happens:
- Raw examples are mapped through the chat template to create properly formatted sequences
- Labels are constructed: IGNORE_INDEX for all prompt tokens, actual token IDs for response tokens
- Sequences exceeding the configured sequence_length are truncated
- Optional sequence packing groups multiple short examples into single training sequences
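The label construction above can be sketched in a few lines. `IGNORE_INDEX = -100` is the common convention used by most loss implementations to skip positions; ROLL's actual constant and tokenization helpers may differ, so treat this as a schematic rather than the framework's code.

```python
# Sketch of SFT label construction: mask prompt tokens so loss is
# computed only on response tokens.
IGNORE_INDEX = -100  # assumed masking constant (common convention)

def build_labels(prompt_ids, response_ids, max_len):
    """Concatenate prompt + response token IDs, masking the prompt.

    Returns (input_ids, labels), both truncated to max_len, where
    every prompt position in labels is IGNORE_INDEX and every
    response position keeps its real token ID.
    """
    input_ids = (prompt_ids + response_ids)[:max_len]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_len]
    return input_ids, labels

# Toy example: a 3-token prompt and a 2-token response.
ids, labels = build_labels([1, 2, 3], [4, 5], max_len=8)
# labels -> [-100, -100, -100, 4, 5]
```

Truncation applies to both lists identically, so a response that runs past `sequence_length` simply loses its tail, matching the truncation behavior described above.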
Step 3: Distributed Worker Initialization
Launch the Ray cluster and initialize the SFT worker cluster with the configured training strategy. Workers load the base model and set up the optimizer, learning rate scheduler, and gradient accumulation configuration.
Key considerations:
- SFT uses a single worker type (no separate inference, reference, or reward workers)
- The training strategy handles model sharding across GPUs based on parallelism configuration
- Gradient checkpointing can be enabled to trade compute for memory on large models
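The sharding arithmetic behind worker initialization is simple to sanity-check: the world size must factor into the model-parallel dimensions, and whatever remains becomes the data-parallel size. The helper below is a hypothetical illustration, not a ROLL API.

```python
# Sketch: derive the data-parallel size from total GPUs and the
# configured model-parallel dimensions (function name is illustrative).
def data_parallel_size(world_size, tp, pp, cp=1):
    """Return world_size / (tp * pp * cp), validating divisibility."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp * pp * cp")
    return world_size // model_parallel

# e.g. 64 GPUs with tensor-parallel 4 and pipeline-parallel 2
dp = data_parallel_size(64, tp=4, pp=2)  # dp = 8
```

A mismatch here (e.g. 63 GPUs with tp=4, pp=2) is exactly the kind of configuration error worth catching before workers start loading model shards.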
Step 4: Training Loop
Iterate over the dataset in batches, computing cross-entropy loss on response tokens and updating model parameters. Data is rebalanced across data-parallel ranks to ensure even distribution. The training loop supports multiple epochs with configurable learning rate scheduling.
What happens:
- Each batch is loaded and distributed across DP ranks
- Forward pass computes logits for all tokens
- Loss is computed only on response tokens (prompt tokens are masked)
- Gradients are accumulated across micro-batches and reduced across DP ranks
- Optimizer step applies parameter updates with learning rate scheduling
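The loss computation in the loop above reduces to cross-entropy averaged over unmasked positions. A dependency-free sketch (per-position logit lists instead of real tensors, `IGNORE_INDEX = -100` assumed as before) shows the masking logic; ROLL's actual implementation operates on GPU tensors through its training strategy.

```python
import math

IGNORE_INDEX = -100  # assumed masking constant

def masked_cross_entropy(logits, labels):
    """Mean cross-entropy over positions whose label != IGNORE_INDEX.

    logits: list of per-position logit lists (one entry per vocab id);
    labels: list of target token IDs, with prompt positions masked.
    """
    total, count = 0.0, 0
    for pos_logits, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # prompt token: contributes no loss
        # log-sum-exp with max subtraction for numerical stability
        m = max(pos_logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in pos_logits))
        total += log_z - pos_logits[label]
        count += 1
    return total / max(count, 1)

# Uniform logits over a 2-token vocab: loss on the single response
# token is log(2); the masked prompt position adds nothing.
loss = masked_cross_entropy([[0.0, 0.0], [0.0, 0.0]], [IGNORE_INDEX, 1])
```

Gradient accumulation then sums this loss over micro-batches before the all-reduce across DP ranks, as described above.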
Step 5: Validation and Checkpointing
Periodically evaluate on a held-out validation set by computing validation loss. Save model checkpoints at configured intervals. Log training metrics (loss, learning rate, gradient norms) to the tracking backend.
Key considerations:
- Tracking validation loss across epochs helps detect overfitting early
- Megatron checkpoints can be converted to HuggingFace format using the provided conversion tool
- Checkpoints include full optimizer state for training resumption
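The scheduling behind this step is a simple interval check against the global optimizer step. The predicate below is a sketch; the interval names (`eval_steps`, `save_steps`) are illustrative and not ROLL's config keys.

```python
# Sketch of periodic validation/checkpoint triggering. Interval
# parameter names here are assumptions, not ROLL's actual keys.
def should_run(step, interval):
    """Fire on every `interval`-th optimizer step (1-indexed);
    an interval of 0 disables the event entirely."""
    return interval > 0 and step % interval == 0

eval_steps, save_steps = 100, 500
for step in (100, 101, 500):
    do_eval = should_run(step, eval_steps)
    do_save = should_run(step, save_steps)
    # step 500 triggers both validation and a checkpoint;
    # step 101 triggers neither.
```

Because checkpoints carry full optimizer state, resuming from step 500 continues the learning-rate schedule and accumulated statistics exactly where training left off.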