
Workflow:Alibaba ROLL Supervised Finetuning Pipeline

From Leeroopedia



Knowledge Sources
Domains: LLMs, Fine_Tuning, Distributed_Training
Last Updated: 2026-02-07 19:00 GMT

Overview

End-to-end process for supervised fine-tuning of Large Language Models on instruction-response datasets using distributed training with advanced parallelism strategies.

Description

This workflow implements the SFT (Supervised Fine-Tuning) pipeline in the ROLL framework. It trains a base or pre-trained LLM on instruction-following data by computing cross-entropy loss on response tokens while masking prompt tokens. The pipeline leverages ROLL's distributed infrastructure to support advanced parallelism strategies including tensor parallelism, pipeline parallelism, context parallelism, and sequence packing for efficient training on large models across multiple GPUs.

Usage

Execute this workflow when you have a base LLM (e.g., Qwen2.5-7B) and an instruction-response dataset (e.g., code instructions, chat data), and you want to fine-tune the model to follow instructions before applying further alignment techniques such as RLVR or DPO.

Execution Steps

Step 1: Environment Setup and Configuration

Prepare the compute environment and define the Hydra YAML configuration specifying the base model path, dataset location, training hyperparameters, and distributed strategy. Configure parallelism dimensions (tensor parallel, pipeline parallel, context parallel, sequence parallel) based on model size and available GPUs.

Key considerations:

  • Megatron-Core backend supports full 5D parallelism for large models
  • Sequence packing can be enabled to concatenate short sequences and reduce padding waste
  • Configure prompt_key, query_key, and response_key to match your dataset's field names
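A minimal sketch of what such a configuration might look like, expressed here as a Python dict mirroring the Hydra YAML fields described above. The field names and values are illustrative assumptions, not ROLL's exact schema:

```python
# Hypothetical SFT config sketch; key names are illustrative, not ROLL's schema.
sft_config = {
    "model": {"model_name_or_path": "Qwen/Qwen2.5-7B"},
    "data": {
        "file_name": "data/code_instructions.jsonl",
        "prompt_key": "instruction",   # dataset field holding the prompt
        "query_key": "input",          # optional extra context field
        "response_key": "output",      # dataset field holding the target response
        "sequence_length": 4096,
        "sequence_packing": True,      # concatenate short samples to cut padding
    },
    "training": {
        "learning_rate": 2e-5,
        "num_train_epochs": 3,
        "gradient_accumulation_steps": 8,
    },
    "strategy": {  # parallelism dimensions for the Megatron-Core backend
        "tensor_model_parallel_size": 2,
        "pipeline_model_parallel_size": 2,
        "context_parallel_size": 1,
    },
}
```

In a real run these values would live in a Hydra YAML file; the dict form is shown only to make the field grouping concrete.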

Step 2: Dataset Preparation

Prepare the instruction-response dataset in a supported format (JSON, JSONL). Each example should contain an instruction (prompt), optional input context, and the target response. The pipeline tokenizes data using the model's chat template and creates labels that mask prompt tokens with IGNORE_INDEX so loss is computed only on response tokens.

What happens:

  • Raw examples are mapped through the chat template to create properly formatted sequences
  • Labels are constructed: IGNORE_INDEX for all prompt tokens, actual token IDs for response tokens
  • Sequences exceeding the configured sequence_length are truncated
  • Optional sequence packing groups multiple short examples into single training sequences
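The label construction and packing steps above can be sketched in a few lines of plain Python. This is a simplified illustration (real tokenization goes through the model's chat template); `IGNORE_INDEX = -100` follows the common HuggingFace-style convention, and `pack_examples` is a hypothetical greedy first-fit packer:

```python
IGNORE_INDEX = -100  # convention: positions with this label contribute no loss

def build_labels(prompt_ids, response_ids, max_len):
    """Concatenate prompt and response token IDs; mask prompt positions so
    cross-entropy is computed only on response tokens. Truncate to max_len."""
    input_ids = (list(prompt_ids) + list(response_ids))[:max_len]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(response_ids))[:max_len]
    return input_ids, labels

def pack_examples(examples, max_len):
    """Greedy first-fit packing of short (input_ids, labels) pairs into
    combined sequences of at most max_len tokens, reducing padding waste."""
    packed = []
    for ids, labels in examples:
        for bin_ in packed:
            if len(bin_[0]) + len(ids) <= max_len:
                bin_[0].extend(ids)
                bin_[1].extend(labels)
                break
        else:
            packed.append([list(ids), list(labels)])
    return packed
```

Note that a production packer also tracks per-example boundaries so attention does not leak across packed samples; that bookkeeping is omitted here.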

Step 3: Distributed Worker Initialization

Launch the Ray cluster and initialize the SFT worker cluster with the configured training strategy. Workers load the base model and set up the optimizer, learning rate scheduler, and gradient accumulation configuration.

Key considerations:

  • SFT uses a single worker type (no separate inference, reference, or reward workers)
  • The training strategy handles model sharding across GPUs based on parallelism configuration
  • Gradient checkpointing can be enabled to trade compute for memory on large models
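The arithmetic behind the sharding decision can be made explicit. Assuming the usual Megatron-style decomposition (GPUs not consumed by tensor, pipeline, or context parallelism form the data-parallel dimension), a hypothetical helper looks like:

```python
def data_parallel_size(world_size, tp, pp, cp):
    """GPUs not consumed by tensor/pipeline/context parallelism form the
    data-parallel dimension; each DP rank holds one full model replica."""
    model_parallel = tp * pp * cp
    assert world_size % model_parallel == 0, "world size must divide evenly"
    return world_size // model_parallel

def global_batch_size(micro_batch, grad_accum, dp_size):
    """Samples consumed per optimizer step across all data-parallel ranks."""
    return micro_batch * grad_accum * dp_size
```

For example, 16 GPUs with TP=2, PP=2, CP=1 leave 4 data-parallel ranks; with micro-batch 2 and 8 accumulation steps, each optimizer step consumes 64 samples.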

Step 4: Training Loop

Iterate over the dataset in batches, computing cross-entropy loss on response tokens and updating model parameters. Data is rebalanced across data-parallel ranks to ensure even distribution. The training loop supports multiple epochs with configurable learning rate scheduling.

What happens:

  • Each batch is loaded and distributed across DP ranks
  • Forward pass computes logits for all tokens
  • Loss is computed only on response tokens (prompt tokens are masked)
  • Gradients are accumulated across micro-batches and reduced across DP ranks
  • Optimizer step applies parameter updates with learning rate scheduling
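The masked loss at the heart of this loop can be sketched in pure Python (real training uses fused GPU kernels, but the math is the same): positions whose label is `IGNORE_INDEX` are skipped, and the rest contribute standard cross-entropy computed via a numerically stable log-sum-exp:

```python
import math

IGNORE_INDEX = -100

def masked_cross_entropy(logits, labels):
    """Mean cross-entropy over positions whose label is not IGNORE_INDEX.
    logits: per-position lists of vocabulary scores; labels: token IDs."""
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue  # prompt token: masked out, contributes no loss
        m = max(scores)  # subtract max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[label]  # -log softmax(scores)[label]
        count += 1
    return total / max(count, 1)
```

With uniform logits over a 2-token vocabulary the per-position loss is log 2, and a confidently correct prediction drives it toward zero.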

Step 5: Validation and Checkpointing

Periodically evaluate on a held-out validation set by computing validation loss. Save model checkpoints at configured intervals. Log training metrics (loss, learning rate, gradient norms) to the tracking backend.

Key considerations:

  • Validation loss monitors overfitting across epochs
  • Megatron checkpoints can be converted to HuggingFace format using the provided conversion tool
  • Checkpoints include full optimizer state for training resumption
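Interval-based validation and checkpointing reduce to simple modular arithmetic on the global step. A hypothetical helper (the parameter names `eval_steps` and `save_steps` are illustrative, not ROLL's exact config keys):

```python
def step_actions(step, eval_steps, save_steps):
    """Return which periodic actions fire at a given 1-indexed global step:
    validate every eval_steps steps, checkpoint every save_steps steps."""
    return {
        "validate": step % eval_steps == 0,
        "save_checkpoint": step % save_steps == 0,
    }
```

A training loop would call this once per step and dispatch the validation pass and checkpoint write accordingly.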

Execution Diagram

GitHub URL

Workflow Repository