Principle: Alibaba ROLL MCoreAdapter Training Entry Point
| Knowledge Sources | |
|---|---|
| Domains | Training, CLI_Tools |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A routing entry point that dispatches training to the appropriate stage handler (pre-training, supervised fine-tuning, or preference optimization) with support for both distributed-parallel and standard backends.
Description
Training large language models typically proceeds through multiple stages: pre-training (PT) on large unlabeled corpora, supervised fine-tuning (SFT) on instruction-following data, and preference optimization (DPO/ORPO) on human preference pairs. Each stage requires a different data pipeline, loss function, and potentially a different model configuration (e.g., DPO requires a frozen reference model).
This principle defines a unified entry point that:
- Argument Parsing: Uses a multi-dataclass argument parser to simultaneously parse training arguments, model arguments, data arguments, fine-tuning arguments, and backend selection arguments. This allows a single command line to fully specify the training run.
- Stage Routing: Based on the fine-tuning stage parameter (pt, sft, dpo), the entry point dispatches to the appropriate training function. Each function sets up the model, dataset, data collator, and trainer for that specific stage.
- Backend Selection: A use_mca flag selects between the distributed-parallel backend (using Megatron-Core with tensor/pipeline/expert parallelism) and a standard backend (using HuggingFace Trainer or LLaMA-Factory). This enables the same codebase to be used for both large-scale distributed training and single-GPU development.
- LoRA Integration: For parameter-efficient fine-tuning, the entry point applies LoRA adapters to the model after loading, marking expert layers appropriately and casting trainable parameters to float32 for stability.
- Data Pipeline Adaptation: The entry point wraps data collators to shift labels by one position (aligning inputs and targets for causal language modeling) and configures padding strategies based on the parallelism mode (max-length padding for expert parallelism, dynamic padding otherwise).
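The multi-dataclass parsing step can be sketched with the standard library alone. This is an illustrative stand-in for an HfArgumentParser-style parser, not the actual ROLL API; the dataclass names and fields below are hypothetical.

```python
"""Sketch of multi-dataclass argument parsing (stdlib only).
The real entry point uses an HfArgumentParser-style parser; the
dataclasses and field names here are illustrative assumptions."""
import argparse
from dataclasses import dataclass, fields

@dataclass
class ModelArgs:
    model_name_or_path: str = "gpt2"

@dataclass
class FinetuningArgs:
    stage: str = "sft"             # pt | sft | dpo
    finetuning_type: str = "full"  # full | lora

@dataclass
class BackendArgs:
    use_mca: bool = False          # Megatron-Core adapter backend

def parse_into_dataclasses(argv, dataclass_types):
    """Build one argparse parser from several dataclasses, then split
    the parsed namespace back into dataclass instances."""
    parser = argparse.ArgumentParser()
    for dc in dataclass_types:
        for f in fields(dc):
            if isinstance(f.default, bool):
                parser.add_argument(f"--{f.name}", action="store_true",
                                    default=f.default)
            else:
                parser.add_argument(f"--{f.name}", type=type(f.default),
                                    default=f.default)
    ns = parser.parse_args(argv)
    return tuple(dc(**{f.name: getattr(ns, f.name) for f in fields(dc)})
                 for dc in dataclass_types)

model_args, ft_args, backend_args = parse_into_dataclasses(
    ["--stage", "dpo", "--use_mca"],
    (ModelArgs, FinetuningArgs, BackendArgs))
```

A single command line thus fully specifies the run: each dataclass consumes only its own fields, and every handler downstream receives a typed arguments object.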
Usage
Use this principle when:
- Building a CLI tool that must support multiple training stages (PT, SFT, DPO) through a single command with stage selection.
- The system must support both distributed-parallel and standard training backends with a common interface.
- LoRA fine-tuning must be optionally applied at the entry point level, before the training loop begins.
Theoretical Basis
Stage dispatch logic:
PARSE args: (training_args, model_args, data_args, finetuning_args, use_mca_args)
IF use_mca:
    model = AutoModel.from_pretrained(path, training_args)
    IF finetuning_type == "lora":
        apply_megatron_lora()
        set_linear_is_expert(model)
        model = get_peft_model(model, lora_config)
    SWITCH finetuning_args.stage:
        CASE "pt":  pt_mca_train(...)
        CASE "sft": sft_mca_train(...)
        CASE "dpo": dpo_mca_train(...)
ELSE:
    SWITCH finetuning_args.stage:
        CASE "pt":  run_pt(...)  # LLaMA-Factory
        CASE "sft": run_sft(...)
        CASE "dpo": run_dpo(...)
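The dispatch above reduces to a pair of lookup tables keyed by stage. The sketch below uses stub handlers in place of the real training functions; the names mirror the pseudocode but the bodies are illustrative only.

```python
# Hypothetical sketch of stage/backend dispatch. The handler names
# mirror the pseudocode above; their bodies are stubs for illustration.
def pt_mca_train(**kw):  return "pt/mca"
def sft_mca_train(**kw): return "sft/mca"
def dpo_mca_train(**kw): return "dpo/mca"
def run_pt(**kw):  return "pt/std"
def run_sft(**kw): return "sft/std"
def run_dpo(**kw): return "dpo/std"

MCA_DISPATCH = {"pt": pt_mca_train, "sft": sft_mca_train, "dpo": dpo_mca_train}
STD_DISPATCH = {"pt": run_pt, "sft": run_sft, "dpo": run_dpo}

def run_training(stage, use_mca, **kwargs):
    """Route to the handler for (stage, backend); reject unknown stages."""
    table = MCA_DISPATCH if use_mca else STD_DISPATCH
    if stage not in table:
        raise ValueError(f"Unknown stage: {stage!r}")
    return table[stage](**kwargs)
```

Table-driven dispatch keeps the entry point flat: adding a stage means registering one handler per backend rather than growing a nested if/else chain.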
Label shifting for causal LM:
The data collator wrapper shifts inputs and labels:
features["labels"] = features["labels"][1:]                   # shift labels left
features["input_ids"] = features["input_ids"][:-1]            # drop last input token
features["attention_mask"] = features["attention_mask"][:-1]
This ensures that at each position t, the model predicts token t+1 given tokens 1..t.
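A toy run makes the alignment concrete. Assuming the collator wrapper behaves as in the snippet above, after shifting, labels[i] is exactly the token that follows input_ids[i] in the original sequence:

```python
# Toy demonstration of the collator's label shift for causal LM.
def shift_for_causal_lm(features):
    """Drop the last input token and the first label so that
    labels[i] is the successor of input_ids[i]."""
    return {
        "input_ids": features["input_ids"][:-1],
        "labels": features["labels"][1:],
        "attention_mask": features["attention_mask"][:-1],
    }

tokens = [101, 7, 8, 9, 102]  # e.g. BOS, three body tokens, EOS
features = {"input_ids": list(tokens), "labels": list(tokens),
            "attention_mask": [1] * len(tokens)}
shifted = shift_for_causal_lm(features)
# shifted["input_ids"] -> [101, 7, 8, 9]
# shifted["labels"]    -> [7, 8, 9, 102]
```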
DPO-specific setup:
model = AutoModel.from_pretrained(path, training_args)
IF use_ref_model:
    ref_config = AutoConfig.from_pretrained(path, training_args)
    ref_model = AutoModel.from_config(ref_config)
    ref_model.load_state_dict(model.state_dict())  # copy weights
    ref_model.eval()                               # freeze
trainer = DPOTrainer(model, ref_model, dpo_config, ...)
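The key invariant of this setup is that the reference model starts as an exact weight copy of the policy and is then frozen, so later policy updates leave it untouched. A stdlib-only sketch (the `TinyModel` class is a hypothetical stand-in for a real model):

```python
# Stdlib-only sketch of the DPO reference-model setup. `TinyModel`
# is an illustrative stand-in mimicking the state_dict/load_state_dict/
# eval interface of a real model; weights are a plain dict.
import copy

class TinyModel:
    def __init__(self, weights=None):
        self.weights = dict(weights or {"w": 0.0})
        self.training = True

    def state_dict(self):
        return copy.deepcopy(self.weights)

    def load_state_dict(self, sd):
        self.weights = copy.deepcopy(sd)

    def eval(self):
        self.training = False  # eval mode: no further updates
        return self

model = TinyModel({"w": 1.5})
ref_model = TinyModel()
ref_model.load_state_dict(model.state_dict())  # copy weights
ref_model.eval()                               # freeze

model.weights["w"] = 9.9  # the policy keeps training...
# ...but the frozen reference copy is unaffected
```

The deep copy is what decouples the two models: handing the reference the same weight objects (rather than copies) would silently let policy updates drift the reference, corrupting the DPO log-ratio.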