Principle: Alibaba ROLL MCoreAdapter Training Entry Point
| Knowledge Sources | |
|---|---|
| Domains | Training, CLI_Tools |
| Last Updated | 2026-02-07 20:00 GMT |
Overview
A routing entry point that dispatches training to the appropriate stage handler (pre-training, supervised fine-tuning, or preference optimization) with support for both distributed-parallel and standard backends.
Description
Training large language models typically proceeds through multiple stages: pre-training (PT) on large unlabeled corpora, supervised fine-tuning (SFT) on instruction-following data, and preference optimization (DPO/ORPO) on human preference pairs. Each stage requires a different data pipeline, loss function, and potentially a different model configuration (e.g., DPO requires a frozen reference model).
This principle defines a unified entry point that:
- Argument Parsing: Uses a multi-dataclass argument parser to simultaneously parse training arguments, model arguments, data arguments, fine-tuning arguments, and backend selection arguments. This allows a single command line to fully specify the training run.
- Stage Routing: Based on the fine-tuning stage parameter (pt, sft, dpo), the entry point dispatches to the appropriate training function. Each function sets up the model, dataset, data collator, and trainer for that specific stage.
- Backend Selection: A use_mca flag selects between the distributed-parallel backend (using Megatron-Core with tensor/pipeline/expert parallelism) and a standard backend (using HuggingFace Trainer or LLaMA-Factory). This enables the same codebase to be used for both large-scale distributed training and single-GPU development.
- LoRA Integration: For parameter-efficient fine-tuning, the entry point applies LoRA adapters to the model after loading, marking expert layers appropriately and casting trainable parameters to float32 for stability.
- Data Pipeline Adaptation: The entry point wraps data collators to shift labels by one position (aligning inputs and targets for causal language modeling) and configures padding strategies based on the parallelism mode (max-length padding for expert parallelism, dynamic padding otherwise).
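The multi-dataclass parsing step can be sketched with the standard library alone. This is an illustrative stand-in for an HfArgumentParser-style parser, not the actual ROLL API; the dataclass names and fields below are hypothetical.

```python
"""Sketch of multi-dataclass argument parsing (stdlib only).
The real entry point uses an HfArgumentParser-style parser; the
dataclasses and field names here are illustrative assumptions."""
import argparse
from dataclasses import dataclass, fields

@dataclass
class ModelArgs:
    model_name_or_path: str = "gpt2"

@dataclass
class FinetuningArgs:
    stage: str = "sft"             # pt | sft | dpo
    finetuning_type: str = "full"  # full | lora

@dataclass
class BackendArgs:
    use_mca: bool = False          # Megatron-Core adapter backend

def parse_into_dataclasses(argv, dataclass_types):
    """Build one argparse parser from several dataclasses, then split
    the parsed namespace back into dataclass instances."""
    parser = argparse.ArgumentParser()
    for dc in dataclass_types:
        for f in fields(dc):
            if isinstance(f.default, bool):
                parser.add_argument(f"--{f.name}", action="store_true",
                                    default=f.default)
            else:
                parser.add_argument(f"--{f.name}", type=type(f.default),
                                    default=f.default)
    ns = parser.parse_args(argv)
    return tuple(dc(**{f.name: getattr(ns, f.name) for f in fields(dc)})
                 for dc in dataclass_types)

model_args, ft_args, backend_args = parse_into_dataclasses(
    ["--stage", "dpo", "--use_mca"],
    (ModelArgs, FinetuningArgs, BackendArgs))
```

A single command line thus fully specifies the run: each dataclass consumes only its own fields, and every handler downstream receives a typed arguments object.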
Usage
Use this principle when:
- Building a CLI tool that must support multiple training stages (PT, SFT, DPO) through a single command with stage selection.
- The system must support both distributed-parallel and standard training backends with a common interface.
- LoRA fine-tuning must be optionally applied at the entry point level, before the training loop begins.
Theoretical Basis
Stage dispatch logic:
PARSE args: (training_args, model_args, data_args, finetuning_args, use_mca_args)
IF use_mca:
    model = AutoModel.from_pretrained(path, training_args)
    IF finetuning_type == "lora":
        apply_megatron_lora()
        set_linear_is_expert(model)
        model = get_peft_model(model, lora_config)
    SWITCH finetuning_args.stage:
        CASE "pt":  pt_mca_train(...)
        CASE "sft": sft_mca_train(...)
        CASE "dpo": dpo_mca_train(...)
ELSE:
    SWITCH finetuning_args.stage:
        CASE "pt":  run_pt(...)  # LLaMA-Factory
        CASE "sft": run_sft(...)
        CASE "dpo": run_dpo(...)
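The dispatch above reduces to a pair of lookup tables keyed by stage. The sketch below uses stub handlers in place of the real training functions; the names mirror the pseudocode but the bodies are illustrative only.

```python
# Hypothetical sketch of stage/backend dispatch. The handler names
# mirror the pseudocode above; their bodies are stubs for illustration.
def pt_mca_train(**kw):  return "pt/mca"
def sft_mca_train(**kw): return "sft/mca"
def dpo_mca_train(**kw): return "dpo/mca"
def run_pt(**kw):  return "pt/std"
def run_sft(**kw): return "sft/std"
def run_dpo(**kw): return "dpo/std"

MCA_DISPATCH = {"pt": pt_mca_train, "sft": sft_mca_train, "dpo": dpo_mca_train}
STD_DISPATCH = {"pt": run_pt, "sft": run_sft, "dpo": run_dpo}

def run_training(stage, use_mca, **kwargs):
    """Route to the handler for (stage, backend); reject unknown stages."""
    table = MCA_DISPATCH if use_mca else STD_DISPATCH
    if stage not in table:
        raise ValueError(f"Unknown stage: {stage!r}")
    return table[stage](**kwargs)
```

Table-driven dispatch keeps the entry point flat: adding a stage means registering one handler per backend rather than growing a nested if/else chain.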
Label shifting for causal LM:
The data collator wrapper shifts inputs and labels:
features["labels"] = features["labels"][1:]                   # shift labels left
features["input_ids"] = features["input_ids"][:-1]            # drop last input token
features["attention_mask"] = features["attention_mask"][:-1]
This ensures that at each position t, the model predicts token t+1 given tokens 1..t.
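A toy run makes the alignment concrete. Assuming the collator wrapper behaves as in the snippet above, after shifting, labels[i] is exactly the token that follows input_ids[i] in the original sequence:

```python
# Toy demonstration of the collator's label shift for causal LM.
def shift_for_causal_lm(features):
    """Drop the last input token and the first label so that
    labels[i] is the successor of input_ids[i]."""
    return {
        "input_ids": features["input_ids"][:-1],
        "labels": features["labels"][1:],
        "attention_mask": features["attention_mask"][:-1],
    }

tokens = [101, 7, 8, 9, 102]  # e.g. BOS, three body tokens, EOS
features = {"input_ids": list(tokens), "labels": list(tokens),
            "attention_mask": [1] * len(tokens)}
shifted = shift_for_causal_lm(features)
# shifted["input_ids"] -> [101, 7, 8, 9]
# shifted["labels"]    -> [7, 8, 9, 102]
```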
DPO-specific setup:
model = AutoModel.from_pretrained(path, training_args)
IF use_ref_model:
    ref_config = AutoConfig.from_pretrained(path, training_args)
    ref_model = AutoModel.from_config(ref_config)
    ref_model.load_state_dict(model.state_dict())  # copy weights
    ref_model.eval()                               # freeze
trainer = DPOTrainer(model, ref_model, dpo_config, ...)
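The key invariant of this setup is that the reference model starts as an exact weight copy of the policy and is then frozen, so later policy updates leave it untouched. A stdlib-only sketch (the `TinyModel` class is a hypothetical stand-in for a real model):

```python
# Stdlib-only sketch of the DPO reference-model setup. `TinyModel`
# is an illustrative stand-in mimicking the state_dict/load_state_dict/
# eval interface of a real model; weights are a plain dict.
import copy

class TinyModel:
    def __init__(self, weights=None):
        self.weights = dict(weights or {"w": 0.0})
        self.training = True

    def state_dict(self):
        return copy.deepcopy(self.weights)

    def load_state_dict(self, sd):
        self.weights = copy.deepcopy(sd)

    def eval(self):
        self.training = False  # eval mode: no further updates
        return self

model = TinyModel({"w": 1.5})
ref_model = TinyModel()
ref_model.load_state_dict(model.state_dict())  # copy weights
ref_model.eval()                               # freeze

model.weights["w"] = 9.9  # the policy keeps training...
# ...but the frozen reference copy is unaffected
```

The deep copy is what decouples the two models: handing the reference the same weight objects (rather than copies) would silently let policy updates drift the reference, corrupting the DPO log-ratio.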