Implementation:Hiyouga LLaMA Factory V1 Launcher

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	CLI, Distributed Training, DevOps
Last Updated	2026-02-06 19:00 GMT

Overview

launcher.py is the top-level CLI entry point and distributed training launcher for the LLaMA-Factory v1 system, managing command routing, multi-GPU detection, and torchrun-based distributed execution.

Description

The launcher module provides two functions: launch() and main(). launch() is the initial CLI entry point. It parses the command (sft, dpo, rm, chat, help, version, env), detects multi-GPU setups, and re-launches the process through torchrun for distributed training. Multi-node training is configured through the NNODES, NODE_RANK, MASTER_ADDR, and MASTER_PORT environment variables, and elastic launch through RDZV_ID, MIN_NNODES, and MAX_NNODES. main() runs inside each torchrun-spawned worker process and routes the command to the matching trainer (SFT, DPO, or RM). When OPTIM_TORCH is enabled, the launcher also applies environment optimizations for CUDA memory allocation and NCCL.
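To make the re-launch step concrete, here is a minimal sketch of how a torchrun command line could be assembled from the environment variables named above. It is not the code in launcher.py: build_torchrun_args is a hypothetical helper, and the fixed fallback port 29500 and the c10d rendezvous backend are assumptions made for illustration. The flags themselves (--nnodes, --node_rank, --nproc_per_node, --master_addr, --master_port, --max_restarts, --rdzv_id, --rdzv_backend, --rdzv_endpoint) are standard torchrun options.

```python
from __future__ import annotations

from typing import Mapping


def build_torchrun_args(
    script: str,
    script_args: list[str],
    gpu_count: int,
    env: Mapping[str, str],
) -> list[str]:
    """Assemble a torchrun argv from launcher-style environment variables.

    Hypothetical helper for illustration; the real launcher may differ.
    """
    nnodes = env.get("NNODES", "1")
    master_addr = env.get("MASTER_ADDR", "127.0.0.1")
    # Assumption: a fixed fallback port instead of auto-detecting a free one.
    master_port = env.get("MASTER_PORT", "29500")

    cmd = [
        "torchrun",
        "--nproc_per_node", env.get("NPROC_PER_NODE", str(gpu_count)),
        "--max_restarts", env.get("MAX_RESTARTS", "0"),
    ]

    rdzv_id = env.get("RDZV_ID")
    if rdzv_id is not None:
        # Elastic mode: torchrun accepts a MIN:MAX node range for --nnodes.
        min_nodes = env.get("MIN_NNODES", nnodes)
        max_nodes = env.get("MAX_NNODES", nnodes)
        cmd += [
            "--nnodes", f"{min_nodes}:{max_nodes}",
            "--rdzv_id", rdzv_id,
            "--rdzv_backend", "c10d",  # assumption: c10d rendezvous backend
            "--rdzv_endpoint", f"{master_addr}:{master_port}",
        ]
    else:
        # Static mode: fixed node count with an explicit rank per node.
        cmd += [
            "--nnodes", nnodes,
            "--node_rank", env.get("NODE_RANK", "0"),
            "--master_addr", master_addr,
            "--master_port", master_port,
        ]

    return cmd + [script, *script_args]
```

Passing the environment as an explicit mapping, rather than reading os.environ inside the function, keeps the sketch easy to test; a caller would pass os.environ in practice.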

Usage

Invoke the launcher through the llamafactory-cli (or lmf) command-line tool. To train, run a command such as llamafactory-cli sft config.yaml. On a single GPU the launcher runs the trainer directly; with multiple GPUs it automatically wraps execution with torchrun. For interactive chat, use llamafactory-cli chat.
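The single-GPU versus multi-GPU decision can be sketched as a small predicate. This is an illustration under stated assumptions, not the launcher's own code: should_use_torchrun is a hypothetical name, the set of values treated as "enabled" for FORCE_TORCHRUN is a guess, and the command tuple mirrors the _DIST_TRAIN_COMMANDS constant listed on this page.

```python
from __future__ import annotations

from typing import Mapping

# Mirrors the module-level constant documented for launcher.py.
DIST_TRAIN_COMMANDS = ("train", "sft", "dpo", "rm")


def should_use_torchrun(command: str, device_count: int, env: Mapping[str, str]) -> bool:
    """Decide whether a command should be re-launched through torchrun.

    Hypothetical helper: only training commands are distributed, and they
    are wrapped when more than one device is visible or when the user
    forces it through FORCE_TORCHRUN.
    """
    if command not in DIST_TRAIN_COMMANDS:
        return False  # chat, help, version, env run in-process
    # Assumption: these strings count as "enabled"; the real parser may differ.
    forced = env.get("FORCE_TORCHRUN", "0").lower() in ("1", "true", "y")
    return forced or device_count > 1
```

For example, should_use_torchrun("sft", 1, {}) is False, so the trainer would run directly, while the same call with device_count=4 or with FORCE_TORCHRUN=1 is True.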

Code Reference

Source Location

Repository: Hiyouga_LLaMA_Factory
File: src/llamafactory/v1/launcher.py
Lines: 1-179

Signature

def launch() -> None: ...
def main() -> None: ...

# Module-level constants
USAGE: str  # Help text string
_DIST_TRAIN_COMMANDS: tuple  # ("train", "sft", "dpo", "rm")

Import

from llamafactory.v1.launcher import launch, main

I/O Contract

Inputs

Name	Type	Required	Description
sys.argv	list[str]	Yes	Command-line arguments. argv[1] is the command (sft/dpo/rm/chat/help/version/env), followed by config or args.
NNODES	env var	No	Number of nodes for multi-node training (default: "1").
NODE_RANK	env var	No	Rank of the current node (default: "0").
NPROC_PER_NODE	env var	No	Number of processes per node (default: GPU count).
MASTER_ADDR	env var	No	Master node address (default: "127.0.0.1").
MASTER_PORT	env var	No	Master node port (default: auto-detected available port).
FORCE_TORCHRUN	env var	No	Force torchrun even on single GPU.
OPTIM_TORCH	env var	No	Enable CUDA/NCCL optimizations (default: "1").
MAX_RESTARTS	env var	No	Maximum restarts for elastic launch (default: "0").
RDZV_ID	env var	No	Rendezvous ID for elastic launch. When set, enables elastic job mode.
MIN_NNODES	env var	No	Minimum number of nodes for elastic scaling.
MAX_NNODES	env var	No	Maximum number of nodes for elastic scaling.
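
The defaults in the table above can be summarized in a short sketch. The helper names resolve_launch_settings and find_available_port are hypothetical, and binding to port 0 is one common way to obtain a free port from the operating system; whether the launcher uses the same approach for MASTER_PORT is an assumption.

```python
from __future__ import annotations

import socket
from typing import Mapping


def find_available_port() -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("", 0))
        return sock.getsockname()[1]


def resolve_launch_settings(env: Mapping[str, str], gpu_count: int) -> dict[str, str]:
    """Fill in the documented defaults for any variable the caller left unset.

    Hypothetical helper for illustration only.
    """
    return {
        "NNODES": env.get("NNODES", "1"),
        "NODE_RANK": env.get("NODE_RANK", "0"),
        "NPROC_PER_NODE": env.get("NPROC_PER_NODE", str(gpu_count)),
        "MASTER_ADDR": env.get("MASTER_ADDR", "127.0.0.1"),
        # Only probe for a free port when the user has not chosen one.
        "MASTER_PORT": env.get("MASTER_PORT") or str(find_available_port()),
        "OPTIM_TORCH": env.get("OPTIM_TORCH", "1"),
        "MAX_RESTARTS": env.get("MAX_RESTARTS", "0"),
    }
```

A port found this way is only free at the moment of the check; another process can claim it before the training job binds, so multi-node runs usually set MASTER_PORT explicitly.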

Outputs

Name	Type	Description
Process exit code	int	0 on success, non-zero on failure. For distributed training, returns the torchrun exit code.
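
Returning the torchrun exit code amounts to running the child process and passing its return code back to the shell. The sketch below shows that pattern with the standard subprocess module; run_and_return_code is a hypothetical name, and the launcher's actual process-handling code may differ.

```python
from __future__ import annotations

import subprocess
import sys


def run_and_return_code(cmd: list[str]) -> int:
    """Run a child command and return its exit code without raising on failure."""
    completed = subprocess.run(cmd, check=False)
    return completed.returncode


# Typical use at the end of a launcher (not executed here):
#     sys.exit(run_and_return_code(["torchrun", ...]))
```

With check=False, a failing child does not raise CalledProcessError, so the caller can forward the code unchanged through sys.exit.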

Usage Examples

# CLI usage (shell commands)
# Single-GPU SFT training
# llamafactory-cli sft config.yaml

# Multi-GPU auto-detected distributed training
# llamafactory-cli sft config.yaml  (auto-launches torchrun if >1 GPU)

# Multi-node training
# NNODES=2 NODE_RANK=0 MASTER_ADDR=10.0.0.1 llamafactory-cli sft config.yaml

# Elastic launch
# RDZV_ID=my_job MIN_NNODES=1 MAX_NNODES=4 llamafactory-cli sft config.yaml

# Interactive chat
# llamafactory-cli chat --model path/to/model

# Direct Python usage: launch() reads the command and its arguments from sys.argv
from llamafactory.v1.launcher import launch
launch()

Related Pages

Hiyouga_LLaMA_Factory_V1_Base_Trainer - The trainer classes that launch() routes to.
Hiyouga_LLaMA_Factory_V1_Model_Engine - Model initialization triggered during training.
Hiyouga_LLaMA_Factory_V1_Data_Engine - Data loading triggered during training.
Hiyouga_LLaMA_Factory_V1_Base_Sampler - The sampler used for the chat command.
