Implementation:Microsoft DeepSpeedExamples LLaMA Tensor Parallel Finetuning

Knowledge Sources	Microsoft_DeepSpeedExamples
Domains	Tensor Parallelism, Large Language Models
Last Updated	2026-02-07 12:00 GMT

Overview

Implements LLaMA-style causal language model fine-tuning using DeepSpeed with tensor parallelism, including supervised dataset preprocessing, tokenizer embedding resizing, and memory usage monitoring.

Description

This script provides an end-to-end supervised fine-tuning pipeline for causal language models (e.g., LLaMA, OPT) with DeepSpeed tensor parallelism support. It defines dataclass-based argument parsing for ModelArguments, DataArguments, and TrainingArguments (extending HuggingFace TrainingArguments), and uses Alpaca-style instruction prompt templates with configurable input/no-input formats.
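
The choice between the input and no-input templates can be sketched as follows. The template wording here is assumed from the publicly known Alpaca prompt format and may not match the strings in train.py; format_prompt is a hypothetical helper name used only for illustration.

```python
# Assumption: template text follows the public Alpaca format, not necessarily train.py.
PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}


def format_prompt(example: dict) -> str:
    """Hypothetical helper: pick a template based on whether 'input' is non-empty."""
    if example.get("input", ""):
        return PROMPT_DICT["prompt_input"].format_map(example)
    return PROMPT_DICT["prompt_no_input"].format_map(example)


if __name__ == "__main__":
    print(format_prompt({"instruction": "Add 2 and 3.", "input": "", "output": "5"}))
```

Each record in the training JSON is expected to carry instruction, input, and output fields; the formatted prompt becomes the source text and output becomes the target.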

The data pipeline includes smart_tokenizer_and_embedding_resize, which adds any missing special tokens (PAD, EOS, BOS, UNK) and resizes the model embeddings, initializing the rows for new tokens to the mean of the existing embeddings. The SupervisedDataset class loads the JSON data, formats prompts, and tokenizes them with the "longest" padding strategy. The preprocess function sets the source (prompt) positions in labels to IGNORE_INDEX (-100) so that the loss is computed only on target tokens. DataCollatorForSupervisedDataset pads each batch dynamically to the length of its longest sequence.
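
A minimal sketch of the label masking and dynamic padding described above, written with plain Python lists so it runs without torch. The function names mask_source_tokens and collate are hypothetical; the script itself operates on tensors and may build the attention mask differently (for example by comparing against the pad token ID).

```python
from typing import Dict, List, Sequence

IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def mask_source_tokens(input_ids: List[int], source_len: int) -> List[int]:
    """Copy input_ids into labels and hide the first source_len (prompt) positions."""
    labels = list(input_ids)
    n = min(source_len, len(labels))
    labels[:n] = [IGNORE_INDEX] * n
    return labels


def collate(instances: Sequence[Dict[str, List[int]]],
            pad_token_id: int) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence in the batch to the longest one."""
    max_len = max(len(x["input_ids"]) for x in instances)
    batch: Dict[str, List[List[int]]] = {
        "input_ids": [], "labels": [], "attention_mask": []
    }
    for x in instances:
        real_len = len(x["input_ids"])
        pad = max_len - real_len
        batch["input_ids"].append(x["input_ids"] + [pad_token_id] * pad)
        # Padded label positions are ignored by the loss as well.
        batch["labels"].append(x["labels"] + [IGNORE_INDEX] * pad)
        batch["attention_mask"].append([1] * real_len + [0] * pad)
    return batch


if __name__ == "__main__":
    ids = [5, 6, 7, 8, 9]           # 3 prompt tokens followed by 2 target tokens
    labels = mask_source_tokens(ids, source_len=3)
    print(collate([{"input_ids": ids, "labels": labels},
                   {"input_ids": [1, 2], "labels": [IGNORE_INDEX, 2]}],
                  pad_token_id=0))
```

Padding labels with IGNORE_INDEX rather than the pad token ID keeps padded positions out of the loss, which is the same reason the prompt positions are masked.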

The train function orchestrates model loading via AutoModelForCausalLM, tokenizer configuration with right-padding, dataset preparation, and training via the HuggingFace Trainer. It includes a MemoryCallback that logs GPU memory allocation, peak memory, and CPU virtual memory usage after each training step. Dataset tokenization results are cached to a pickle file for faster subsequent runs.
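
The pickle caching step can be illustrated with a small load-or-build helper. This is a sketch under assumptions: load_or_build and build_fn are hypothetical names, and the script's own caching logic may be structured differently.

```python
import os
import pickle
from typing import Any, Callable


def load_or_build(cache_path: str, build_fn: Callable[[], Any]) -> Any:
    """Return the cached object if cache_path exists, else build and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = build_fn()  # e.g. the expensive tokenization pass
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data


if __name__ == "__main__":
    # Hypothetical stand-in for tokenized data.
    result = load_or_build("dataset_dict.pkl", lambda: {"input_ids": [[1, 2, 3]]})
    print(result)
```

A cache keyed only on the file path is not invalidated when the training data, tokenizer, or maximum length changes, so the pickle file should be deleted manually after changing any of them.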

Usage

Use this script for fine-tuning LLaMA or similar causal LMs on instruction-following datasets with DeepSpeed tensor parallelism. It is designed to be launched with the DeepSpeed distributed launcher and a corresponding DeepSpeed configuration file that enables automatic tensor parallelism.
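
As a rough illustration of such a configuration file, the sketch below writes a small ds_config.json from Python. The "tensor_parallel" / "autotp_size" keys and the ZeRO stage are assumptions based on DeepSpeed's AutoTP training documentation and should be checked against the installed DeepSpeed version and the configs shipped with the example; make_ds_config is a hypothetical helper.

```python
import json


def make_ds_config(tp_size: int) -> dict:
    """Hypothetical helper building a minimal DeepSpeed config dict."""
    return {
        # "auto" lets the HuggingFace Trainer fill these in from its own arguments.
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "bf16": {"enabled": "auto"},
        # Assumption: AutoTP training is combined with a low ZeRO stage.
        "zero_optimization": {"stage": 1},
        # Assumption: key names for automatic tensor parallelism.
        "tensor_parallel": {"autotp_size": tp_size},
    }


def write_ds_config(path: str, tp_size: int) -> None:
    """Serialize the config so it can be passed via --deepspeed."""
    with open(path, "w") as f:
        json.dump(make_ds_config(tp_size), f, indent=2)


if __name__ == "__main__":
    write_ds_config("ds_config.json", tp_size=2)
```

The tensor-parallel degree normally has to divide the number of GPUs the launcher starts.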

Code Reference

Source Location

Repository: Microsoft_DeepSpeedExamples
File: training/tensor_parallel/train.py
Lines: 1-269

Signature

@dataclass
class ModelArguments:
    model_name_or_path: Optional[str] = field(default="facebook/opt-125m")

@dataclass
class DataArguments:
    data_path: str = field(default=None)

@dataclass
class TrainingArguments(transformers.TrainingArguments):
    cache_dir: Optional[str] = field(default=None)
    optim: str = field(default="adamw_torch")
    model_max_length: int = field(default=512)

def smart_tokenizer_and_embedding_resize(
    special_tokens_dict: Dict,
    tokenizer: transformers.PreTrainedTokenizer,
    model: transformers.PreTrainedModel,
):
    ...

class SupervisedDataset(Dataset):
    def __init__(self, data_path: str, tokenizer: transformers.PreTrainedTokenizer):
        ...

@dataclass
class DataCollatorForSupervisedDataset(object):
    tokenizer: transformers.PreTrainedTokenizer
    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        ...

def make_supervised_data_module(tokenizer, data_args) -> Dict:
    ...

def train():
    ...
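
To make the mean initialization performed by smart_tokenizer_and_embedding_resize concrete, here is a dependency-free sketch that treats the embedding matrix as a list of rows. mean_init_new_rows is a hypothetical name; the actual function works on the model's torch embedding tensors (presumably both input and output embeddings) after resizing them.

```python
from typing import List


def mean_init_new_rows(embeddings: List[List[float]],
                       num_new_tokens: int) -> List[List[float]]:
    """Append num_new_tokens rows, each set to the column-wise mean of the old rows."""
    old_rows = [list(row) for row in embeddings]
    if num_new_tokens <= 0:
        return old_rows
    n = len(old_rows)
    dim = len(old_rows[0])
    mean_row = [sum(row[j] for row in old_rows) / n for j in range(dim)]
    return old_rows + [list(mean_row) for _ in range(num_new_tokens)]


if __name__ == "__main__":
    table = [[1.0, 2.0], [3.0, 4.0]]
    print(mean_init_new_rows(table, num_new_tokens=1))
```

Starting new tokens at the mean of the existing embeddings keeps them in the same region of embedding space as the pretrained vocabulary, which tends to be gentler on early training than random initialization.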

Import

from train import SupervisedDataset, DataCollatorForSupervisedDataset, smart_tokenizer_and_embedding_resize, train

I/O Contract

Inputs

Name	Type	Required	Description
--model_name_or_path	str	No	HuggingFace model identifier or path (default: 'facebook/opt-125m')
--data_path	str	Yes	Path to the JSON training data file in Alpaca format
--cache_dir	str	No	Cache directory for model and tokenizer downloads
--optim	str	No	Optimizer name (default: 'adamw_torch')
--model_max_length	int	No	Maximum sequence length for tokenization (default: 512)
--output_dir	str	Yes	Directory to save the trained model (inherited from TrainingArguments)

Outputs

Name	Type	Description
saved model	directory	HuggingFace model checkpoint saved to output_dir
dataset_dict.pkl	file	Cached tokenized dataset for faster reloading
stdout	text	Memory usage statistics after each training step
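
The per-step memory statistics could be gathered roughly as shown below. This is a sketch, not the script's MemoryCallback: MemoryLogger is a hypothetical stand-in that mirrors the on_step_end hook of transformers.TrainerCallback without subclassing it, and torch and psutil are imported lazily so the snippet also runs where they are not installed.

```python
def collect_memory_stats() -> dict:
    """Return GPU and CPU memory figures in GiB for whatever is available."""
    gib = 1024 ** 3
    stats = {}
    try:
        import torch
        if torch.cuda.is_available():
            stats["gpu_allocated"] = torch.cuda.memory_allocated() / gib
            stats["gpu_peak"] = torch.cuda.max_memory_allocated() / gib
    except ImportError:
        pass  # torch not installed: skip GPU figures
    try:
        import psutil
        stats["cpu_used"] = psutil.virtual_memory().used / gib
    except ImportError:
        pass  # psutil not installed: skip CPU figures
    return stats


def format_memory_stats(step: int, stats: dict) -> str:
    """Render one log line with keys in sorted order."""
    parts = [f"{name}={value:.2f} GiB" for name, value in sorted(stats.items())]
    return f"step {step}: " + ", ".join(parts)


class MemoryLogger:
    """Hypothetical stand-in; a real callback would subclass transformers.TrainerCallback."""

    def on_step_end(self, args, state, control, **kwargs):
        print(format_memory_stats(state.global_step, collect_memory_stats()))


if __name__ == "__main__":
    print(format_memory_stats(0, collect_memory_stats()))
```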

Usage Examples

# Launch with DeepSpeed tensor parallelism
# deepspeed train.py --model_name_or_path meta-llama/Llama-2-7b-hf \
#     --data_path ./alpaca_data.json \
#     --output_dir ./output \
#     --model_max_length 512 \
#     --deepspeed ds_config.json

# Programmatic dataset usage
import transformers
from train import DataArguments, make_supervised_data_module

tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/opt-125m")
# data_args must be constructed before use; data_path points to the Alpaca-format JSON file
data_args = DataArguments(data_path="./alpaca_data.json")
data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)
# Returns dict with train_dataset, eval_dataset, data_collator
