Implementation:Microsoft DeepSpeedExamples LLaMA Tensor Parallel Finetuning

From Leeroopedia


Knowledge Sources
Domains: Tensor Parallelism, Large Language Models
Last Updated: 2026-02-07 12:00 GMT

Overview

Implements LLaMA-style causal language model fine-tuning using DeepSpeed with tensor parallelism, including supervised dataset preprocessing, tokenizer embedding resizing, and memory usage monitoring.

Description

This script provides an end-to-end supervised fine-tuning pipeline for causal language models (e.g., LLaMA, OPT) with DeepSpeed tensor parallelism support. It defines dataclass-based argument parsing for ModelArguments, DataArguments, and TrainingArguments (extending HuggingFace TrainingArguments), and uses Alpaca-style instruction prompt templates with configurable input/no-input formats.
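
As a rough illustration of the input/no-input prompt formatting described above, the sketch below selects a template depending on whether an example carries an input field. The template wording follows the publicly known Stanford Alpaca format and is an assumption about this script; `PROMPT_DICT` and `format_prompt` are illustrative names, not confirmed from the source.

```python
# Alpaca-style templates. The exact wording is assumed from the public
# Stanford Alpaca format, not taken from this script.
PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}


def format_prompt(example: dict) -> str:
    """Pick the template based on whether the example has a non-empty input."""
    key = "prompt_input" if example.get("input", "") != "" else "prompt_no_input"
    # format_map ignores extra keys such as "output".
    return PROMPT_DICT[key].format_map(example)


if __name__ == "__main__":
    print(format_prompt({"instruction": "Add the numbers.", "input": "1, 2"}))
```

In a supervised setup of this kind, the formatted prompt is the source text and the example's output field (plus an end-of-sequence token) is the target.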

The data pipeline includes smart_tokenizer_and_embedding_resize, which safely adds special tokens (PAD, EOS, BOS, UNK) and resizes the model embeddings, initializing each new token's embedding to the mean of the existing embeddings. The SupervisedDataset class loads JSON data, formats prompts, and tokenizes them with the "longest" padding strategy. The preprocess function sets the label positions of source (prompt) tokens to IGNORE_INDEX (-100) so that the loss is computed only on target tokens. DataCollatorForSupervisedDataset pads each batch dynamically to the length of its longest sequence.
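
The three ideas in this paragraph (mean initialization of new embedding rows, label masking, and dynamic padding) can be sketched with plain Python lists. This is a simplified stand-in, not the script's code: the actual implementation operates on PyTorch tensors, and the helper names `mean_init_new_rows`, `mask_source`, and `pad_batch` are hypothetical.

```python
from typing import List

# -100 is the default ignore_index of PyTorch's cross-entropy loss.
IGNORE_INDEX = -100


def mean_init_new_rows(embeddings: List[List[float]], num_new: int) -> List[List[float]]:
    """Append num_new rows, each set to the column-wise mean of the existing rows."""
    n, dim = len(embeddings), len(embeddings[0])
    mean = [sum(row[j] for row in embeddings) / n for j in range(dim)]
    return embeddings + [list(mean) for _ in range(num_new)]


def mask_source(input_ids: List[int], source_len: int) -> List[int]:
    """Copy input_ids into labels and hide the first source_len (prompt) tokens."""
    labels = list(input_ids)
    labels[:source_len] = [IGNORE_INDEX] * min(source_len, len(labels))
    return labels


def pad_batch(seqs: List[List[int]], pad_value: int) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_value] * (max_len - len(s)) for s in seqs]


if __name__ == "__main__":
    labels = mask_source([5, 6, 7, 8, 9], source_len=3)
    print(labels)                                    # prompt positions are masked
    print(pad_batch([labels, [8]], IGNORE_INDEX))    # labels padded with IGNORE_INDEX
    print(mean_init_new_rows([[1.0, 2.0], [3.0, 4.0]], 1))
```

Padding labels with IGNORE_INDEX rather than the pad token id keeps padded positions out of the loss, in the same way as the masked prompt positions.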

The train function orchestrates model loading via AutoModelForCausalLM, tokenizer configuration with right-padding, dataset preparation, and training via the HuggingFace Trainer. It includes a MemoryCallback that logs GPU memory allocation, peak memory, and CPU virtual memory usage after each training step. Dataset tokenization results are cached to a pickle file for faster subsequent runs.
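
The pickle-based caching mentioned above can be approximated by a small load-or-build helper. This is a minimal sketch under assumptions: `load_or_build` and its `build_fn` parameter are hypothetical names, and the script's own caching logic may differ in detail.

```python
import os
import pickle
from typing import Any, Callable


def load_or_build(cache_path: str, build_fn: Callable[[], Any]) -> Any:
    """Return the cached object if cache_path exists; otherwise build and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = build_fn()  # e.g. the expensive tokenization step
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result


if __name__ == "__main__":
    data = load_or_build("dataset_dict.pkl", lambda: {"input_ids": [[1, 2, 3]]})
    print(data)
```

A cache of this kind is keyed only by file name, so it should be deleted whenever the training data, tokenizer, or maximum sequence length changes; otherwise stale tokenization would be reused silently.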

Usage

Use this script for fine-tuning LLaMA or similar causal LMs on instruction-following datasets with DeepSpeed tensor parallelism. It is designed to be launched with the DeepSpeed distributed launcher and a corresponding DeepSpeed configuration file that enables automatic tensor parallelism.
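
For orientation, the sketch below writes a small DeepSpeed configuration file from Python. Every value here is an assumption rather than the configuration shipped with this example: in particular, the `tensor_parallel` / `autotp_size` key is how recent DeepSpeed releases are understood to enable automatic tensor parallelism for training, and it should be checked against the documentation of the installed DeepSpeed version. The helper name `write_ds_config` is hypothetical.

```python
import json

# Assumed values: adjust to the installed DeepSpeed version and hardware.
ds_config = {
    # "auto" lets the HuggingFace Trainer fill these in from its own arguments.
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {"stage": 1},
    # Assumption: automatic tensor parallelism is enabled through this key,
    # with autotp_size set to the number of GPUs per tensor-parallel group.
    "tensor_parallel": {"autotp_size": 2},
}


def write_ds_config(path: str, config: dict) -> None:
    """Serialize the configuration dictionary to a JSON file."""
    with open(path, "w") as f:
        json.dump(config, f, indent=2)


if __name__ == "__main__":
    write_ds_config("ds_config.json", ds_config)
```

The resulting file would be passed to the script through the --deepspeed argument.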

Code Reference

Source Location

Signature

@dataclass
class ModelArguments:
    model_name_or_path: Optional[str] = field(default="facebook/opt-125m")

@dataclass
class DataArguments:
    data_path: str = field(default=None)

@dataclass
class TrainingArguments(transformers.TrainingArguments):
    cache_dir: Optional[str] = field(default=None)
    optim: str = field(default="adamw_torch")
    model_max_length: int = field(default=512)

def smart_tokenizer_and_embedding_resize(
    special_tokens_dict: Dict,
    tokenizer: transformers.PreTrainedTokenizer,
    model: transformers.PreTrainedModel,
):
    ...

class SupervisedDataset(Dataset):
    def __init__(self, data_path: str, tokenizer: transformers.PreTrainedTokenizer):
        ...

@dataclass
class DataCollatorForSupervisedDataset(object):
    tokenizer: transformers.PreTrainedTokenizer
    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        ...

def make_supervised_data_module(tokenizer, data_args) -> Dict:
    ...

def train():
    ...

Import

from train import SupervisedDataset, DataCollatorForSupervisedDataset, smart_tokenizer_and_embedding_resize, train

I/O Contract

Inputs

Name | Type | Required | Description
--model_name_or_path | str | No | HuggingFace model identifier or path (default: 'facebook/opt-125m')
--data_path | str | Yes | Path to the JSON training data file in Alpaca format
--cache_dir | str | No | Cache directory for model and tokenizer downloads
--optim | str | No | Optimizer name (default: 'adamw_torch')
--model_max_length | int | No | Maximum sequence length for tokenization (default: 512)
--output_dir | str | Yes | Directory to save the trained model (inherited from TrainingArguments)

Outputs

Name | Type | Description
saved model | directory | HuggingFace model checkpoint saved to output_dir
dataset_dict.pkl | file | Cached tokenized dataset for faster reloading
stdout | text | Memory usage statistics after each training step

Usage Examples

# Launch with DeepSpeed tensor parallelism
# deepspeed train.py --model_name_or_path meta-llama/Llama-2-7b-hf \
#     --data_path ./alpaca_data.json \
#     --output_dir ./output \
#     --model_max_length 512 \
#     --deepspeed ds_config.json

# Programmatic dataset usage
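import transformers
from train import DataArguments, make_supervised_data_module

tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/opt-125m")
# data_args carries the path to the Alpaca-format JSON file (example path)
data_args = DataArguments(data_path="./alpaca_data.json")
data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)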
# Returns dict with train_dataset, eval_dataset, data_collator
