Implementation:Microsoft DeepSpeedExamples LLaMA Tensor Parallel Finetuning
| Knowledge Sources | |
|---|---|
| Domains | Tensor Parallelism, Large Language Models |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
Implements LLaMA-style causal language model fine-tuning using DeepSpeed with tensor parallelism, including supervised dataset preprocessing, tokenizer embedding resizing, and memory usage monitoring.
Description
This script provides an end-to-end supervised fine-tuning pipeline for causal language models (e.g., LLaMA, OPT) with DeepSpeed tensor parallelism support. It defines dataclass-based argument parsing for ModelArguments, DataArguments, and TrainingArguments (extending HuggingFace TrainingArguments), and uses Alpaca-style instruction prompt templates with configurable input/no-input formats.
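The choice between the input and no-input prompt formats can be sketched as follows. This is a minimal illustration, not the script's own code: the template wording is assumed to mirror the Stanford Alpaca prompts, and `build_prompt` is a hypothetical helper name.

```python
# Templates assumed to mirror the Stanford Alpaca prompts;
# the exact wording used in train.py may differ.
PROMPT_DICT = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:"
    ),
}


def build_prompt(example: dict) -> str:
    """Hypothetical helper: use the input template only when `input` is non-empty."""
    key = "prompt_input" if example.get("input", "") != "" else "prompt_no_input"
    # format_map ignores extra keys such as "output"
    return PROMPT_DICT[key].format_map(example)
```

Each record is expected to carry `instruction`, an optional `input`, and `output`; only the first two feed the prompt, while `output` becomes the training target.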
The data pipeline includes smart_tokenizer_and_embedding_resize, which adds any missing special tokens (PAD, EOS, BOS, UNK) and resizes the model embeddings, initializing the new token rows to the mean of the existing embeddings. The SupervisedDataset class loads JSON data, formats prompts, and tokenizes them using the "longest" padding strategy. The preprocess function masks source (prompt) tokens in the labels with IGNORE_INDEX (-100) so that the loss is computed only on target tokens. A DataCollatorForSupervisedDataset pads each batch dynamically to the length of its longest sequence.
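The label masking and dynamic padding described above can be sketched in plain Python. This is an illustrative simplification under stated assumptions: it works on lists rather than tensors, `mask_source_tokens` and `pad_batch` are hypothetical names, and the attention mask is derived from sequence lengths, which may differ from how the script computes it.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def mask_source_tokens(input_ids: List[int], source_len: int) -> List[int]:
    """Copy input_ids into labels and mask the first source_len (prompt) positions."""
    labels = list(input_ids)
    labels[:source_len] = [IGNORE_INDEX] * min(source_len, len(labels))
    return labels


def pad_batch(instances: List[Dict[str, List[int]]], pad_token_id: int) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the longest sequence in this batch."""
    max_len = max(len(inst["input_ids"]) for inst in instances)
    batch: Dict[str, List[List[int]]] = {"input_ids": [], "labels": [], "attention_mask": []}
    for inst in instances:
        n_real = len(inst["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(inst["input_ids"] + [pad_token_id] * n_pad)
        # Padded label positions are masked too, so they never contribute to the loss
        batch["labels"].append(inst["labels"] + [IGNORE_INDEX] * n_pad)
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
    return batch
```

Padding per batch rather than to a fixed global length keeps short batches small, which is the point of handling padding in the collator instead of at tokenization time.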
The train function orchestrates model loading via AutoModelForCausalLM, tokenizer configuration with right-padding, dataset preparation, and training via the HuggingFace Trainer. It includes a MemoryCallback that logs GPU memory allocation, peak memory, and CPU virtual memory usage after each training step. Dataset tokenization results are cached to a pickle file for faster subsequent runs.
Usage
Use this script for fine-tuning LLaMA or similar causal LMs on instruction-following datasets with DeepSpeed tensor parallelism. It is designed to be launched with the DeepSpeed distributed launcher and a corresponding DeepSpeed configuration file that enables automatic tensor parallelism.
Code Reference
Source Location
- Repository: Microsoft_DeepSpeedExamples
- File: training/tensor_parallel/train.py
- Lines: 1-269
Signature
@dataclass
class ModelArguments:
    model_name_or_path: Optional[str] = field(default="facebook/opt-125m")

@dataclass
class DataArguments:
    data_path: str = field(default=None)

@dataclass
class TrainingArguments(transformers.TrainingArguments):
    cache_dir: Optional[str] = field(default=None)
    optim: str = field(default="adamw_torch")
    model_max_length: int = field(default=512)

def smart_tokenizer_and_embedding_resize(
    special_tokens_dict: Dict,
    tokenizer: transformers.PreTrainedTokenizer,
    model: transformers.PreTrainedModel,
):
    ...

class SupervisedDataset(Dataset):
    def __init__(self, data_path: str, tokenizer: transformers.PreTrainedTokenizer):
        ...

@dataclass
class DataCollatorForSupervisedDataset(object):
    tokenizer: transformers.PreTrainedTokenizer

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        ...

def make_supervised_data_module(tokenizer, data_args) -> Dict:
    ...

def train():
    ...
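The mean initialization that smart_tokenizer_and_embedding_resize applies to new token embeddings can be illustrated on a plain list-of-lists matrix. This is only a sketch of the idea: `resize_with_mean_init` is a hypothetical name, and the real function operates on the model's embedding tensors rather than Python lists.

```python
from typing import List


def resize_with_mean_init(embeddings: List[List[float]], num_new_tokens: int) -> List[List[float]]:
    """Append num_new_tokens rows, each set to the column-wise mean of the existing rows."""
    resized = [row[:] for row in embeddings]  # leave the caller's matrix untouched
    if num_new_tokens <= 0:
        return resized
    n_rows = len(embeddings)
    dim = len(embeddings[0])
    mean_row = [sum(row[j] for row in embeddings) / n_rows for j in range(dim)]
    resized.extend(mean_row[:] for _ in range(num_new_tokens))
    return resized
```

Starting new rows at the mean of the existing embeddings places the added tokens inside the distribution the model already knows, rather than at a random point in embedding space.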
Import
from train import SupervisedDataset, DataCollatorForSupervisedDataset, smart_tokenizer_and_embedding_resize, train
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --model_name_or_path | str | No | HuggingFace model identifier or path (default: 'facebook/opt-125m') |
| --data_path | str | Yes | Path to the JSON training data file in Alpaca format |
| --cache_dir | str | No | Cache directory for model and tokenizer downloads |
| --optim | str | No | Optimizer name (default: 'adamw_torch') |
| --model_max_length | int | No | Maximum sequence length for tokenization (default: 512) |
| --output_dir | str | Yes | Directory to save the trained model (inherited from TrainingArguments) |
Outputs
| Name | Type | Description |
|---|---|---|
| saved model | directory | HuggingFace model checkpoint saved to output_dir |
| dataset_dict.pkl | file | Cached tokenized dataset for faster reloading |
| stdout | text | Memory usage statistics after each training step |
Usage Examples
# Launch with DeepSpeed tensor parallelism
# deepspeed train.py --model_name_or_path meta-llama/Llama-2-7b-hf \
# --data_path ./alpaca_data.json \
# --output_dir ./output \
# --model_max_length 512 \
# --deepspeed ds_config.json
# Programmatic dataset usage
import transformers
from train import DataArguments, make_supervised_data_module
tokenizer = transformers.AutoTokenizer.from_pretrained("facebook/opt-125m")
# data_args must be constructed explicitly; data_path points at an Alpaca-format JSON file
data_args = DataArguments(data_path="./alpaca_data.json")
data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)
# Returns a dict with train_dataset, eval_dataset, data_collator