Implementation:Microsoft BIPIA HF Trainer For Defense

From Leeroopedia
Field         Value
Sources       Repo, Doc: HuggingFace Trainer
Domains       NLP, Distributed_Training, Defense
Last Updated  2026-02-14

Overview

A concrete tool for distributed defense finetuning, provided by the BIPIA defense module: it wraps the HuggingFace transformers Trainer API and runs under DeepSpeed ZeRO Stage 3.

Description

The train() function in finetune.py creates a HuggingFace Trainer from the prepared model, the tokenized dataset, and DataCollatorWithPaddingAndLabel, which extends DataCollatorWithPadding to also pad labels with IGNORE_TOKEN_ID. It validates that model_structure == "special_token", initializes W&B logging, calls trainer.train(), and then saves the model via safe_save_model_for_hf_trainer(), which gathers the state_dict to CPU before writing it to disk. The DeepSpeed config (ds_config.json) sets ZeRO Stage 3 with bf16, pin_memory, and gradient_accumulation_steps. This is a Wrapper Doc around HuggingFace's Trainer.
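
The label-padding behavior described above can be sketched as follows. This is an illustrative reconstruction, not the repo code (the real class is in defense/white_box/utils.py, L95-119), and it assumes IGNORE_TOKEN_ID is -100, transformers' conventional ignore index:

import torch
from transformers import DataCollatorWithPadding

IGNORE_TOKEN_ID = -100  # assumed value; positions with this id are skipped by the loss

class DataCollatorWithPaddingAndLabel(DataCollatorWithPadding):
    def __call__(self, features):
        # tokenizer.pad() only handles input_ids/attention_mask, so pad labels by hand
        labels = [f.pop("labels") for f in features]
        batch = super().__call__(features)  # pads input_ids and attention_mask
        max_len = batch["input_ids"].shape[1]
        batch["labels"] = torch.tensor(
            [lab + [IGNORE_TOKEN_ID] * (max_len - len(lab)) for lab in labels]
        )
        return batch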

Usage

Run via torchrun/deepspeed:

torchrun --nproc_per_node=8 defense/white_box/finetune.py \
    --model_structure special_token \
    --llm_config_file config/vicuna_13b.yaml \
    --deepspeed defense/white_box/ds_config.json \
    ...

Code Reference

Source
BIPIA repo
Files
  • defense/white_box/finetune.py (L477-549, train function; L162-168, safe_save_model)
  • defense/white_box/utils.py (L95-119, DataCollatorWithPaddingAndLabel)
  • defense/white_box/ds_config.json (DeepSpeed config)
Signatures
def train() -> None
Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=dataset,
    data_collator=DataCollatorWithPaddingAndLabel(tokenizer),
)
def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str) -> None
Import
from transformers import Trainer
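
A minimal sketch of the CPU-gather save pattern the safe_save_model_for_hf_trainer signature refers to. This is an assumed reconstruction, not the code at finetune.py L162-168; note that under ZeRO Stage 3, DeepSpeed must also be configured to gather full weights on save:

import transformers

def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str) -> None:
    # Collect the full state_dict, move it to CPU on the saving rank, then write it out.
    state_dict = trainer.model.state_dict()
    if trainer.args.should_save:
        cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
        del state_dict  # free GPU-side references before writing
        trainer._save(output_dir, state_dict=cpu_state_dict)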

I/O Contract

Inputs

Name           Type               Required  Description
model          PreTrainedModel    Yes       Model with resized embeddings for special tokens
training_args  TrainingArguments  Yes       lr, epochs, batch_size, output_dir, deepspeed config path
train_dataset  Dataset            Yes       Tokenized dataset with input_ids, attention_mask, labels
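
For illustration, one row of a conforming train_dataset might look like this (token ids are made up; -100 is the assumed IGNORE_TOKEN_ID):

from datasets import Dataset

train_dataset = Dataset.from_list([
    {
        "input_ids":      [1, 887, 526, 263, 13563],       # prompt + response tokens
        "attention_mask": [1, 1, 1, 1, 1],
        "labels":         [-100, -100, -100, 263, 13563],  # loss computed only on response tokens
    },
])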

Outputs

Output                  Description
Saved model checkpoint  state_dict collected on CPU, saved at output_dir
trainer_state.json      Trainer state metadata and training history
W&B logs                Weights & Biases experiment tracking logs

Usage Examples

CLI invocation with key arguments:

torchrun --nproc_per_node=8 defense/white_box/finetune.py \
    --model_structure special_token \
    --llm_config_file config/vicuna_13b.yaml \
    --deepspeed defense/white_box/ds_config.json \
    --output_dir output/vicuna_13b_defense \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --learning_rate 2e-5 \
    --bf16 True

DeepSpeed ZeRO Stage 3 config structure (ds_config.json):

{
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "none"
        },
        "offload_param": {
            "device": "none"
        },
        "overlap_comm": true,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "data_sampling": {
        "data_efficiency": {
            "enabled": false
        }
    }
}
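
The "auto" entries are resolved by the HuggingFace Trainer from the TrainingArguments at launch. A minimal sketch of that wiring, with values mirroring the CLI example above:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output/vicuna_13b_defense",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    bf16=True,
    deepspeed="defense/white_box/ds_config.json",  # "auto" fields filled from these args
)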
