
Environment: Microsoft BIPIA DeepSpeed Finetuning Environment

From Leeroopedia
Domains: Infrastructure, Distributed_Training
Last Updated: 2026-02-14 15:00 GMT

Overview

Multi-GPU DeepSpeed ZeRO Stage 3 environment with 8x V100 GPUs for white-box defense fine-tuning of LLMs against indirect prompt injection attacks.

Description

This environment provides the distributed training infrastructure for the BIPIA white-box defense fine-tuning pipeline. It uses DeepSpeed ZeRO Stage 3 with CPU optimizer offloading, fp16 mixed precision training, and gradient checkpointing. The training pipeline is built on HuggingFace Transformers Trainer with DeepSpeed integration. The configuration uses cosine learning rate scheduling with warmup, AdamW optimizer, and a maximum sequence length of 2048 tokens.
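The distributed settings above combine with the 8-GPU topology to determine the effective global batch size. A minimal sketch of that arithmetic, using the per-device batch size of 4 and gradient-accumulation factor of 4 from the training command shown later on this page:

```python
# Effective global batch size under DeepSpeed data parallelism:
# each GPU processes its own micro-batch, and gradients are
# accumulated over several steps before each optimizer update.
def effective_batch_size(per_device_batch: int,
                         grad_accum_steps: int,
                         num_gpus: int) -> int:
    return per_device_batch * grad_accum_steps * num_gpus

# Values from the training command on this page, on 8x V100.
print(effective_batch_size(4, 4, 8))  # 128 sequences per optimizer step
```

This is the value DeepSpeed resolves for `"train_batch_size": "auto"` under this launch configuration.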

Usage

Use this environment when running white-box defense fine-tuning experiments. It is required by the HF_Trainer_For_Defense, Load_Bipia_Supervised_Data_Module, Tokenize_Fn, and Smart_Tokenizer_And_Embedding_Resize implementations. Launch training via the `deepspeed` CLI launcher rather than a plain `python` invocation.

System Requirements

| Category | Requirement | Notes |
| --- | --- | --- |
| OS | Ubuntu 20.04 LTS | Tested configuration |
| Hardware | 8x NVIDIA V100 GPUs | Used in paper experiments; A100/H100 also viable |
| RAM | Sufficient for CPU offloading | ZeRO Stage 3 offloads optimizer states to CPU |
| Disk | Sufficient for model checkpoints | Saves checkpoints every 100 steps, keeping up to 100 total |
| Python | >= 3.8 | Required by the `bipia` package |

Dependencies

Python Packages

  • `deepspeed` >= 0.9.5
  • `torch` >= 2.0.1
  • `transformers` >= 4.34.0
  • `accelerate` >= 0.15.0
  • `peft` (any version)
  • `wandb` (for experiment tracking)
  • `datasets` >= 2.8.0
  • `jsonlines` (any version)

Credentials

  • `WANDB_PROJECT`: Weights & Biases project name (configured via TrainingArguments or environment variable).
  • `WANDB_RUN`: Weights & Biases run name (configured via TrainingArguments or environment variable).
  • `auth_token`: HuggingFace token for gated models (configured in YAML config file).
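The two WandB variables can be set as environment variables before launch. A minimal sketch; the project and run names here (`bipia-defense`, `whitebox-run-1`) are placeholder values, not from the source:

```python
import os

# WandB reads these automatically when the HF Trainer initializes
# its wandb integration; the names below are placeholders.
os.environ["WANDB_PROJECT"] = "bipia-defense"
os.environ["WANDB_RUN"] = "whitebox-run-1"

print(os.environ["WANDB_PROJECT"], os.environ["WANDB_RUN"])
```

Exporting the same variables in the shell before running `deepspeed` works equally well.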

Quick Install

# Install bipia with all dependencies
pip install .

# Verify DeepSpeed installation
ds_report

Code Evidence

DeepSpeed ZeRO Stage 3 configuration from `defense/white_box/ds_config.json:1-18`:

{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "fp16": {
        "enabled": "auto",
        "auto_cast": true
    }
}
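Because the batch-size fields hold the literal string `"auto"` (resolved by the HuggingFace Trainer at launch time), a quick pre-flight check is to parse the file and assert only the fields that must be concrete. A minimal sketch, inlining the config shown above:

```python
import json

# The same JSON shown above; "auto" values are resolved by the
# HuggingFace Trainer when training launches.
DS_CONFIG = """
{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": true},
        "overlap_comm": true,
        "contiguous_gradients": true,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "fp16": {"enabled": "auto", "auto_cast": true}
}
"""

config = json.loads(DS_CONFIG)
zero = config["zero_optimization"]
assert zero["stage"] == 3                            # ZeRO Stage 3
assert zero["offload_optimizer"]["device"] == "cpu"  # CPU offload enabled
print("ds_config sanity check passed")
```

In practice you would `json.load()` the real `defense/white_box/ds_config.json` instead of an inline string.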

Training arguments with gradient checkpointing from `defense/README.md:106-144`:

deepspeed finetune.py \
  --fp16 True --fp16_opt_level O2 \
  --max_steps 1000 \
  --per_device_train_batch_size 4 \
  --gradient_accumulation_steps 4 \
  --learning_rate 2e-5 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type cosine \
  --model_max_length 2048 \
  --gradient_checkpointing True
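The warmup and decay arguments above can be sketched numerically. This mirrors the shape of the HuggingFace cosine-with-warmup schedule (linear warmup to the peak learning rate, then cosine decay to zero); it is a sketch of the curve, not the Trainer's exact implementation:

```python
import math

def lr_at_step(step: int, max_steps: int = 1000,
               peak_lr: float = 2e-5, warmup_ratio: float = 0.03) -> float:
    """Linear warmup followed by cosine decay to zero."""
    warmup_steps = int(max_steps * warmup_ratio)  # 30 steps here
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at_step(0))     # 0.0 (start of warmup)
print(lr_at_step(30))    # 2e-05 (peak, end of warmup)
print(lr_at_step(1000))  # 0.0 (end of schedule)
```

With `--warmup_ratio 0.03` and `--max_steps 1000`, the learning rate ramps up over the first 30 steps and decays over the remaining 970.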

`use_cache` disabled for gradient checkpointing, from `defense/white_box/finetune.py:516`:

model.config.use_cache = False

Common Errors

| Error Message | Cause | Solution |
| --- | --- | --- |
| `CUDA out of memory` during training | Batch size too large for available VRAM | Reduce `per_device_train_batch_size` or increase `gradient_accumulation_steps` |
| `RuntimeError: expected scalar type Half but found Float` | fp16 dtype mismatch | Ensure `--fp16 True --fp16_opt_level O2` is set |
| `ValueError: Invalid model_structure` | Wrong `model_structure` argument | Must be `special_token` (the only supported value) |
| Checkpoint resume failure | No `checkpoint-*` directories found | Training starts from scratch when no checkpoints exist; this is expected behavior |
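For the checkpoint-resume case, the resume logic boils down to finding the highest-numbered `checkpoint-*` subdirectory in the output folder (the directory-naming pattern is HuggingFace Trainer's convention; the helper name below is ours, not from the source):

```python
import os
import re

def latest_checkpoint(output_dir: str):
    """Return the highest-numbered checkpoint-* subdirectory, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best, best_step = None, -1
    for name in os.listdir(output_dir):
        m = pattern.match(name)
        if m and os.path.isdir(os.path.join(output_dir, name)):
            step = int(m.group(1))
            if step > best_step:
                best, best_step = os.path.join(output_dir, name), step
    return best  # None => train from scratch, matching the table above
```

If this returns `None`, nothing is passed to resume from and training simply starts fresh.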

Compatibility Notes

  • DeepSpeed ZeRO Stage 3: Optimizer states are offloaded to CPU with pinned memory for efficiency. This reduces GPU memory requirements but increases CPU RAM usage.
  • fp16 precision: Training uses fp16 with O2 optimization level. The model weights are gathered in 16-bit for saving (`stage3_gather_16bit_weights_on_model_save`).
  • Gradient checkpointing: Enabled by default to reduce memory usage during training. Requires `use_cache=False` on the model config.
  • WandB integration: Training logs are reported to Weights & Biases by default. Configure project and run names via arguments.
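The CPU-RAM cost of optimizer offloading can be estimated roughly: with AdamW, ZeRO keeps fp32 master weights plus two fp32 moment tensors, about 12 bytes per parameter on the CPU when the optimizer is fully offloaded. A back-of-envelope sketch; the 7B parameter count is an illustrative assumption, not from this page:

```python
def cpu_optimizer_bytes(num_params: int) -> int:
    """Rough ZeRO-3 CPU-offload footprint for AdamW:
    fp32 master weights + fp32 momentum + fp32 variance = 12 bytes/param.
    Ignores fragmentation, pinned-buffer overhead, and activations."""
    return num_params * 12

# Illustrative 7B-parameter model (assumption, not from this page):
print(cpu_optimizer_bytes(7_000_000_000) / 1e9)  # 84.0 (GB)
```

This is why the System Requirements table lists RAM only as "sufficient for CPU offloading": the requirement scales linearly with model size.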
