Environment: Microsoft BIPIA DeepSpeed Finetuning Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Training |
| Last Updated | 2026-02-14 15:00 GMT |
Overview
Multi-GPU DeepSpeed ZeRO Stage 3 environment with 8x V100 GPUs for white-box defense fine-tuning of LLMs against indirect prompt injection attacks.
Description
This environment provides the distributed training infrastructure for the BIPIA white-box defense fine-tuning pipeline. It uses DeepSpeed ZeRO Stage 3 with CPU optimizer offloading, fp16 mixed precision training, and gradient checkpointing. The training pipeline is built on HuggingFace Transformers Trainer with DeepSpeed integration. The configuration uses cosine learning rate scheduling with warmup, AdamW optimizer, and a maximum sequence length of 2048 tokens.
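The cosine learning rate schedule with warmup mentioned above can be sketched in plain Python. This is a simplified stand-in for HuggingFace's `get_cosine_schedule_with_warmup`, not the pipeline's actual scheduler; the base LR of 2e-5 and warmup ratio of 0.03 mirror the training flags documented in `defense/README.md`.

```python
import math

def cosine_lr_with_warmup(step, max_steps, base_lr=2e-5, warmup_ratio=0.03):
    """Sketch of a cosine-with-warmup schedule: linear warmup over the
    first warmup_ratio fraction of steps, then cosine decay to zero."""
    warmup_steps = int(max_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(cosine_lr_with_warmup(0, 1000))     # 0.0 (warmup begins at zero)
print(cosine_lr_with_warmup(30, 1000))    # peak LR once warmup completes
print(cosine_lr_with_warmup(1000, 1000))  # ~0.0 (fully decayed)
```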
Usage
Use this environment when running white-box defense fine-tuning experiments. It is required for the HF_Trainer_For_Defense, Load_Bipia_Supervised_Data_Module, Tokenize_Fn, and Smart_Tokenizer_And_Embedding_Resize implementations. The training script is launched via the `deepspeed` CLI launcher rather than by invoking `python` directly.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Ubuntu 20.04 LTS | Tested configuration |
| Hardware | 8x NVIDIA V100 GPUs | Used in paper experiments; A100/H100 also viable |
| RAM | Sufficient for CPU offloading | ZeRO Stage 3 offloads optimizer states to CPU |
| Disk | Sufficient for model checkpoints | Checkpoints are saved every 100 steps, retaining up to 100 in total |
| Python | >= 3.8 | Required by the bipia package |
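To size the "sufficient for CPU offloading" RAM requirement, a common rule of thumb is that offloaded AdamW state under ZeRO-3 needs roughly 12 bytes per parameter (fp32 master weights plus momentum and variance). This estimate is not taken from the repo and actual usage varies with DeepSpeed version and buffer settings:

```python
def cpu_ram_for_optimizer_offload(num_params, bytes_per_param=12):
    """Rough estimate (rule of thumb, not measured) of CPU RAM in GB
    consumed by ZeRO-3 offloaded AdamW state: fp32 master weights +
    momentum + variance ~= 12 bytes per parameter."""
    return num_params * bytes_per_param / 1e9  # GB

print(round(cpu_ram_for_optimizer_offload(7e9)))  # ~84 GB for a 7B-parameter model
```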
Dependencies
Python Packages
- `deepspeed` >= 0.9.5
- `torch` >= 2.0.1
- `transformers` >= 4.34.0
- `accelerate` >= 0.15.0
- `peft` (any version)
- `wandb` (for experiment tracking)
- `datasets` >= 2.8.0
- `jsonlines` (any version)
Credentials
- `WANDB_PROJECT`: Weights & Biases project name (configured via TrainingArguments or environment variable).
- `WANDB_RUN`: Weights & Biases run name (configured via TrainingArguments or environment variable).
- `auth_token`: HuggingFace token for gated models (configured in YAML config file).
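When configuring the WandB credentials via environment variables rather than TrainingArguments, the setup is a one-liner per variable. The project and run names below are hypothetical examples, not values from the repo:

```python
import os

# Example values only; WANDB_PROJECT / WANDB_RUN are read by the
# Trainer's wandb integration when experiment tracking is enabled.
os.environ["WANDB_PROJECT"] = "bipia-white-box-defense"  # project name (example)
os.environ["WANDB_RUN"] = "vicuna-7b-special-token"      # run name (example)
```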
Quick Install
# Install bipia with all dependencies
pip install .
# Verify DeepSpeed installation
ds_report
Code Evidence
DeepSpeed ZeRO Stage 3 configuration from `defense/white_box/ds_config.json:1-18`:
{
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"stage3_gather_16bit_weights_on_model_save": true
},
"fp16": {
"enabled": "auto",
"auto_cast": true
}
}
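The `"auto"` placeholders in this config are resolved by the HuggingFace Trainer from its own TrainingArguments at launch. A stdlib-only sanity check over the config above might look like the following (the assertions simply restate what the JSON declares):

```python
import json

# The JSON below is copied from defense/white_box/ds_config.json.
ds_config = json.loads("""
{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": true},
        "overlap_comm": true,
        "contiguous_gradients": true,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "fp16": {"enabled": "auto", "auto_cast": true}
}
""")

assert ds_config["zero_optimization"]["stage"] == 3
assert ds_config["zero_optimization"]["offload_optimizer"]["device"] == "cpu"

# Top-level "auto" fields that the HF Trainer fills in at launch:
auto_fields = [k for k, v in ds_config.items() if v == "auto"]
print(auto_fields)
# ['train_batch_size', 'train_micro_batch_size_per_gpu', 'gradient_accumulation_steps']
```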
Training arguments with gradient checkpointing from `defense/README.md:106-144`:
deepspeed finetune.py \
--fp16 True --fp16_opt_level O2 \
--max_steps 1000 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--learning_rate 2e-5 \
--warmup_ratio 0.03 \
--lr_scheduler_type cosine \
--model_max_length 2048 \
--gradient_checkpointing True
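These flags, combined with the 8-GPU setup, fix the global batch size that the "auto" fields in the DeepSpeed config resolve to (micro batch × accumulation steps × world size):

```python
# Effective (global) batch size implied by the flags above on 8x V100:
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 8  # world size in the tested configuration

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_gpus)
print(effective_batch_size)  # 128 samples per optimizer step
```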
use_cache disabled for gradient checkpointing from `defense/white_box/finetune.py:516`:
model.config.use_cache = False
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `CUDA out of memory` during training | Batch size too large for available VRAM | Reduce `per_device_train_batch_size` or increase `gradient_accumulation_steps` |
| `RuntimeError: expected scalar type Half but found Float` | fp16 dtype mismatch | Ensure `--fp16 True --fp16_opt_level O2` is set |
| `ValueError: Invalid model_structure` | Wrong model_structure argument | Must be `special_token` (the only supported value) |
| Checkpoint resume failure | No checkpoint-* directories found | Training starts from scratch if no checkpoints exist; this is expected behavior |
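The checkpoint-resume behavior in the last row can be sketched with HF-Trainer-style detection logic: pick the `checkpoint-<step>` directory with the highest step, or return `None` to start from scratch. This is an illustrative stand-in, not the repo's code; the step counts in the demo are hypothetical.

```python
import os
import re
import tempfile

def latest_checkpoint(output_dir):
    """Sketch of resume detection: return the path of the checkpoint-<step>
    directory with the largest step, or None if none exist (in which case
    training starts from scratch, as noted above)."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    if os.path.isdir(output_dir):
        for name in os.listdir(output_dir):
            m = pattern.match(name)
            if m and os.path.isdir(os.path.join(output_dir, name)):
                candidates.append((int(m.group(1)), name))
    if not candidates:
        return None
    return os.path.join(output_dir, max(candidates)[1])

# Demo in a scratch directory with made-up step counts:
scratch = tempfile.mkdtemp()
assert latest_checkpoint(scratch) is None  # no checkpoints -> fresh start
for step in (100, 200):
    os.makedirs(os.path.join(scratch, f"checkpoint-{step}"))
print(os.path.basename(latest_checkpoint(scratch)))  # checkpoint-200
```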
Compatibility Notes
- DeepSpeed ZeRO Stage 3: Optimizer states are offloaded to CPU with pinned memory for efficiency. This reduces GPU memory requirements but increases CPU RAM usage.
- fp16 precision: Training uses fp16 with O2 optimization level. The model weights are gathered in 16-bit for saving (`stage3_gather_16bit_weights_on_model_save`).
- Gradient checkpointing: Enabled by default to reduce memory usage during training. Requires `use_cache=False` on the model config.
- WandB integration: Training logs are reported to Weights & Biases by default. Configure project and run names via arguments.