Environment: Microsoft BIPIA DeepSpeed Finetuning Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Training |
| Last Updated | 2026-02-14 15:00 GMT |
Overview
Multi-GPU DeepSpeed ZeRO Stage 3 environment with 8x V100 GPUs for white-box defense fine-tuning of LLMs against indirect prompt injection attacks.
Description
This environment provides the distributed training infrastructure for the BIPIA white-box defense fine-tuning pipeline. It uses DeepSpeed ZeRO Stage 3 with CPU optimizer offloading, fp16 mixed precision training, and gradient checkpointing. The training pipeline is built on HuggingFace Transformers Trainer with DeepSpeed integration. The configuration uses cosine learning rate scheduling with warmup, AdamW optimizer, and a maximum sequence length of 2048 tokens.
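The cosine learning rate schedule with warmup mentioned above can be sketched in plain Python. This is a simplified stand-in for HuggingFace's `get_cosine_schedule_with_warmup`, not the pipeline's actual scheduler; the base LR of 2e-5 and warmup ratio of 0.03 mirror the training flags documented in `defense/README.md`.

```python
import math

def cosine_lr_with_warmup(step, max_steps, base_lr=2e-5, warmup_ratio=0.03):
    """Sketch of a cosine-with-warmup schedule: linear warmup over the
    first warmup_ratio fraction of steps, then cosine decay to zero."""
    warmup_steps = int(max_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(cosine_lr_with_warmup(0, 1000))     # 0.0 (warmup begins at zero)
print(cosine_lr_with_warmup(30, 1000))    # peak LR once warmup completes
print(cosine_lr_with_warmup(1000, 1000))  # ~0.0 (fully decayed)
```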
Usage
Use this environment when running white-box defense fine-tuning experiments. It is required for the HF_Trainer_For_Defense, Load_Bipia_Supervised_Data_Module, Tokenize_Fn, and Smart_Tokenizer_And_Embedding_Resize implementations. The training script is launched via the `deepspeed` CLI launcher rather than by invoking `python` directly.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Ubuntu 20.04 LTS | Tested configuration |
| Hardware | 8x NVIDIA V100 GPUs | Used in paper experiments; A100/H100 also viable |
| RAM | Sufficient for CPU offloading | ZeRO Stage 3 offloads optimizer states to CPU |
| Disk | Sufficient for model checkpoints | Checkpoints are saved every 100 steps, retaining up to 100 in total |
| Python | >= 3.8 | Required by the bipia package |
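To size the "sufficient for CPU offloading" RAM requirement, a common rule of thumb is that offloaded AdamW state under ZeRO-3 needs roughly 12 bytes per parameter (fp32 master weights plus momentum and variance). This estimate is not taken from the repo and actual usage varies with DeepSpeed version and buffer settings:

```python
def cpu_ram_for_optimizer_offload(num_params, bytes_per_param=12):
    """Rough estimate (rule of thumb, not measured) of CPU RAM in GB
    consumed by ZeRO-3 offloaded AdamW state: fp32 master weights +
    momentum + variance ~= 12 bytes per parameter."""
    return num_params * bytes_per_param / 1e9  # GB

print(round(cpu_ram_for_optimizer_offload(7e9)))  # ~84 GB for a 7B-parameter model
```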
Dependencies
Python Packages
- `deepspeed` >= 0.9.5
- `torch` >= 2.0.1
- `transformers` >= 4.34.0
- `accelerate` >= 0.15.0
- `peft` (any version)
- `wandb` (for experiment tracking)
- `datasets` >= 2.8.0
- `jsonlines` (any version)
Credentials
- `WANDB_PROJECT`: Weights & Biases project name (configured via TrainingArguments or environment variable).
- `WANDB_RUN`: Weights & Biases run name (configured via TrainingArguments or environment variable).
- `auth_token`: HuggingFace token for gated models (configured in YAML config file).
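When configuring the WandB credentials via environment variables rather than TrainingArguments, the setup is a one-liner per variable. The project and run names below are hypothetical examples, not values from the repo:

```python
import os

# Example values only; WANDB_PROJECT / WANDB_RUN are read by the
# Trainer's wandb integration when experiment tracking is enabled.
os.environ["WANDB_PROJECT"] = "bipia-white-box-defense"  # project name (example)
os.environ["WANDB_RUN"] = "vicuna-7b-special-token"      # run name (example)
```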
Quick Install
# Install bipia with all dependencies
pip install .
# Verify DeepSpeed installation
ds_report
Code Evidence
DeepSpeed ZeRO Stage 3 configuration from `defense/white_box/ds_config.json:1-18`:
{
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"stage3_gather_16bit_weights_on_model_save": true
},
"fp16": {
"enabled": "auto",
"auto_cast": true
}
}
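The `"auto"` placeholders in this config are resolved by the HuggingFace Trainer from its own TrainingArguments at launch. A stdlib-only sanity check over the config above might look like the following (the assertions simply restate what the JSON declares):

```python
import json

# The JSON below is copied from defense/white_box/ds_config.json.
ds_config = json.loads("""
{
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": true},
        "overlap_comm": true,
        "contiguous_gradients": true,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "fp16": {"enabled": "auto", "auto_cast": true}
}
""")

assert ds_config["zero_optimization"]["stage"] == 3
assert ds_config["zero_optimization"]["offload_optimizer"]["device"] == "cpu"

# Top-level "auto" fields that the HF Trainer fills in at launch:
auto_fields = [k for k, v in ds_config.items() if v == "auto"]
print(auto_fields)
# ['train_batch_size', 'train_micro_batch_size_per_gpu', 'gradient_accumulation_steps']
```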
Training arguments with gradient checkpointing from `defense/README.md:106-144`:
deepspeed finetune.py \
--fp16 True --fp16_opt_level O2 \
--max_steps 1000 \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--learning_rate 2e-5 \
--warmup_ratio 0.03 \
--lr_scheduler_type cosine \
--model_max_length 2048 \
--gradient_checkpointing True
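These flags, combined with the 8-GPU setup, fix the global batch size that the "auto" fields in the DeepSpeed config resolve to (micro batch × accumulation steps × world size):

```python
# Effective (global) batch size implied by the flags above on 8x V100:
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 8  # world size in the tested configuration

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_gpus)
print(effective_batch_size)  # 128 samples per optimizer step
```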
use_cache disabled for gradient checkpointing from `defense/white_box/finetune.py:516`:
model.config.use_cache = False
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `CUDA out of memory` during training | Batch size too large for available VRAM | Reduce `per_device_train_batch_size` or increase `gradient_accumulation_steps` |
| `RuntimeError: expected scalar type Half but found Float` | fp16 dtype mismatch | Ensure `--fp16 True --fp16_opt_level O2` is set |
| `ValueError: Invalid model_structure` | Wrong model_structure argument | Must be `special_token` (the only supported value) |
| Checkpoint resume failure | No checkpoint-* directories found | Training starts from scratch if no checkpoints exist; this is expected behavior |
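The checkpoint-resume behavior in the last row can be sketched with HF-Trainer-style detection logic: pick the `checkpoint-<step>` directory with the highest step, or return `None` to start from scratch. This is an illustrative stand-in, not the repo's code; the step counts in the demo are hypothetical.

```python
import os
import re
import tempfile

def latest_checkpoint(output_dir):
    """Sketch of resume detection: return the path of the checkpoint-<step>
    directory with the largest step, or None if none exist (in which case
    training starts from scratch, as noted above)."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    if os.path.isdir(output_dir):
        for name in os.listdir(output_dir):
            m = pattern.match(name)
            if m and os.path.isdir(os.path.join(output_dir, name)):
                candidates.append((int(m.group(1)), name))
    if not candidates:
        return None
    return os.path.join(output_dir, max(candidates)[1])

# Demo in a scratch directory with made-up step counts:
scratch = tempfile.mkdtemp()
assert latest_checkpoint(scratch) is None  # no checkpoints -> fresh start
for step in (100, 200):
    os.makedirs(os.path.join(scratch, f"checkpoint-{step}"))
print(os.path.basename(latest_checkpoint(scratch)))  # checkpoint-200
```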
Compatibility Notes
- DeepSpeed ZeRO Stage 3: Optimizer states are offloaded to CPU with pinned memory for efficiency. This reduces GPU memory requirements but increases CPU RAM usage.
- fp16 precision: Training uses fp16 with O2 optimization level. The model weights are gathered in 16-bit for saving (`stage3_gather_16bit_weights_on_model_save`).
- Gradient checkpointing: Enabled by default to reduce memory usage during training. Requires `use_cache=False` on the model config.
- WandB integration: Training logs are reported to Weights & Biases by default. Configure project and run names via arguments.