Principle: hpcaitech ColossalAI Booster Plugin Configuration
| Knowledge Sources | Value |
|---|---|
| Domains | Distributed_Computing, Optimization |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A distributed training orchestration pattern that wraps model, optimizer, dataloader, and scheduler with parallelism strategies through a plugin-based abstraction.
Description
The Booster-Plugin pattern is ColossalAI's core abstraction for distributed training. The Booster acts as a unified interface that applies a selected Plugin (parallelism strategy) to transparently handle model sharding, gradient synchronization, memory optimization, and data distribution. This decouples the training logic from the distributed infrastructure.
Available plugins include:
- TorchDDPPlugin: Standard data parallelism
- LowLevelZeroPlugin: ZeRO stages 1/2 for optimizer state and gradient partitioning
- GeminiPlugin: Heterogeneous memory management (CPU+GPU)
- HybridParallelPlugin: Combined tensor/pipeline/sequence/data parallelism (3D parallelism)
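A configuration sketch of how these plugins are typically instantiated. The constructor arguments shown (`stage`, `placement_policy`, `tp_size`, `pp_size`) reflect recent ColossalAI releases and may differ across versions; treat this as illustrative rather than a definitive API reference:

```python
# Sketch: constructing each plugin (argument names/defaults vary by version).
from colossalai.booster.plugin import (
    TorchDDPPlugin,
    LowLevelZeroPlugin,
    GeminiPlugin,
    HybridParallelPlugin,
)

ddp = TorchDDPPlugin()                               # plain data parallelism
zero = LowLevelZeroPlugin(stage=2)                   # ZeRO: stage=1 or stage=2
gemini = GeminiPlugin(placement_policy="auto")       # CPU+GPU heterogeneous memory
hybrid = HybridParallelPlugin(tp_size=2, pp_size=2)  # tensor + pipeline parallelism
```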
Usage
Use this principle whenever training a model with ColossalAI. The plugin choice depends on model size, GPU count, and memory constraints: for models that fit on a single GPU, use TorchDDP; for large models that need optimizer-state and gradient memory savings, use LowLevelZero or Gemini; for very large models that require model parallelism, use HybridParallel.
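This decision process can be sketched as a selection heuristic. The `choose_plugin` helper and its parameter-count thresholds are hypothetical, invented for illustration; real choices also depend on interconnect bandwidth, batch size, and per-GPU memory headroom:

```python
def choose_plugin(param_count: int, fits_on_one_gpu: bool, num_gpus: int) -> str:
    """Illustrative heuristic mapping model/hardware traits to a plugin name."""
    if num_gpus <= 1 or fits_on_one_gpu:
        return "TorchDDPPlugin"        # replicate the model; shard only the data
    if param_count < 10_000_000_000:   # sharding optimizer states/grads suffices
        return "LowLevelZeroPlugin"
    if param_count < 70_000_000_000:   # offload what the GPUs cannot hold
        return "GeminiPlugin"
    return "HybridParallelPlugin"      # shard the model itself (3D parallelism)

print(choose_plugin(1_300_000_000, True, 8))      # small model: DDP is enough
print(choose_plugin(175_000_000_000, False, 64))  # very large: hybrid parallelism
```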
Theoretical Basis
The Booster-Plugin pattern is an instance of the Strategy design pattern:
- Plugin Selection: Choose parallelism strategy based on hardware and model size
- Model Wrapping: The plugin wraps the model for distributed execution (e.g., sharding layers across GPUs for tensor parallelism)
- Optimizer Wrapping: The optimizer is wrapped to handle partitioned gradients and optimizer states
- DataLoader Wrapping: The dataloader is wrapped with distributed samplers
- Unified Interface: All training operations (backward, step, save) go through the Booster
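The five steps above amount to the classic Strategy pattern: the Booster is the context, and each plugin is an interchangeable strategy. A minimal, framework-free mock of that structure (all class names here are illustrative stand-ins, not ColossalAI APIs):

```python
class Plugin:
    """Strategy interface: each parallelism strategy decides how to wrap things."""
    def configure(self, model, optimizer, dataloader):
        raise NotImplementedError

class MockDDPPlugin(Plugin):
    def configure(self, model, optimizer, dataloader):
        return f"replicated({model})", f"synced({optimizer})", f"sharded({dataloader})"

class MockZeroPlugin(Plugin):
    def __init__(self, stage):
        self.stage = stage
    def configure(self, model, optimizer, dataloader):
        # Model stays whole; optimizer states (and grads at stage 2) are partitioned.
        return model, f"partitioned_stage{self.stage}({optimizer})", f"sharded({dataloader})"

class MockBooster:
    """Context: delegates all wrapping to whichever plugin was injected."""
    def __init__(self, plugin):
        self.plugin = plugin
    def boost(self, model, optimizer, dataloader):
        return self.plugin.configure(model, optimizer, dataloader)

booster = MockBooster(MockZeroPlugin(stage=2))
model, optim, loader = booster.boost("gpt", "adam", "train_loader")
print(optim)  # the optimizer is now wrapped for partitioned states
```

Swapping the plugin changes the distribution strategy without touching the training loop, which is exactly the decoupling the Description section claims.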
Key ZeRO stages:
- Stage 1: Partition optimizer states across ranks
- Stage 2: Additionally partition gradients across ranks
- Stage 3: Additionally partition model parameters across ranks (in ColossalAI, stage-3-style parameter partitioning is provided by GeminiPlugin rather than LowLevelZeroPlugin)
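To make the stages concrete, per-rank memory for mixed-precision Adam training can be estimated using the accounting from the ZeRO paper: with Ψ parameters, fp16 weights and gradients cost 2Ψ bytes each, and fp32 optimizer states (master weights, momentum, variance) cost 12Ψ bytes, with each stage partitioning one more of these buffers across N ranks:

```python
def zero_memory_per_rank_gb(params: float, n_ranks: int, stage: int) -> float:
    """Approximate per-rank memory (GB) for mixed-precision Adam under ZeRO.

    2 bytes/param for fp16 weights, 2 bytes/param for fp16 gradients,
    12 bytes/param for fp32 optimizer states (master copy + momentum + variance).
    """
    weights, grads, opt_states = 2 * params, 2 * params, 12 * params
    if stage >= 1:
        opt_states /= n_ranks   # stage 1: partition optimizer states
    if stage >= 2:
        grads /= n_ranks        # stage 2: also partition gradients
    if stage >= 3:
        weights /= n_ranks      # stage 3: also partition parameters
    return (weights + grads + opt_states) / 1e9

# A 7.5B-parameter model on 64 ranks:
for s in (0, 1, 2, 3):
    print(f"stage {s}: {zero_memory_per_rank_gb(7.5e9, 64, s):.1f} GB")
```

With no partitioning every rank holds all 16Ψ bytes (120 GB here); stage 3 shrinks that to 16Ψ/N (under 2 GB), which is why parameter partitioning is what ultimately enables models far larger than a single GPU's memory.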