Implementation: Zai org CogVideo Accelerator Setup
| Implementation Metadata | |
|---|---|
| Name | Accelerator_Setup |
| Type | Wrapper Doc |
| Category | Infrastructure |
| Domains | Fine_Tuning, Diffusion_Models |
| Knowledge Sources | CogVideo Repository, HuggingFace Accelerate Documentation |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Accelerator_Setup is a concrete tool, built on the HuggingFace accelerate library, for configuring distributed CogVideoX training.
Description
This implementation wraps HuggingFace's Accelerator class to handle distributed training orchestration for CogVideoX fine-tuning. It configures DDP (Distributed Data Parallel) or DeepSpeed ZeRO, sets up mixed precision, gradient accumulation, process group initialization with NCCL backend, and prepares all training components (model, optimizer, dataloader, scheduler) for distributed execution. The implementation resides in the base Trainer class and is called during trainer initialization.
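Downstream code commonly consumes the configured precision mode by mapping it to a torch dtype for casting frozen components. A minimal sketch of that pattern, assuming an already-initialized accelerator (the weight_dtype name is illustrative, not taken from the repository):
import torch
# Map the Accelerator's mixed-precision mode to a torch dtype (illustrative pattern,
# e.g. for casting frozen components such as the VAE or text encoder).
weight_dtype = torch.float32
if accelerator.mixed_precision == "fp16":
    weight_dtype = torch.float16
elif accelerator.mixed_precision == "bf16":
    weight_dtype = torch.bfloat16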
Usage
Use when setting up the distributed training environment for CogVideoX fine-tuning. The Accelerator is initialized during the trainer's _init_distributed method and used throughout training for device placement, gradient synchronization, and checkpoint management.
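For checkpoint management, the Accelerator can serialize and restore the full training state (model, optimizer, scheduler, RNG states). A hedged sketch using standard Accelerate calls; the checkpoint path is illustrative:
# Save the complete training state; Accelerate coordinates which ranks write to disk.
accelerator.save_state("/output/my_run/checkpoint-1000")  # path is illustrative
# Resume later by loading the same state back into the prepared components.
accelerator.load_state("/output/my_run/checkpoint-1000")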
Code Reference
Source Location
finetune/trainer.py:L89-113 -- _init_distributed method
finetune/trainer.py:L332-347 -- prepare_for_training method
Signature
Accelerator initialization (in _init_distributed):
accelerator = Accelerator(
project_config=ProjectConfiguration(
project_dir=output_dir,
logging_dir=logging_dir,
),
gradient_accumulation_steps=args.gradient_accumulation_steps,
mixed_precision=args.mixed_precision,
log_with=report_to,
kwargs_handlers=[
DistributedDataParallelKwargs(find_unused_parameters=True),
InitProcessGroupKwargs(
backend="nccl",
timeout=timedelta(seconds=args.nccl_timeout),
),
],
)
Component preparation (in prepare_for_training):
transformer, optimizer, data_loader, lr_scheduler = accelerator.prepare(
transformer, optimizer, data_loader, lr_scheduler
)
Import
from accelerate import Accelerator
from accelerate.utils import ProjectConfiguration, DistributedDataParallelKwargs, InitProcessGroupKwargs
Key Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| gradient_accumulation_steps | int | 1 | Number of micro-batches to accumulate before an optimizer step. |
| mixed_precision | str | from args | Precision mode: "no", "fp16", or "bf16". |
| nccl_timeout | int | 1800 | NCCL process group timeout in seconds. |
| find_unused_parameters | bool | True | DDP flag to handle unused parameters in the forward pass. |
| log_with | str | "wandb" or "tensorboard" | Logging integration for training metrics. |
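These parameters are typically exposed as CLI arguments on the training script. A hypothetical argparse sketch for illustration only (the flag names and defaults are assumptions, not the repository's exact Args interface):
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--gradient_accumulation_steps", type=int, default=1)
parser.add_argument("--mixed_precision", type=str, default="bf16", choices=["no", "fp16", "bf16"])
parser.add_argument("--nccl_timeout", type=int, default=1800)
parser.add_argument("--report_to", type=str, default="wandb")
args = parser.parse_args()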
External Dependencies
accelerate -- HuggingFace Accelerate for distributed training
torch.distributed -- PyTorch distributed primitives (used via the NCCL backend)
deepspeed -- (optional) DeepSpeed ZeRO optimization
External Documentation
HuggingFace Accelerate documentation: https://huggingface.co/docs/accelerate
I/O Contract
Inputs
| Input | Format | Description |
|---|---|---|
| Model components | transformer, optimizer, data_loader, lr_scheduler | Unwrapped PyTorch model, optimizer, dataloader, and learning rate scheduler. |
| Accelerate config | YAML file or environment variables | Configuration for the distributed backend, number of GPUs, and DeepSpeed ZeRO stage. |
| Training args | Args instance | Gradient accumulation steps, mixed precision mode, NCCL timeout. |
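Once initialized, the resolved distributed configuration can be inspected directly on the Accelerator, which is useful for logging or rank-dependent logic. A minimal sketch:
# Inspect the resolved distributed setup after initialization.
print(f"processes: {accelerator.num_processes}")
print(f"process index: {accelerator.process_index}")
print(f"device: {accelerator.device}")
print(f"mixed precision: {accelerator.mixed_precision}")
# Guard writes (logs, checkpoints) so only the main process performs them.
if accelerator.is_main_process:
    print("main process")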
Outputs
| Output | Format | Description |
|---|---|---|
| Wrapped components | Accelerator-wrapped model, optimizer, dataloader, lr_scheduler | Components ready for distributed training with automatic gradient sync and device placement. |
| Accelerator object | Accelerator instance | Provides methods for backward(), save_state(), gather(), and other distributed operations. |
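gather() is typically used to aggregate per-process metrics before logging, so reported values reflect the whole distributed batch. A hedged sketch (loss and global_step are assumed to exist in the training loop, and trackers are assumed to have been initialized via accelerator.init_trackers):
# Gather per-process loss tensors onto every rank and average them.
avg_loss = accelerator.gather(loss.detach()).mean()
if accelerator.is_main_process:
    accelerator.log({"train_loss": avg_loss.item()}, step=global_step)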
Usage Examples
Basic Accelerator Initialization
from accelerate import Accelerator
from accelerate.utils import (
ProjectConfiguration,
DistributedDataParallelKwargs,
InitProcessGroupKwargs,
)
from datetime import timedelta
accelerator = Accelerator(
project_config=ProjectConfiguration(
project_dir="/output/my_run",
logging_dir="/output/my_run/logs",
),
gradient_accumulation_steps=4,
mixed_precision="bf16",
log_with="wandb",
kwargs_handlers=[
DistributedDataParallelKwargs(find_unused_parameters=True),
InitProcessGroupKwargs(backend="nccl", timeout=timedelta(seconds=1800)),
],
)
Preparing Components for Distributed Training
# After model and optimizer creation
transformer, optimizer, data_loader, lr_scheduler = accelerator.prepare(
transformer, optimizer, data_loader, lr_scheduler
)
# Training loop uses accelerator for backward pass
with accelerator.accumulate(transformer):
loss = compute_loss(batch)
accelerator.backward(loss)
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
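If gradient clipping is needed, it should be routed through the Accelerator so it only runs on synchronized steps and correctly unscales fp16 gradients. A sketch extending the loop above; the max-norm value of 1.0 is an assumption:
with accelerator.accumulate(transformer):
    loss = compute_loss(batch)
    accelerator.backward(loss)
    # Clip only on the accumulation boundary, when gradients are synchronized.
    if accelerator.sync_gradients:
        accelerator.clip_grad_norm_(transformer.parameters(), max_norm=1.0)
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()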
Launching Multi-GPU Training
# Using accelerate launch for multi-GPU
accelerate launch --num_processes 8 --mixed_precision bf16 train.py
# Using DeepSpeed ZeRO Stage 2
accelerate launch --use_deepspeed --deepspeed_config_file ds_config.json train.py