Implementation:Zai org CogVideo Accelerator Setup

From Leeroopedia


Implementation Metadata
Name: Accelerator_Setup
Type: Wrapper Doc
Category: Infrastructure
Domains: Fine_Tuning, Diffusion_Models
Knowledge Sources: CogVideo Repository, HuggingFace Accelerate Documentation
Last Updated: 2026-02-10 00:00 GMT

Overview

Accelerator_Setup is a concrete tool, provided via the accelerate library, for configuring HuggingFace Accelerate for distributed CogVideoX training.

Description

This implementation wraps HuggingFace's Accelerator class to handle distributed training orchestration for CogVideoX fine-tuning. It configures DDP (Distributed Data Parallel) or DeepSpeed ZeRO, sets up mixed precision, gradient accumulation, and process-group initialization with the NCCL backend, and prepares all training components (model, optimizer, dataloader, scheduler) for distributed execution. The implementation resides in the base Trainer class and is called during trainer initialization.

Usage

Use when setting up the distributed training environment for CogVideoX fine-tuning. The Accelerator is initialized during the trainer's _init_distributed method and used throughout training for device placement, gradient synchronization, and checkpoint management.
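
For orientation, a minimal, hedged sketch of the per-step roles mentioned above (device placement and main-process guards). In the real trainer the Accelerator comes from _init_distributed rather than the bare constructor call used here only to keep the snippet self-contained:

import torch
from accelerate import Accelerator

accelerator = Accelerator()  # stand-in for the instance built in _init_distributed

# Device placement: each process gets its own device from the Accelerator.
batch = torch.randn(2, 8).to(accelerator.device)

# Only the main process should write logs, samples, or checkpoint metadata.
if accelerator.is_main_process:
    print(f"Running on {accelerator.num_processes} process(es)")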

Code Reference

Source Location

  • finetune/trainer.py:L89-113 -- _init_distributed method
  • finetune/trainer.py:L332-347 -- prepare_for_training method

Signature

Accelerator initialization (in _init_distributed):

accelerator = Accelerator(
    project_config=ProjectConfiguration(
        project_dir=output_dir,
        logging_dir=logging_dir,
    ),
    gradient_accumulation_steps=args.gradient_accumulation_steps,
    mixed_precision=args.mixed_precision,
    log_with=report_to,
    kwargs_handlers=[
        DistributedDataParallelKwargs(find_unused_parameters=True),
        InitProcessGroupKwargs(
            backend="nccl",
            timeout=timedelta(seconds=args.nccl_timeout),
        ),
    ],
)
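
Whether this call ends up running plain DDP or DeepSpeed ZeRO is decided by the accelerate launch configuration, not by the arguments above. A hedged way to inspect the active backend at runtime (attribute names as in current accelerate releases):

# None under plain DDP; a DeepSpeedPlugin instance when launched with DeepSpeed.
ds_plugin = accelerator.state.deepspeed_plugin
if ds_plugin is not None:
    print("DeepSpeed ZeRO stage:", ds_plugin.zero_stage)
else:
    print("Distributed type:", accelerator.distributed_type)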

Component preparation (in prepare_for_training):

transformer, optimizer, data_loader, lr_scheduler = accelerator.prepare(
    transformer, optimizer, data_loader, lr_scheduler
)

Import

from accelerate import Accelerator
from accelerate.utils import ProjectConfiguration, DistributedDataParallelKwargs, InitProcessGroupKwargs

Key Parameters

  • gradient_accumulation_steps (int, default 1) -- Number of micro-batches to accumulate before each optimizer step.
  • mixed_precision (str, from args) -- Precision mode: "no", "fp16", or "bf16" (see the sketch after this list).
  • nccl_timeout (int, default 1800) -- NCCL process group timeout in seconds.
  • find_unused_parameters (bool, set to True) -- DDP flag to tolerate parameters unused in the forward pass.
  • log_with (str, from args) -- Logging integration for training metrics, e.g. "wandb" or "tensorboard".
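
A hedged sketch of how two of these parameters typically surface at runtime; the weight-dtype mapping is a common Accelerate pattern, not a verbatim excerpt from the CogVideo trainer:

import torch
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4, mixed_precision="bf16")

# The resolved precision mode is readable back from the Accelerator and is
# commonly used to pick the dtype for frozen components (VAE, text encoder).
if accelerator.mixed_precision == "bf16":
    weight_dtype = torch.bfloat16
elif accelerator.mixed_precision == "fp16":
    weight_dtype = torch.float16
else:
    weight_dtype = torch.float32

# The effective batch size grows with both world size and accumulation steps.
per_device_batch_size = 2  # illustrative
effective_batch = (
    per_device_batch_size
    * accelerator.num_processes
    * accelerator.gradient_accumulation_steps
)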

External Dependencies

  • accelerate -- HuggingFace Accelerate for distributed training
  • torch.distributed -- PyTorch distributed primitives (used via NCCL backend)
  • deepspeed -- (optional) DeepSpeed ZeRO optimization

I/O Contract

Inputs

  • Model components (transformer, optimizer, data_loader, lr_scheduler) -- Unwrapped PyTorch model, optimizer, dataloader, and learning rate scheduler.
  • Accelerate config (YAML file or environment variables) -- Configuration for the distributed backend, number of GPUs, and DeepSpeed ZeRO stage; an illustrative config follows this list.
  • Training args (Args instance) -- Gradient accumulation steps, mixed precision mode, and NCCL timeout.
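
An illustrative accelerate config for the second input above, selecting DeepSpeed ZeRO Stage 2 on a single 8-GPU machine. The values are examples only; a real file is normally generated interactively with the accelerate config command:

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2
  gradient_accumulation_steps: 4
  offload_optimizer_device: none
mixed_precision: bf16
num_machines: 1
machine_rank: 0
num_processes: 8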

Outputs

  • Wrapped components (model, optimizer, dataloader, lr_scheduler) -- Accelerator-wrapped components ready for distributed training with automatic gradient synchronization and device placement.
  • Accelerator object (Accelerator instance) -- Provides backward(), save_state(), gather(), and other distributed operations; a short usage sketch follows this list.
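
A hedged sketch of the downstream operations listed above; accelerator and transformer refer to the instances created elsewhere on this page, and the checkpoint path is illustrative:

import torch

# Coordinated checkpointing: every process participates; saved state covers
# model, optimizer, scheduler, and RNG.
accelerator.wait_for_everyone()
accelerator.save_state("/output/my_run/checkpoints/step_1000")

# Gather per-process tensors (e.g. losses) to log an aggregate value.
local_loss = torch.tensor(0.123, device=accelerator.device)
mean_loss = accelerator.gather(local_loss).mean()

# Strip the DDP/DeepSpeed wrapper before exporting or saving raw weights.
unwrapped = accelerator.unwrap_model(transformer)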

Usage Examples

Basic Accelerator Initialization

from accelerate import Accelerator
from accelerate.utils import (
    ProjectConfiguration,
    DistributedDataParallelKwargs,
    InitProcessGroupKwargs,
)
from datetime import timedelta

accelerator = Accelerator(
    project_config=ProjectConfiguration(
        project_dir="/output/my_run",
        logging_dir="/output/my_run/logs",
    ),
    gradient_accumulation_steps=4,
    mixed_precision="bf16",
    log_with="wandb",
    kwargs_handlers=[
        DistributedDataParallelKwargs(find_unused_parameters=True),
        InitProcessGroupKwargs(backend="nccl", timeout=timedelta(seconds=1800)),
    ],
)

Preparing Components for Distributed Training

# After model and optimizer creation
transformer, optimizer, data_loader, lr_scheduler = accelerator.prepare(
    transformer, optimizer, data_loader, lr_scheduler
)

# Training loop uses accelerator for backward pass
with accelerator.accumulate(transformer):
    loss = compute_loss(batch)
    accelerator.backward(loss)
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
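
If gradient clipping is used, it is commonly guarded by accelerator.sync_gradients so it only runs on steps that perform a real synchronized update; a hedged addition to the loop above (the max_norm value is illustrative):

# Inside the accumulate block, just before optimizer.step():
if accelerator.sync_gradients:
    accelerator.clip_grad_norm_(transformer.parameters(), max_norm=1.0)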

Launching Multi-GPU Training

# Using accelerate launch for multi-GPU
accelerate launch --num_processes 8 --mixed_precision bf16 train.py

# Using DeepSpeed ZeRO Stage 2
accelerate launch --use_deepspeed --zero_stage 2 --num_processes 8 train.py
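
Launch settings can also come from a saved accelerate config file (for example, one shaped like the illustrative YAML in the I/O Contract section) instead of individual flags:

# Using a saved accelerate config file
accelerate launch --config_file accelerate_config.yaml train.py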
