Principle: Zai_org CogVideo Diffusion Training Loop
| Principle Metadata | |
|---|---|
| Name | Diffusion_Training_Loop |
| Category | Training |
| Domains | Video_Generation, Fine_Tuning, Diffusion_Models |
| Knowledge Sources | CogVideo Repository, CogVideoX Paper |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Diffusion Training Loop is a technique for training video diffusion models by predicting and removing noise from randomly corrupted video latents conditioned on text embeddings.
Description
The diffusion training loop implements the denoising score matching objective. For each training batch, the loop performs the following steps:
- Sample a random timestep `t` uniformly from the noise schedule.
- Add noise to the clean video latents `x_0` according to the schedule at timestep `t`, producing noisy latents `x_t`.
- Predict the noise (or velocity) using the CogVideoX transformer, conditioned on text embeddings and the timestep.
- Compute the loss as a weighted mean squared error between the prediction and the target.
- Backpropagate and update LoRA adapter weights via the optimizer.
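The steps above can be sketched as a single training step. This is a minimal NumPy illustration, not the actual CogVideoX code: the transformer is replaced by a stand-in `predict_v` callable, the schedule is a generic DDPM-style linear-beta schedule, and the backward/optimizer step is only indicated in a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy schedule: cumulative alphas for 1000 timesteps (linear betas, an assumption).
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

def training_step(x0, predict_v):
    """One denoising-score-matching step on a batch of clean latents x0."""
    b = x0.shape[0]
    # 1. Sample a random timestep per example.
    t = rng.integers(0, len(alpha_bar), size=b)
    a = np.sqrt(alpha_bar[t]).reshape(b, 1)        # alpha_t
    s = np.sqrt(1.0 - alpha_bar[t]).reshape(b, 1)  # sigma_t
    # 2. Corrupt the latents: x_t = alpha_t * x_0 + sigma_t * eps.
    eps = rng.standard_normal(x0.shape)
    x_t = a * x0 + s * eps
    # 3. Predict the velocity (stand-in for the conditioned CogVideoX transformer).
    v_pred = predict_v(x_t, t)
    # 4. Weighted MSE against the v-target, with w(t) = 1 / (1 - alpha_bar_t).
    v_target = a * eps - s * x0
    w = (1.0 / (1.0 - alpha_bar[t])).reshape(b, 1)
    loss = np.mean(w * (v_pred - v_target) ** 2)
    # 5. In the real loop: accelerator.backward(loss); optimizer.step(); etc.
    return loss

x0 = rng.standard_normal((4, 16))  # batch of flattened latents
loss = training_step(x0, lambda x_t, t: np.zeros_like(x_t))
```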
The loss uses SNR-based weighting (1 / (1 - alpha_cumprod)) to balance contributions across timesteps. Rotary positional embeddings provide 3D spatio-temporal position information to the transformer, enabling it to understand the spatial and temporal structure of the video.
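The rotary-embedding idea can be illustrated in one dimension (a sketch of the mechanism, not CogVideoX's actual 3D spatio-temporal implementation): each channel pair is rotated by an angle proportional to its position, so attention dot products depend only on relative offsets.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate channel pairs of x by position-dependent angles (1D RoPE sketch)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per channel pair
    theta = pos * freqs                        # rotation angles for this position
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin       # 2D rotation applied per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.default_rng(1).standard_normal(8)
q_rot = rope_1d(q, pos=5)  # rotation preserves the vector norm
```

Because each pair is rotated rigidly, `dot(rope_1d(q, m), rope_1d(k, n))` depends only on `n - m`, which is the relative-position property the transformer exploits.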
Usage
Use when fine-tuning CogVideoX models on custom video-text datasets. The training loop handles gradient accumulation, mixed precision, and gradient clipping automatically through the Accelerator integration.
Theoretical Basis
Denoising Diffusion Objective
The core training objective is:
- L = E[w(t) * ||f_theta(x_t, t, c) - target||^2]
where:
- `x_t = alpha_t * x_0 + sigma_t * epsilon` is the noisy latent at timestep `t`
- `f_theta` is the CogVideoX transformer (with LoRA adapters)
- `c` is the text conditioning (T5 embeddings)
- `w(t) = 1 / (1 - alpha_bar_t)` is the SNR (signal-to-noise ratio) weighting
V-Prediction Parameterization
CogVideoX uses velocity prediction (v-prediction) rather than epsilon prediction:
- v = alpha_t * epsilon - sigma_t * x_0
This parameterization provides better training stability and sample quality at both high and low noise levels compared to the standard epsilon-prediction used in DDPM.
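Since `alpha_t^2 + sigma_t^2 = 1` in a variance-preserving schedule, both `epsilon` and `x_0` can be recovered exactly from a `v` prediction, which is part of why the parameterization stays well-behaved at both noise extremes. A quick NumPy check of these identities:

```python
import numpy as np

rng = np.random.default_rng(0)
x0, eps = rng.standard_normal(16), rng.standard_normal(16)

alpha_bar = 0.7                       # arbitrary point on a VP schedule
alpha, sigma = np.sqrt(alpha_bar), np.sqrt(1.0 - alpha_bar)

x_t = alpha * x0 + sigma * eps        # forward corruption
v = alpha * eps - sigma * x0          # v-prediction target

# Both quantities are recoverable from (x_t, v) alone:
eps_rec = alpha * v + sigma * x_t     # equals eps
x0_rec = alpha * x_t - sigma * v      # equals x_0
```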
SNR Weighting
The signal-to-noise ratio weighting ensures that the model allocates appropriate capacity to all noise levels:
- At low timesteps (low noise): `alpha_bar_t` is close to 1, so the SNR is high and the weight `w(t) = 1 / (1 - alpha_bar_t)` is large. The model focuses on fine details.
- At high timesteps (high noise): `alpha_bar_t` is close to 0, so the SNR is low and the weight approaches 1. The model focuses on global structure.
This weighting keeps the loss contributions balanced across noise levels, so the model neither over-fits the easy (low-noise) denoising tasks nor under-fits the harder (high-noise) ones.
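The shape of `w(t)` is easy to verify numerically. A quick check over a DDPM-style linear-beta schedule (an assumption for illustration; the exact CogVideoX schedule may differ), showing that `w(t)` equals `SNR + 1`, is largest where `alpha_bar_t` is near 1, and decays toward 1 at high noise:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)  # linear beta schedule (assumption)
alpha_bar = np.cumprod(1.0 - betas)

snr = alpha_bar / (1.0 - alpha_bar)    # signal-to-noise ratio per timestep
w = 1.0 / (1.0 - alpha_bar)            # loss weight; algebraically snr + 1

# w is largest at early (low-noise) timesteps and approaches 1 at high noise.
```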
Gradient Management
- Gradient clipping: Applied with `max_grad_norm=1.0` to prevent exploding gradients.
- Gradient accumulation: Effective batch size = `per_gpu_batch * num_gpus * accumulation_steps`.
- Mixed precision: Forward and backward passes run in bf16/fp16; the optimizer step runs in fp32.
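Global-norm clipping rescales all gradients together rather than clipping each tensor independently. A NumPy sketch of the idea (mirroring what `accelerator.clip_grad_norm_` does internally; example numbers, not the trainer's defaults):

```python
import numpy as np

def clip_global_norm(grads, max_norm=1.0):
    """Scale all gradients down together if their global L2 norm exceeds max_norm."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads], total

grads = [np.full(10, 2.0), np.full(5, -3.0)]   # pretend per-parameter gradients
clipped, norm_before = clip_global_norm(grads, max_norm=1.0)

# Effective batch size under accumulation (illustrative values):
per_gpu_batch, num_gpus, accumulation_steps = 2, 4, 8
effective_batch = per_gpu_batch * num_gpus * accumulation_steps
```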
Related Pages
- Implementation:Zai_org_CogVideo_Trainer_Train
- Principle:Zai_org_CogVideo_Model_Loading_and_LoRA_Injection
- Principle:Zai_org_CogVideo_Distributed_Training_Setup
- Principle:Zai_org_CogVideo_Checkpointing_and_Validation
- Heuristic:Zai_org_CogVideo_Memory_Optimization_Strategies
- Heuristic:Zai_org_CogVideo_Training_Hyperparameter_Defaults