
Principle:Zai org CogVideo Diffusion Training Loop

From Leeroopedia


Principle Metadata
Name Diffusion_Training_Loop
Category Training
Domains Video_Generation, Fine_Tuning, Diffusion_Models
Knowledge Sources CogVideo Repository, CogVideoX Paper
Last Updated 2026-02-10 00:00 GMT

Overview

Diffusion Training Loop is a technique for training video diffusion models in which the model learns to predict and remove the noise added to randomly corrupted video latents, conditioned on text embeddings.

Description

The diffusion training loop implements the denoising score matching objective. For each training batch, the loop performs the following steps:

  1. Sample a random timestep t uniformly from the noise schedule.
  2. Add noise to the clean video latents x_0 according to the schedule at timestep t, producing noisy latents x_t.
  3. Predict the noise (or velocity) using the CogVideoX transformer, conditioned on text embeddings and the timestep.
  4. Compute the loss as a weighted mean squared error between the prediction and the target.
  5. Backpropagate and update LoRA adapter weights via the optimizer.
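The five steps above can be sketched numerically. The following is a minimal NumPy illustration of steps 1 through 4, assuming a variance-preserving noise schedule with toy linear betas; a stand-in prediction replaces the CogVideoX transformer call, and step 5 is omitted because it requires autograd and the real optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy noise schedule (assumption): alpha_bar_t decreases from ~1 toward 0.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

# Step 1: sample a random timestep per batch element.
batch, latent_dim = 4, 16
t = rng.integers(0, T, size=batch)

# Step 2: corrupt the clean latents x_0 into x_t per the schedule at t.
x0 = rng.standard_normal((batch, latent_dim))
eps = rng.standard_normal((batch, latent_dim))
a = np.sqrt(alpha_bar[t])[:, None]        # alpha_t
s = np.sqrt(1.0 - alpha_bar[t])[:, None]  # sigma_t
x_t = a * x0 + s * eps

# Step 3: the v-prediction target; a real run calls the transformer here,
# conditioned on text embeddings and the timestep.
v_target = a * eps - s * x0
v_pred = v_target + 0.1 * rng.standard_normal(v_target.shape)  # stand-in

# Step 4: SNR-weighted mean squared error with w(t) = 1 / (1 - alpha_bar_t).
w = 1.0 / (1.0 - alpha_bar[t])
loss = np.mean(w * np.mean((v_pred - v_target) ** 2, axis=1))
```

With a perfect prediction the loss is zero; the weighting only rescales each sample's squared error by its timestep's w(t).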

The loss uses SNR-based weighting (1 / (1 - alpha_cumprod)) to balance contributions across timesteps. Rotary positional embeddings provide 3D spatio-temporal position information to the transformer, enabling it to understand the spatial and temporal structure of the video.
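To illustrate the rotary embedding idea, here is a hedged NumPy sketch, not the repository's implementation: a 1D rotary rotation applied per channel pair, extended to 3D by splitting the channel dimension across the temporal and two spatial axes. The exact channel split and frequency base used by CogVideoX may differ.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of x by position-dependent angles.

    x: (batch, dim) features with dim even; pos: scalar integer position.
    """
    dim = x.shape[-1]
    freqs = base ** (-np.arange(0, dim, 2) / dim)  # (dim/2,) frequencies
    angles = np.asarray(pos)[..., None] * freqs    # (1, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, t, h, w):
    """Sketch of a 3D RoPE: one channel group per (t, h, w) axis."""
    d = x.shape[-1] // 3
    return np.concatenate(
        [rope_1d(x[..., :d], t),
         rope_1d(x[..., d:2 * d], h),
         rope_1d(x[..., 2 * d:], w)], axis=-1)

# Example: a batch of 2 tokens, all at grid position (t=3, h=1, w=2).
tok = np.arange(24.0).reshape(2, 12)
tok_pe = rope_3d(tok, t=3, h=1, w=2)
```

Because each channel pair is rotated, the embedding preserves token norms while encoding the token's position in time and space.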

Usage

Use when fine-tuning CogVideoX models on custom video-text datasets. The training loop handles gradient accumulation, mixed precision, and gradient clipping automatically through the Accelerator integration.

Theoretical Basis

Denoising Diffusion Objective

The core training objective is:

L = E[w(t) * ||f_theta(x_t, t, c) - target||^2]

where:

  • x_t = alpha_t * x_0 + sigma_t * epsilon is the noisy latent at timestep t
  • f_theta is the CogVideoX transformer (with LoRA adapters)
  • c is the text conditioning (T5 embeddings)
  • w(t) = 1 / (1 - alpha_bar_t) is the SNR (signal-to-noise ratio) weighting

V-Prediction Parameterization

CogVideoX uses velocity prediction (v-prediction) rather than epsilon prediction:

v = alpha_t * epsilon - sigma_t * x_0

This parameterization provides better training stability and sample quality at both high and low noise levels compared to the standard epsilon-prediction used in DDPM.
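Under a variance-preserving schedule (alpha_t^2 + sigma_t^2 = 1), a v prediction determines both x_0 and epsilon through simple identities, which is one reason the parameterization is well behaved at both ends of the noise range. A small NumPy check, using illustrative values rather than the model's actual schedule:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)   # clean latent
eps = rng.standard_normal(8)  # Gaussian noise

# Variance-preserving coefficients: alpha_t^2 + sigma_t^2 = 1.
alpha_t, sigma_t = np.cos(0.3), np.sin(0.3)

x_t = alpha_t * x0 + sigma_t * eps  # noisy latent
v = alpha_t * eps - sigma_t * x0    # v-prediction target

# Given a (perfect) v prediction, both x_0 and epsilon are recoverable:
x0_rec = alpha_t * x_t - sigma_t * v
eps_rec = sigma_t * x_t + alpha_t * v
```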

SNR Weighting

The signal-to-noise ratio weighting ensures that the model allocates appropriate capacity to all noise levels:

  • At low timesteps (low noise): The SNR is high and w(t) = 1 / (1 - alpha_bar_t) is large. The small fine-detail errors are amplified so the model still learns from them.
  • At high timesteps (high noise): The SNR is low and the weighting approaches 1. The large global-structure errors contribute at their natural scale.

Because the raw squared error is much smaller at low noise levels, this weighting keeps the easy fine-detail denoising steps from being drowned out by the larger errors on the harder, high-noise steps.
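Numerically, the stated weighting satisfies w(t) = SNR(t) + 1, so it is largest at low-noise timesteps and tends to 1 at high noise. A quick check on a toy linear beta schedule (the actual CogVideoX schedule may differ):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # toy linear schedule (assumption)
alpha_bar = np.cumprod(1.0 - betas)  # decreases from ~1 toward 0

snr = alpha_bar / (1.0 - alpha_bar)  # signal-to-noise ratio at each t
w = 1.0 / (1.0 - alpha_bar)          # the stated loss weighting

# w(t) = SNR(t) + 1: it decreases monotonically as noise increases.
```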

Gradient Management

  • Gradient clipping: Applied with max_grad_norm=1.0 to prevent exploding gradients.
  • Gradient accumulation: Effective batch size = per_gpu_batch * num_gpus * accumulation_steps.
  • Mixed precision: Forward and backward passes in bf16/fp16; optimizer step in fp32.
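The clipping and accumulation arithmetic can be sketched in NumPy; `clip_global_norm` here is an illustrative stand-in for the framework's clipping utility, and the batch-size numbers are made up:

```python
import numpy as np

def clip_global_norm(grads, max_norm=1.0):
    """Scale a list of gradient arrays so their joint L2 norm <= max_norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

# Two parameter groups with a large combined gradient norm.
grads = [np.full(4, 3.0), np.full(3, 4.0)]
clipped, pre_norm = clip_global_norm(grads, max_norm=1.0)

# Effective batch size under gradient accumulation (illustrative numbers):
per_gpu_batch, num_gpus, accumulation_steps = 2, 8, 4
effective_batch = per_gpu_batch * num_gpus * accumulation_steps
```

Note that clipping uses the global norm across all parameters, not a per-tensor norm, matching the usual max_grad_norm semantics.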
