Principle: zai-org/CogVideo Dataset Preparation
| Principle Metadata | |
|---|---|
| Name | Dataset_Preparation |
| Category | Data_Engineering |
| Domains | Video_Generation, Fine_Tuning, Diffusion_Models |
| Knowledge Sources | CogVideo Repository, CogVideoX Paper |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Dataset Preparation is a technique for preparing video-text paired datasets for diffusion model fine-tuning by extracting frames, encoding latents, and caching embeddings.
Description
Video diffusion models require paired video-caption data preprocessed into fixed-resolution frame sequences. The dataset preparation principle covers frame extraction from raw video, normalization, resizing to model-compatible dimensions (the frame count F must satisfy the VAE temporal compression constraint that F - 1 be a multiple of 8), and optionally pre-encoding video latents and text embeddings to `.safetensors` cache files for training efficiency.
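The frame-count constraint can be validated up front, before any decoding work is done. A minimal sketch (the helper names are our own, not part of the CogVideo codebase):

```python
def check_frame_count(num_frames: int) -> bool:
    """True if the clip length is compatible with CogVideoX,
    i.e. F >= 9 and (F - 1) is a multiple of 8 (F = 9, 17, 25, ...)."""
    return num_frames >= 9 and (num_frames - 1) % 8 == 0

def nearest_valid_frame_count(num_frames: int) -> int:
    """Round down to the largest valid frame count <= num_frames
    (floored at the minimum of 9 frames)."""
    return max(9, ((num_frames - 1) // 8) * 8 + 1)
```

A loader can call `nearest_valid_frame_count` to trim clips whose raw length is not of the form 8k + 1, rather than rejecting them outright.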
The preparation pipeline consists of several stages:
- Frame Extraction: Raw video files (`.mp4`) are decoded into individual frames using libraries such as `decord` or `cv2`.
- Resizing and Normalization: Frames are resized to model-compatible dimensions (e.g., 480x720 for CogVideoX-5B) and normalized to the [-1, 1] range.
- Temporal Constraint Enforcement: The number of frames must satisfy `(F - 1) % 8 == 0` to be compatible with CogVideoX's temporal compression scheme.
- Latent Pre-encoding: Video frames are optionally passed through the VAE encoder to produce latent representations, which are cached as `.safetensors` files.
- Text Embedding Caching: Captions are passed through the T5 text encoder and the resulting embeddings are cached alongside the video latents.
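The resize-and-normalize stage can be sketched as follows. Decoding with `decord`/`cv2` is omitted, and a nearest-neighbor resize stands in for the antialiased interpolation a real pipeline (e.g., torchvision) would use; the function name is our own:

```python
import numpy as np

def preprocess_frames(frames: np.ndarray, out_h: int = 480, out_w: int = 720) -> np.ndarray:
    """Resize uint8 frames of shape (T, H, W, C) to (T, out_h, out_w, C)
    and normalize pixel values to float32 in [-1, 1]."""
    t, h, w, c = frames.shape
    ys = np.arange(out_h) * h // out_h   # source row for each output row
    xs = np.arange(out_w) * w // out_w   # source column for each output column
    resized = frames[:, ys][:, :, xs]    # nearest-neighbor gather
    # Map [0, 255] -> [-1, 1]: 0 -> -1.0, 255 -> 1.0
    return resized.astype(np.float32) / 127.5 - 1.0
```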
Usage
Use when preparing custom video datasets for CogVideoX fine-tuning. This principle is required before any training workflow can begin. It applies equally to text-to-video (T2V) and image-to-video (I2V) fine-tuning scenarios.
Typical workflow:
- Organize raw video files and their corresponding captions into a data directory.
- Create metadata files (`videos.txt` and `prompts.txt`) listing the video filenames and captions.
- Run the dataset loader to validate dimensions, extract frames, and optionally pre-encode latents.
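Loading the two metadata files reduces to pairing lines by index; a minimal sketch (assuming one caption per line, matching `videos.txt` line for line):

```python
from pathlib import Path

def load_metadata(data_dir: str) -> list[tuple[str, str]]:
    """Pair each line of videos.txt with the matching line of prompts.txt.

    Raises ValueError on a length mismatch, since a silent misalignment
    would pair videos with the wrong captions.
    """
    root = Path(data_dir)
    videos = (root / "videos.txt").read_text().splitlines()
    prompts = (root / "prompts.txt").read_text().splitlines()
    if len(videos) != len(prompts):
        raise ValueError(
            f"videos.txt has {len(videos)} entries but prompts.txt has {len(prompts)}"
        )
    return list(zip(videos, prompts))
```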
Theoretical Basis
Video latent diffusion operates on compressed representations rather than raw pixel data. The CogVideoX VAE applies both spatial and temporal compression:
- Spatial compression rate: 8x (each spatial dimension is reduced by a factor of 8)
- Temporal compression: The effective temporal compression factor is `spatial_compression_rate / patch_t = 4 / 2 = 2` per the CogVideoX architecture, but the internal structure requires that `(F - 1) % 8 == 0`, where F is the number of input frames.
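The resulting latent shape can be computed from these factors. The sketch below assumes the commonly cited 8x spatial and 4x temporal VAE compression for CogVideoX (so, e.g., 49 input frames yield 13 latent frames); the exact factors should be confirmed against the model config:

```python
def latent_shape(num_frames: int, height: int, width: int) -> tuple[int, int, int]:
    """Latent (T, H, W) dimensions after VAE encoding, assuming 8x spatial
    compression and 4x temporal compression applied to the F - 1 trailing
    frames (the first frame is kept as an anchor)."""
    assert (num_frames - 1) % 8 == 0, "frame count must satisfy (F - 1) % 8 == 0"
    latent_t = (num_frames - 1) // 4 + 1   # e.g. 49 -> 13
    return latent_t, height // 8, width // 8
```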
Pre-encoding video latents and text embeddings saves redundant VAE and T5 forward passes during training. Since the VAE and text encoder weights are frozen during LoRA fine-tuning, their outputs are deterministic and can be safely cached. This reduces per-step training time significantly, especially for large video resolutions.
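Because the frozen encoders are deterministic, each sample's latents and text embedding can be written once and reloaded every epoch. The sketch below uses NumPy's `.npz` format as a dependency-free stand-in for `.safetensors` (a real pipeline would use `safetensors` save/load instead), and the function name is our own:

```python
import numpy as np
from pathlib import Path

def cache_sample(cache_dir: str, name: str, latents: np.ndarray, text_emb: np.ndarray) -> Path:
    """Persist pre-encoded VAE latents and T5 text embeddings so training
    steps can skip the frozen encoder forward passes entirely."""
    path = Path(cache_dir) / f"{name}.npz"
    np.savez(path, latents=latents, text_emb=text_emb)
    return path
```

At training time the dataset's `__getitem__` then reduces to a single file read plus the usual noise sampling.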
Valid frame counts for CogVideoX include: 9, 17, 25, 33, 41, 49, etc. (following the formula F = 8k + 1 for positive integer k).