Principle: zai-org/CogVideo Dataset Preparation
| Principle Metadata | |
|---|---|
| Name | Dataset_Preparation |
| Category | Data_Engineering |
| Domains | Video_Generation, Fine_Tuning, Diffusion_Models |
| Knowledge Sources | CogVideo Repository, CogVideoX Paper |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Dataset Preparation is a technique for preparing video-text paired datasets for diffusion model fine-tuning by extracting frames, encoding latents, and caching embeddings.
Description
Video diffusion models require paired video-caption data preprocessed into fixed-resolution frame sequences. The dataset preparation principle covers frame extraction from raw video, normalization, resizing to model-compatible dimensions (the frame count F must satisfy the VAE temporal compression constraint that F - 1 be a multiple of 8), and optionally pre-encoding video latents and text embeddings to `.safetensors` cache files for training efficiency.
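The frame-count constraint can be validated up front, before any decoding work is done. A minimal sketch (the helper names are our own, not part of the CogVideo codebase):

```python
def check_frame_count(num_frames: int) -> bool:
    """True if the clip length is compatible with CogVideoX,
    i.e. F >= 9 and (F - 1) is a multiple of 8 (F = 9, 17, 25, ...)."""
    return num_frames >= 9 and (num_frames - 1) % 8 == 0

def nearest_valid_frame_count(num_frames: int) -> int:
    """Round down to the largest valid frame count <= num_frames
    (floored at the minimum of 9 frames)."""
    return max(9, ((num_frames - 1) // 8) * 8 + 1)
```

A loader can call `nearest_valid_frame_count` to trim clips whose raw length is not of the form 8k + 1, rather than rejecting them outright.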
The preparation pipeline consists of several stages:
- Frame Extraction: Raw video files (`.mp4`) are decoded into individual frames using libraries such as `decord` or `cv2`.
- Resizing and Normalization: Frames are resized to model-compatible dimensions (e.g., 480x720 for CogVideoX-5B) and normalized to the [-1, 1] range.
- Temporal Constraint Enforcement: The number of frames must satisfy `(F - 1) % 8 == 0` to be compatible with CogVideoX's temporal compression scheme.
- Latent Pre-encoding: Video frames are optionally passed through the VAE encoder to produce latent representations, which are cached as `.safetensors` files.
- Text Embedding Caching: Captions are passed through the T5 text encoder and the resulting embeddings are cached alongside the video latents.
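The resize-and-normalize stage can be sketched as follows. Decoding with `decord`/`cv2` is omitted, and a nearest-neighbor resize stands in for the antialiased interpolation a real pipeline (e.g., torchvision) would use; the function name is our own:

```python
import numpy as np

def preprocess_frames(frames: np.ndarray, out_h: int = 480, out_w: int = 720) -> np.ndarray:
    """Resize uint8 frames of shape (T, H, W, C) to (T, out_h, out_w, C)
    and normalize pixel values to float32 in [-1, 1]."""
    t, h, w, c = frames.shape
    ys = np.arange(out_h) * h // out_h   # source row for each output row
    xs = np.arange(out_w) * w // out_w   # source column for each output column
    resized = frames[:, ys][:, :, xs]    # nearest-neighbor gather
    # Map [0, 255] -> [-1, 1]: 0 -> -1.0, 255 -> 1.0
    return resized.astype(np.float32) / 127.5 - 1.0
```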
Usage
Use when preparing custom video datasets for CogVideoX fine-tuning. This principle is required before any training workflow can begin. It applies equally to text-to-video (T2V) and image-to-video (I2V) fine-tuning scenarios.
Typical workflow:
- Organize raw video files and their corresponding captions into a data directory.
- Create metadata files (`videos.txt` and `prompts.txt`) listing the video filenames and captions.
- Run the dataset loader to validate dimensions, extract frames, and optionally pre-encode latents.
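Loading the two metadata files reduces to pairing lines by index; a minimal sketch (assuming one caption per line, matching `videos.txt` line for line):

```python
from pathlib import Path

def load_metadata(data_dir: str) -> list[tuple[str, str]]:
    """Pair each line of videos.txt with the matching line of prompts.txt.

    Raises ValueError on a length mismatch, since a silent misalignment
    would pair videos with the wrong captions.
    """
    root = Path(data_dir)
    videos = (root / "videos.txt").read_text().splitlines()
    prompts = (root / "prompts.txt").read_text().splitlines()
    if len(videos) != len(prompts):
        raise ValueError(
            f"videos.txt has {len(videos)} entries but prompts.txt has {len(prompts)}"
        )
    return list(zip(videos, prompts))
```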
Theoretical Basis
Video latent diffusion operates on compressed representations rather than raw pixel data. The CogVideoX VAE applies both spatial and temporal compression:
- Spatial compression rate: 8x (each spatial dimension is reduced by a factor of 8)
- Temporal compression: The effective temporal compression factor is `spatial_compression_rate / patch_t = 4 / 2 = 2` per the CogVideoX architecture, but the internal structure requires that `(F - 1) % 8 == 0`, where F is the number of input frames.
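The resulting latent shape can be computed from these factors. The sketch below assumes the commonly cited 8x spatial and 4x temporal VAE compression for CogVideoX (so, e.g., 49 input frames yield 13 latent frames); the exact factors should be confirmed against the model config:

```python
def latent_shape(num_frames: int, height: int, width: int) -> tuple[int, int, int]:
    """Latent (T, H, W) dimensions after VAE encoding, assuming 8x spatial
    compression and 4x temporal compression applied to the F - 1 trailing
    frames (the first frame is kept as an anchor)."""
    assert (num_frames - 1) % 8 == 0, "frame count must satisfy (F - 1) % 8 == 0"
    latent_t = (num_frames - 1) // 4 + 1   # e.g. 49 -> 13
    return latent_t, height // 8, width // 8
```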
Pre-encoding video latents and text embeddings saves redundant VAE and T5 forward passes during training. Since the VAE and text encoder weights are frozen during LoRA fine-tuning, their outputs are deterministic and can be safely cached. This reduces per-step training time significantly, especially for large video resolutions.
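Because the frozen encoders are deterministic, each sample's latents and text embedding can be written once and reloaded every epoch. The sketch below uses NumPy's `.npz` format as a dependency-free stand-in for `.safetensors` (a real pipeline would use `safetensors` save/load instead), and the function name is our own:

```python
import numpy as np
from pathlib import Path

def cache_sample(cache_dir: str, name: str, latents: np.ndarray, text_emb: np.ndarray) -> Path:
    """Persist pre-encoded VAE latents and T5 text embeddings so training
    steps can skip the frozen encoder forward passes entirely."""
    path = Path(cache_dir) / f"{name}.npz"
    np.savez(path, latents=latents, text_emb=text_emb)
    return path
```

At training time the dataset's `__getitem__` then reduces to a single file read plus the usual noise sampling.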
Valid frame counts for CogVideoX include: 9, 17, 25, 33, 41, 49, etc. (following the formula F = 8k + 1 for positive integer k).