Principle: zai-org CogVideo SAT Dataset Preparation
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Knowledge Sources | CogVideo |
| Domains | Data_Pipeline, Video_Processing |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Technique for preparing video datasets in WebDataset or directory format for SAT-based CogVideoX training.
Description
SAT training supports two dataset formats, each optimized for different scales and use cases. Both formats share a common video processing pipeline that loads frames using decord, samples at a target FPS, resizes to model dimensions, and normalizes pixel values to the [-1, 1] range.
WebDataset Format (VideoDataset)
WebDataset stores training data as sharded tar archives, where each sample is a set of files with the same basename but different extensions. For CogVideoX training, each sample contains:
- An `.mp4` file with the video data.
- A caption file (e.g., `.txt` or a field named `caption`) with the text description.
- Metadata fields including `duration` (float, in seconds) and `fps` (float, source video frame rate).
The WebDataset format is designed for large-scale distributed training because:
- Streaming access: Data is read sequentially from tar files without random access overhead, enabling efficient I/O on both local disk and remote storage.
- Sharded parallelism: Different workers and ranks can read different shards simultaneously without coordination.
- Shuffle buffering: An in-memory shuffle buffer (default size 1000) provides stochastic sample ordering despite sequential reads.
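The sample layout described above (same basename, different extensions, packed into tar shards) can be produced with the standard library alone. The following is a minimal sketch; the shard name, sample names, and metadata fields are illustrative, not taken from the CogVideo repo.

```python
import io
import json
import tarfile

def write_shard(shard_path, samples):
    """Pack (basename, mp4_bytes, caption, metadata) tuples into one
    WebDataset-style tar shard: members share a basename and differ
    only in extension, so a reader groups them into one sample."""
    with tarfile.open(shard_path, "w") as tar:
        for basename, mp4_bytes, caption, meta in samples:
            for ext, payload in [
                (".mp4", mp4_bytes),
                (".txt", caption.encode("utf-8")),
                (".json", json.dumps(meta).encode("utf-8")),
            ]:
                info = tarfile.TarInfo(name=basename + ext)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

# Hypothetical usage: one sample carrying the duration/fps metadata
# that CogVideoX training expects.
write_shard(
    "shard-00000.tar",
    [("clip_0001", b"<mp4 bytes>", "a cat on a sofa",
      {"duration": 6.2, "fps": 30.0})],
)
```

At training time such shards would be read sequentially (e.g., with the `webdataset` library), which is what enables the streaming and sharded-parallelism properties listed above.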
Directory Format (SFTDataset)
SFTDataset reads from a directory structure containing .mp4 video files with corresponding .txt caption files. The caption files are located by replacing .mp4 with .txt and replacing videos with labels in the path. This format is designed for supervised fine-tuning (SFT) with smaller custom datasets because:
- Simple organization: No tar archiving step required; users simply organize videos and captions in directories.
- Random access: Standard PyTorch Dataset with `__getitem__` enables random shuffling.
- Variable-length handling: Each video is individually processed with adaptive frame count computation, with padding to `max_num_frames` using last-frame repetition.
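The caption-lookup convention described above (swap the `videos` directory for `labels` and the `.mp4` extension for `.txt`) can be sketched as a one-line path mapping; the example paths are hypothetical.

```python
def caption_path(video_path: str) -> str:
    """Map a video path to its caption path the way SFTDataset does:
    replace 'videos' with 'labels' and '.mp4' with '.txt'."""
    return video_path.replace("videos", "labels").replace(".mp4", ".txt")

# e.g. dataset/videos/clip_0001.mp4 -> dataset/labels/clip_0001.txt
```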
Video Processing Pipeline
Both formats share the same core processing steps:
- Frame loading: Use decord `VideoReader` to open the video file and read frame data.
- Frame sampling: Select `num_frames` frames uniformly distributed between the start and end points, respecting the ratio between source FPS and target FPS.
- Skip frames: The `skip_frms_num` parameter excludes the first and last N frames from consideration, avoiding transition frames at clip boundaries.
- Spatial transform: Resize frames to the target resolution using bicubic interpolation, then apply center or random cropping to match the exact model input dimensions.
- Normalization: Convert pixel values from [0, 255] to [-1, 1] via `(frames - 127.5) / 127.5`.
- Padding: If the video has fewer frames than required, the last frame is repeated to reach the target frame count.
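The sampling, normalization, and padding steps above can be sketched with NumPy arrays standing in for decoded frames. This is an illustrative reconstruction of the pipeline, not the exact `data_video` code; function names and the skip default are assumptions.

```python
import numpy as np

def sample_indices(total, num_frames, src_fps, target_fps, skip=3):
    """Uniformly spread num_frames indices over the span that
    num_frames at target_fps would cover in the source video,
    staying inside the [skip, total - skip) window."""
    start, end = skip, total - skip
    span = min(end - start, int(round(num_frames * src_fps / target_fps)))
    return np.linspace(start, start + span - 1, num_frames).astype(int)

def to_model_range(frames):
    """Normalize uint8 pixels from [0, 255] to [-1, 1]."""
    return (frames.astype(np.float32) - 127.5) / 127.5

def pad_frames(frames, max_num_frames):
    """Repeat the last frame until the clip reaches max_num_frames."""
    if len(frames) >= max_num_frames:
        return frames[:max_num_frames]
    pad = np.repeat(frames[-1:], max_num_frames - len(frames), axis=0)
    return np.concatenate([frames, pad], axis=0)
```

In the real pipeline the indices would be passed to decord's `VideoReader.get_batch`, and the resize/crop step would sit between loading and normalization.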
3D VAE Frame Count Constraint
The CogVideoX 3D VAE requires the number of input frames to satisfy a 4k+1 constraint (e.g., 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49). The SFTDataset class includes logic (nearest_smaller_4k_plus_1) to automatically adjust the frame count for short videos to satisfy this constraint. For the standard CogVideoX-2B configuration, max_num_frames is set to 49.
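The 4k+1 adjustment is simple modular arithmetic; a minimal sketch of the `nearest_smaller_4k_plus_1` logic:

```python
def nearest_smaller_4k_plus_1(n: int) -> int:
    """Largest frame count <= n of the form 4k + 1, as required by
    the CogVideoX 3D VAE (sketch of the logic described above)."""
    return n - (n - 1) % 4
```

So a 50-frame clip is trimmed to 49, and a 48-frame clip to 45, before padding and batching.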
Usage
Use when preparing data for SAT-based CogVideoX fine-tuning:
- WebDataset: Choose for large distributed training jobs with thousands of training videos or more. Requires pre-processing videos into sharded tar archives with metadata.
- SFTDataset: Choose for smaller custom datasets (tens to hundreds of videos) for supervised fine-tuning. Simply organize `.mp4` files in a directory with corresponding `.txt` caption files.
The dataset format is selected via the data section of the YAML configuration:
```yaml
data:
  target: data_video.SFTDataset  # or data_video.VideoDataset
  params:
    video_size: [480, 720]
    fps: 8
    max_num_frames: 49
    skip_frms_num: 3.0
```
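The `target` field is a dotted import path that the config loader resolves to a class. A minimal sketch of that resolution (the real SAT loader does roughly this; since `data_video` only exists inside the training repo, the usage example resolves a stdlib class instead):

```python
import importlib

def resolve_target(target: str):
    """Resolve a dotted 'module.ClassName' string, as found in the
    YAML `target` field, into the class object."""
    module_name, cls_name = target.rsplit(".", 1)
    return getattr(importlib.import_module(module_name), cls_name)

# Demonstrated with a stdlib class; in SAT this would be
# resolve_target("data_video.SFTDataset")(**params).
OrderedDict = resolve_target("collections.OrderedDict")
```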
Theoretical Basis
WebDataset Streaming
WebDataset enables streaming from sharded tar files for distributed training without random access overhead. Traditional random-access datasets require either loading all data into memory or performing random seeks to disk, both of which scale poorly with dataset size. Tar-based sharding provides sequential I/O patterns that maximize disk bandwidth utilization and naturally partition data across distributed workers.
Frame Sampling at Target FPS
Videos are recorded at varying frame rates (24, 30, 60+ FPS), but the CogVideoX model expects a fixed number of frames at a specific temporal resolution. Frame sampling at target FPS (typically 8 FPS) with uniform index selection ensures that the temporal extent of the selected frames matches the model's expectations regardless of the source video's native frame rate.
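The rate conversion amounts to stepping through source frames with a stride of `src_fps / target_fps`. A sketch of the idea (not the exact SAT code):

```python
def fps_sample(num_frames: int, src_fps: float, target_fps: float):
    """Indices into the source video that approximate playback at
    target_fps: each sampled frame advances src_fps / target_fps
    source frames."""
    stride = src_fps / target_fps
    return [round(i * stride) for i in range(num_frames)]

# A 30 FPS source sampled at 8 FPS takes roughly every 3.75th frame,
# so 49 sampled frames span ~184 source frames (~6.1 s) either way.
```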
Skip Frames for Clean Boundaries
The skip_frms_num parameter (default 3 for SFTDataset) removes transition frames at the beginning and end of video clips. Many video clips, especially those from edited content, contain fade-in/fade-out or scene transitions at their boundaries. Excluding these frames provides cleaner training data that better represents the steady-state visual content.
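Concretely, the skip just narrows the index window sampling may draw from; a minimal sketch assuming the default of 3:

```python
def valid_frame_window(total_frames: int, skip_frms_num: float = 3.0):
    """Half-open frame range left after dropping the first and last
    skip_frms_num frames (default 3, as in SFTDataset)."""
    skip = int(skip_frms_num)
    return skip, total_frames - skip
```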