Principle: zai-org CogVideo SAT Dataset Preparation
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Knowledge Sources | CogVideo |
| Domains | Data_Pipeline, Video_Processing |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Technique for preparing video datasets in WebDataset or directory format for SAT-based CogVideoX training.
Description
SAT training supports two dataset formats, each optimized for different scales and use cases. Both formats share a common video processing pipeline that loads frames using decord, samples at a target FPS, resizes to model dimensions, and normalizes pixel values to the [-1, 1] range.
WebDataset Format (VideoDataset)
WebDataset stores training data as sharded tar archives, where each sample is a set of files with the same basename but different extensions. For CogVideoX training, each sample contains:
- An `.mp4` file with the video data.
- A caption file (e.g., `.txt` or a field named `caption`) with the text description.
- Metadata fields including `duration` (float, in seconds) and `fps` (float, source video frame rate).
The WebDataset format is designed for large-scale distributed training because:
- Streaming access: Data is read sequentially from tar files without random access overhead, enabling efficient I/O on both local disk and remote storage.
- Sharded parallelism: Different workers and ranks can read different shards simultaneously without coordination.
- Shuffle buffering: An in-memory shuffle buffer (default size 1000) provides stochastic sample ordering despite sequential reads.
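The sample layout described above (same basename, different extensions, packed into tar shards) can be produced with the standard library alone. The following is a minimal sketch; the shard name, sample names, and metadata fields are illustrative, not taken from the CogVideo repo.

```python
import io
import json
import tarfile

def write_shard(shard_path, samples):
    """Pack (basename, mp4_bytes, caption, metadata) tuples into one
    WebDataset-style tar shard: members share a basename and differ
    only in extension, so a reader groups them into one sample."""
    with tarfile.open(shard_path, "w") as tar:
        for basename, mp4_bytes, caption, meta in samples:
            for ext, payload in [
                (".mp4", mp4_bytes),
                (".txt", caption.encode("utf-8")),
                (".json", json.dumps(meta).encode("utf-8")),
            ]:
                info = tarfile.TarInfo(name=basename + ext)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

# Hypothetical usage: one sample carrying the duration/fps metadata
# that CogVideoX training expects.
write_shard(
    "shard-00000.tar",
    [("clip_0001", b"<mp4 bytes>", "a cat on a sofa",
      {"duration": 6.2, "fps": 30.0})],
)
```

At training time such shards would be read sequentially (e.g., with the `webdataset` library), which is what enables the streaming and sharded-parallelism properties listed above.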
Directory Format (SFTDataset)
SFTDataset reads from a directory structure containing .mp4 video files with corresponding .txt caption files. The caption files are located by replacing .mp4 with .txt and replacing videos with labels in the path. This format is designed for supervised fine-tuning (SFT) with smaller custom datasets because:
- Simple organization: No tar archiving step required; users simply organize videos and captions in directories.
- Random access: Standard PyTorch Dataset with `__getitem__` enables random shuffling.
- Variable-length handling: Each video is individually processed with adaptive frame count computation, with padding to `max_num_frames` using last-frame repetition.
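The caption-lookup convention described above (swap the `videos` directory for `labels` and the `.mp4` extension for `.txt`) can be sketched as a one-line path mapping; the example paths are hypothetical.

```python
def caption_path(video_path: str) -> str:
    """Map a video path to its caption path the way SFTDataset does:
    replace 'videos' with 'labels' and '.mp4' with '.txt'."""
    return video_path.replace("videos", "labels").replace(".mp4", ".txt")

# e.g. dataset/videos/clip_0001.mp4 -> dataset/labels/clip_0001.txt
```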
Video Processing Pipeline
Both formats share the same core processing steps:
- Frame loading: Use decord `VideoReader` to open the video file and read frame data.
- Frame sampling: Select `num_frames` frames uniformly distributed between the start and end points, respecting the ratio between source FPS and target FPS.
- Skip frames: The `skip_frms_num` parameter excludes the first and last N frames from consideration, avoiding transition frames at clip boundaries.
- Spatial transform: Resize frames to the target resolution using bicubic interpolation, then apply center or random cropping to match the exact model input dimensions.
- Normalization: Convert pixel values from [0, 255] to [-1, 1] via `(frames - 127.5) / 127.5`.
- Padding: If the video has fewer frames than required, the last frame is repeated to reach the target frame count.
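The sampling, normalization, and padding steps above can be sketched with NumPy arrays standing in for decoded frames. This is an illustrative reconstruction of the pipeline, not the exact `data_video` code; function names and the skip default are assumptions.

```python
import numpy as np

def sample_indices(total, num_frames, src_fps, target_fps, skip=3):
    """Uniformly spread num_frames indices over the span that
    num_frames at target_fps would cover in the source video,
    staying inside the [skip, total - skip) window."""
    start, end = skip, total - skip
    span = min(end - start, int(round(num_frames * src_fps / target_fps)))
    return np.linspace(start, start + span - 1, num_frames).astype(int)

def to_model_range(frames):
    """Normalize uint8 pixels from [0, 255] to [-1, 1]."""
    return (frames.astype(np.float32) - 127.5) / 127.5

def pad_frames(frames, max_num_frames):
    """Repeat the last frame until the clip reaches max_num_frames."""
    if len(frames) >= max_num_frames:
        return frames[:max_num_frames]
    pad = np.repeat(frames[-1:], max_num_frames - len(frames), axis=0)
    return np.concatenate([frames, pad], axis=0)
```

In the real pipeline the indices would be passed to decord's `VideoReader.get_batch`, and the resize/crop step would sit between loading and normalization.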
3D VAE Frame Count Constraint
The CogVideoX 3D VAE requires the number of input frames to satisfy a 4k+1 constraint (e.g., 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49). The SFTDataset class includes logic (nearest_smaller_4k_plus_1) to automatically adjust the frame count for short videos to satisfy this constraint. For the standard CogVideoX-2B configuration, max_num_frames is set to 49.
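The 4k+1 adjustment is simple modular arithmetic; a minimal sketch of the `nearest_smaller_4k_plus_1` logic:

```python
def nearest_smaller_4k_plus_1(n: int) -> int:
    """Largest frame count <= n of the form 4k + 1, as required by
    the CogVideoX 3D VAE (sketch of the logic described above)."""
    return n - (n - 1) % 4
```

So a 50-frame clip is trimmed to 49, and a 48-frame clip to 45, before padding and batching.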
Usage
Use when preparing data for SAT-based CogVideoX fine-tuning:
- WebDataset: Choose for large distributed training jobs with thousands of training videos or more. Requires pre-processing videos into sharded tar archives with metadata.
- SFTDataset: Choose for smaller custom datasets (tens to hundreds of videos) for supervised fine-tuning. Simply organize `.mp4` files in a directory with corresponding `.txt` caption files.
The dataset format is selected via the data section of the YAML configuration:
```yaml
data:
  target: data_video.SFTDataset  # or data_video.VideoDataset
  params:
    video_size: [480, 720]
    fps: 8
    max_num_frames: 49
    skip_frms_num: 3.0
```
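The `target` field is a dotted import path that the config loader resolves to a class. A minimal sketch of that resolution (the real SAT loader does roughly this; since `data_video` only exists inside the training repo, the usage example resolves a stdlib class instead):

```python
import importlib

def resolve_target(target: str):
    """Resolve a dotted 'module.ClassName' string, as found in the
    YAML `target` field, into the class object."""
    module_name, cls_name = target.rsplit(".", 1)
    return getattr(importlib.import_module(module_name), cls_name)

# Demonstrated with a stdlib class; in SAT this would be
# resolve_target("data_video.SFTDataset")(**params).
OrderedDict = resolve_target("collections.OrderedDict")
```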
Theoretical Basis
WebDataset Streaming
WebDataset enables streaming from sharded tar files for distributed training without random access overhead. Traditional random-access datasets require either loading all data into memory or performing random seeks to disk, both of which scale poorly with dataset size. Tar-based sharding provides sequential I/O patterns that maximize disk bandwidth utilization and naturally partition data across distributed workers.
Frame Sampling at Target FPS
Videos are recorded at varying frame rates (24, 30, 60+ FPS), but the CogVideoX model expects a fixed number of frames at a specific temporal resolution. Frame sampling at target FPS (typically 8 FPS) with uniform index selection ensures that the temporal extent of the selected frames matches the model's expectations regardless of the source video's native frame rate.
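The rate conversion amounts to stepping through source frames with a stride of `src_fps / target_fps`. A sketch of the idea (not the exact SAT code):

```python
def fps_sample(num_frames: int, src_fps: float, target_fps: float):
    """Indices into the source video that approximate playback at
    target_fps: each sampled frame advances src_fps / target_fps
    source frames."""
    stride = src_fps / target_fps
    return [round(i * stride) for i in range(num_frames)]

# A 30 FPS source sampled at 8 FPS takes roughly every 3.75th frame,
# so 49 sampled frames span ~184 source frames (~6.1 s) either way.
```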
Skip Frames for Clean Boundaries
The skip_frms_num parameter (default 3 for SFTDataset) removes transition frames at the beginning and end of video clips. Many video clips, especially those from edited content, contain fade-in/fade-out or scene transitions at their boundaries. Excluding these frames provides cleaner training data that better represents the steady-state visual content.
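Concretely, the skip just narrows the index window sampling may draw from; a minimal sketch assuming the default of 3:

```python
def valid_frame_window(total_frames: int, skip_frms_num: float = 3.0):
    """Half-open frame range left after dropping the first and last
    skip_frms_num frames (default 3, as in SFTDataset)."""
    skip = int(skip_frms_num)
    return skip, total_frames - skip
```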