Implementation:Zai org CogVideo T2V I2V Dataset Loader

Implementation Metadata
Name	T2V_I2V_Dataset_Loader
Type	API Doc
Category	Data_Engineering
Domains	Video_Generation, Fine_Tuning, Diffusion_Models
Knowledge Sources	CogVideo Repository, CogVideoX Paper
Last Updated	2026-02-10 00:00 GMT

Overview

T2V_I2V_Dataset_Loader loads and preprocesses video-text datasets for CogVideoX fine-tuning. It is provided by the CogVideo finetune package.

Description

This implementation provides two primary dataset classes: T2VDatasetWithResize for text-to-video tasks and I2VDatasetWithResize for image-to-video tasks. Both classes load video files, extract frames, resize them to the specified dimensions, and optionally pre-encode video latents and text embeddings into cached .safetensors files. An auxiliary script, extract_images.py, extracts the first frame of each video for I2V training.

Usage

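Use this implementation when setting up a data pipeline for CogVideoX fine-tuning. Instantiate the dataset classes with the resolution parameters (max_num_frames, height, width) and a data root directory containing the video files and caption metadata files. Arguments beyond the three resolution parameters are passed through *args and **kwargs to the base dataset class.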

Code Reference

Source Location

finetune/datasets/t2v_dataset.py:L194-229 -- T2VDatasetWithResize
finetune/datasets/i2v_dataset.py:L237-285 -- I2VDatasetWithResize
finetune/scripts/extract_images.py:L1-61 -- First-frame extraction script

Signature

class T2VDatasetWithResize(BaseT2VDataset):
    def __init__(
        self,
        max_num_frames: int,
        height: int,
        width: int,
        *args,
        **kwargs
    ) -> None:
        super().__init__(*args, **kwargs)
        self.max_num_frames = max_num_frames
        self.height = height
        self.width = width

class I2VDatasetWithResize(BaseI2VDataset):
    def __init__(
        self,
        max_num_frames: int,
        height: int,
        width: int,
        *args,
        **kwargs
    ) -> None:
        super().__init__(*args, **kwargs)
        self.max_num_frames = max_num_frames
        self.height = height
        self.width = width

Import

from finetune.datasets import T2VDatasetWithResize, I2VDatasetWithResize

Key Parameters

Parameter	Type	Description
`max_num_frames`	`int`	Maximum number of frames to extract from each video. The value `F` must satisfy `(F - 1) % 8 == 0` (e.g., 49).
`height`	`int`	Target height for resized video frames (e.g., 480).
`width`	`int`	Target width for resized video frames (e.g., 720).
`data_root`	`str`	Root directory containing video files and metadata.
`caption_column`	`str`	Path to file listing captions (e.g., `prompts.txt`).
`video_column`	`str`	Path to file listing video filenames (e.g., `videos.txt`).
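
The frame-count constraint on `max_num_frames` can be checked before constructing a dataset. The following is a minimal sketch; the helper names `is_valid_frame_count` and `nearest_valid_frame_count` are hypothetical and are not part of the CogVideo package.

```python
def is_valid_frame_count(num_frames: int) -> bool:
    """Return True when num_frames has the form 8 * k + 1 with k >= 0."""
    return num_frames >= 1 and (num_frames - 1) % 8 == 0


def nearest_valid_frame_count(num_frames: int) -> int:
    """Round down to the closest frame count of the form 8 * k + 1."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    return ((num_frames - 1) // 8) * 8 + 1


if __name__ == "__main__":
    for candidate in (48, 49, 50, 81):
        print(candidate, is_valid_frame_count(candidate), nearest_valid_frame_count(candidate))
```

Rounding down rather than up avoids requesting more frames than a short clip may contain.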

External Dependencies

cv2 -- Video frame extraction
decord -- Efficient video decoding
torchvision.transforms -- Image transformations
safetensors -- Cached latent storage

I/O Contract

Inputs

Input	Format	Description
Video files	`.mp4` files in `data_root/`	Raw video files to be processed.
Caption file	Text file (`prompts.txt`)	One caption per line, corresponding to videos.
Video list file	Text file (`videos.txt`)	One video filename per line.
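
Because captions and videos are matched by line position, a mismatch between the two files silently pairs the wrong caption with a video. The sketch below checks the pairing up front. It assumes that filenames in `videos.txt` are relative to `data_root` and that trailing blank lines can be ignored; the function names are hypothetical and not part of the CogVideo package.

```python
from pathlib import Path


def read_lines(path: Path) -> list[str]:
    """Read a text file into stripped lines, dropping trailing blank lines only."""
    lines = path.read_text(encoding="utf-8").splitlines()
    while lines and not lines[-1].strip():
        lines.pop()
    return [line.strip() for line in lines]


def load_pairs(
    data_root: str,
    caption_file: str = "prompts.txt",
    video_file: str = "videos.txt",
) -> list[tuple[str, Path]]:
    """Return (caption, video_path) pairs after checking that the two files line up."""
    root = Path(data_root)
    captions = read_lines(root / caption_file)
    # Assumption: each entry in the video list is a path relative to data_root.
    videos = [root / name for name in read_lines(root / video_file)]

    if len(captions) != len(videos):
        raise ValueError(
            f"{len(captions)} captions but {len(videos)} videos; the files must align line by line"
        )

    missing = [str(path) for path in videos if not path.is_file()]
    if missing:
        raise FileNotFoundError(f"Missing video files: {missing}")

    return list(zip(captions, videos))
```

Running a check like this once before training is cheaper than discovering a misaligned caption file after latents have been encoded and cached.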

Outputs

Output	Format	Description
Dataset item	`dict` with keys `"encoded_video"` (Tensor) and `"prompt_embedding"` (Tensor)	Pre-encoded video latents and text embeddings per sample.
Cached latents	`.safetensors` files in `data_root/cache/`	Persisted VAE latents and T5 embeddings for reuse across epochs.
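
A cache that is reused across epochs needs a stable key per sample. One common approach is to hash the caption text, sketched below with the standard library only. The `prompt_embeddings` subdirectory and the `prompt_cache_path` helper are assumptions made for illustration; the layout the package actually uses inside `data_root/cache/` may differ.

```python
import hashlib
from pathlib import Path


def prompt_cache_path(data_root: str, prompt: str) -> Path:
    """Derive a deterministic cache file path for a caption.

    Hypothetical layout: data_root/cache/prompt_embeddings/<sha256>.safetensors
    """
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return Path(data_root) / "cache" / "prompt_embeddings" / f"{digest}.safetensors"


if __name__ == "__main__":
    path = prompt_cache_path("/path/to/data", "A panda playing guitar")
    # An existing file at this path would mean the embedding can be loaded instead of recomputed.
    print(path.name, path.exists())
```

Hashing the caption means identical prompts share one cached embedding, and editing a caption naturally invalidates its old entry.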

Usage Examples

T2V Dataset Initialization

from finetune.datasets import T2VDatasetWithResize

dataset = T2VDatasetWithResize(
    max_num_frames=49,
    height=480,
    width=720,
    data_root="/path/to/data",
    caption_column="prompts.txt",
    video_column="videos.txt",
)

# Access a sample
sample = dataset[0]
encoded_video = sample["encoded_video"]   # Tensor [C, F, H, W]
prompt_emb = sample["prompt_embedding"]   # Tensor [seq_len, hidden_size]

I2V Dataset Initialization

from finetune.datasets import I2VDatasetWithResize

dataset = I2VDatasetWithResize(
    max_num_frames=49,
    height=480,
    width=720,
    data_root="/path/to/data",
    caption_column="prompts.txt",
    video_column="videos.txt",
)

First-Frame Extraction

python finetune/scripts/extract_images.py \
    --video_dir /path/to/videos \
    --output_dir /path/to/first_frames
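
For readers who want to see what first-frame extraction involves, the following is a minimal sketch using OpenCV (`cv2`, listed under External Dependencies). It is not the content of extract_images.py; the function names and the choice of PNG output named after the video stem are assumptions.

```python
from pathlib import Path


def first_frame_path(video_path: Path, output_dir: Path) -> Path:
    """Map a video file to its output image path (assumed naming: <stem>.png)."""
    return output_dir / f"{video_path.stem}.png"


def extract_first_frame(video_path: Path, output_dir: Path) -> Path:
    """Decode the first frame of a video and write it to output_dir as a PNG."""
    import cv2  # imported lazily so the path helper works without OpenCV installed

    capture = cv2.VideoCapture(str(video_path))
    try:
        ok, frame = capture.read()  # ok is False when no frame could be decoded
    finally:
        capture.release()
    if not ok:
        raise RuntimeError(f"Could not read a frame from {video_path}")

    output_dir.mkdir(parents=True, exist_ok=True)
    target = first_frame_path(video_path, output_dir)
    if not cv2.imwrite(str(target), frame):
        raise RuntimeError(f"Could not write {target}")
    return target
```

Keeping the image name tied to the video stem makes it straightforward to pair each extracted frame with its source video during I2V training.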
