Implementation:Zai org CogVideo T2V I2V Dataset Loader

From Leeroopedia


Implementation Metadata

Name: T2V_I2V_Dataset_Loader
Type: API Doc
Category: Data_Engineering
Domains: Video_Generation, Fine_Tuning, Diffusion_Models
Knowledge Sources: CogVideo Repository, CogVideoX Paper
Last Updated: 2026-02-10 00:00 GMT

Overview

T2V_I2V_Dataset_Loader is a concrete tool for loading and preprocessing video-text datasets for CogVideoX fine-tuning, provided by the CogVideo finetune package.

Description

This implementation provides two primary dataset classes -- T2VDatasetWithResize for text-to-video tasks and I2VDatasetWithResize for image-to-video tasks. Both classes handle video file loading, frame extraction, resizing to specified dimensions, and optional pre-encoding of video latents and text embeddings into cached .safetensors files. An auxiliary script extract_images.py is provided for extracting the first frame from videos for I2V training.

Usage

Use this implementation when setting up a data pipeline for CogVideoX fine-tuning. The dataset classes are instantiated with resolution parameters and a data root directory containing video files and caption metadata files.

Code Reference

Source Location

  • finetune/datasets/t2v_dataset.py:L194-229 -- T2VDatasetWithResize
  • finetune/datasets/i2v_dataset.py:L237-285 -- I2VDatasetWithResize
  • finetune/scripts/extract_images.py:L1-61 -- First-frame extraction script

Signature

class T2VDatasetWithResize(BaseT2VDataset):
    def __init__(
        self,
        max_num_frames: int,
        height: int,
        width: int,
        *args,
        **kwargs
    ) -> None:
        super().__init__(*args, **kwargs)
        self.max_num_frames = max_num_frames
        self.height = height
        self.width = width


class I2VDatasetWithResize(BaseI2VDataset):
    def __init__(
        self,
        max_num_frames: int,
        height: int,
        width: int,
        *args,
        **kwargs
    ) -> None:
        super().__init__(*args, **kwargs)
        self.max_num_frames = max_num_frames
        self.height = height
        self.width = width

Import

from finetune.datasets import T2VDatasetWithResize, I2VDatasetWithResize

Key Parameters

Parameter | Type | Description
max_num_frames | int | Maximum number of frames to extract from each video. Must satisfy (max_num_frames - 1) % 8 == 0 (e.g., 49).
height | int | Target height for resized video frames (e.g., 480).
width | int | Target width for resized video frames (e.g., 720).
data_root | str | Root directory containing video files and metadata.
caption_column | str | Path to file listing captions (e.g., prompts.txt).
video_column | str | Path to file listing video filenames (e.g., videos.txt).
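
The frame-count constraint on max_num_frames can be checked before a dataset is constructed. The following is a minimal sketch, not part of the CogVideo package: the helper names is_valid_frame_count and floor_to_valid_frame_count are hypothetical and only illustrate the (F - 1) % 8 == 0 rule stated above.

```python
def is_valid_frame_count(num_frames: int) -> bool:
    """Return True if num_frames has the form 8 * k + 1 with k >= 0."""
    return num_frames >= 1 and (num_frames - 1) % 8 == 0


def floor_to_valid_frame_count(num_frames: int) -> int:
    """Round num_frames down to the largest valid count (hypothetical helper)."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    return ((num_frames - 1) // 8) * 8 + 1


if __name__ == "__main__":
    for candidate in (48, 49, 50, 81):
        print(candidate, is_valid_frame_count(candidate),
              floor_to_valid_frame_count(candidate))
```

Rounding down rather than up is a design choice for this sketch: it never asks for more frames than a clip of the given length could supply.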

External Dependencies

  • cv2 -- Video frame extraction
  • decord -- Efficient video decoding
  • torchvision.transforms -- Image transformations
  • safetensors -- Cached latent storage

I/O Contract

Inputs

Input | Format | Description
Video files | .mp4 files in data_root/ | Raw video files to be processed.
Caption file | Text file (prompts.txt) | One caption per line, corresponding to videos.
Video list file | Text file (videos.txt) | One video filename per line.
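
Because captions and video filenames are matched by line position, a mismatch in line counts silently misaligns the data. The sketch below shows one way to sanity-check the two files before training. It assumes the file names prompts.txt and videos.txt from the table above and that video paths are relative to data_root; load_manifest is a hypothetical helper, not a function of the CogVideo package.

```python
from pathlib import Path
from typing import List, Tuple


def load_manifest(
    data_root: str,
    caption_file: str = "prompts.txt",
    video_file: str = "videos.txt",
) -> List[Tuple[Path, str]]:
    """Pair each video path with the caption on the same line number."""
    root = Path(data_root)
    prompts = [
        line.strip()
        for line in (root / caption_file).read_text(encoding="utf-8").splitlines()
    ]
    videos = [
        root / line.strip()
        for line in (root / video_file).read_text(encoding="utf-8").splitlines()
    ]
    # The two files are aligned by position, so their lengths must agree.
    if len(prompts) != len(videos):
        raise ValueError(
            f"{caption_file} has {len(prompts)} lines but "
            f"{video_file} has {len(videos)} lines"
        )
    return list(zip(videos, prompts))
```

Calling load_manifest("/path/to/data") returns a list of (video_path, caption) tuples, or raises ValueError if the two files disagree in length.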

Outputs

Output | Format | Description
Dataset item | dict with keys "encoded_video" (Tensor) and "prompt_embedding" (Tensor) | Pre-encoded video latents and text embeddings per sample.
Cached latents | .safetensors files in data_root/cache/ | Persisted VAE latents and T5 embeddings for reuse across epochs.
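
For a cache to be reusable across epochs, each sample needs a stable file name. The sketch below illustrates one possible scheme: keying a prompt-embedding file by the SHA-256 hash of the prompt text. The subdirectory name prompt_embeddings, the hashing scheme, and the helper prompt_cache_path are assumptions made for illustration; the layout the package actually uses under data_root/cache/ may differ.

```python
import hashlib
from pathlib import Path


def prompt_cache_path(data_root: str, prompt: str) -> Path:
    """Map a prompt to a deterministic .safetensors path (illustrative layout)."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    # "prompt_embeddings" is an assumed subdirectory name, not a confirmed one.
    return Path(data_root) / "cache" / "prompt_embeddings" / f"{digest}.safetensors"


if __name__ == "__main__":
    path = prompt_cache_path("/path/to/data", "a panda playing guitar")
    # A loader would encode and save only when the file is missing.
    print(path, "exists" if path.exists() else "missing")
```

Hashing the prompt keeps file names short and filesystem-safe, and identical prompts map to the same cached file.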

Usage Examples

T2V Dataset Initialization

from finetune.datasets import T2VDatasetWithResize

dataset = T2VDatasetWithResize(
    max_num_frames=49,
    height=480,
    width=720,
    data_root="/path/to/data",
    caption_column="prompts.txt",
    video_column="videos.txt",
)

# Access a sample
sample = dataset[0]
encoded_video = sample["encoded_video"]   # Tensor [C, F, H, W]
prompt_emb = sample["prompt_embedding"]   # Tensor [seq_len, hidden_size]

I2V Dataset Initialization

from finetune.datasets import I2VDatasetWithResize

dataset = I2VDatasetWithResize(
    max_num_frames=49,
    height=480,
    width=720,
    data_root="/path/to/data",
    caption_column="prompts.txt",
    video_column="videos.txt",
)

First-Frame Extraction

python finetune/scripts/extract_images.py \
    --video_dir /path/to/videos \
    --output_dir /path/to/first_frames
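
The core of first-frame extraction can be sketched with OpenCV, which is listed among the dependencies above. This is an illustrative sketch rather than the contents of extract_images.py: the function names, the .png output format, and the naming of each image after its video's stem are all assumptions.

```python
from pathlib import Path


def first_frame_path(video_path: str, output_dir: str) -> Path:
    """Name the output image after the video's stem (assumed convention)."""
    return Path(output_dir) / (Path(video_path).stem + ".png")


def extract_first_frame(video_path: str, output_dir: str) -> Path:
    """Decode the first frame of a video and write it as a PNG."""
    import cv2  # imported lazily so the path helper works without OpenCV

    capture = cv2.VideoCapture(str(video_path))
    try:
        ok, frame = capture.read()  # the first read() yields the first frame
    finally:
        capture.release()
    if not ok:
        raise RuntimeError(f"could not read a frame from {video_path}")

    out_path = first_frame_path(video_path, output_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    if not cv2.imwrite(str(out_path), frame):
        raise RuntimeError(f"could not write {out_path}")
    return out_path
```

Calling extract_first_frame("/path/to/videos/clip.mp4", "/path/to/first_frames") would write /path/to/first_frames/clip.png under these assumptions.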
