Implementation:Zai org CogVideo T2V I2V Dataset Loader
| Implementation Metadata | |
|---|---|
| Name | T2V_I2V_Dataset_Loader |
| Type | API Doc |
| Category | Data_Engineering |
| Domains | Video_Generation, Fine_Tuning, Diffusion_Models |
| Knowledge Sources | CogVideo Repository, CogVideoX Paper |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
T2V_I2V_Dataset_Loader is a concrete tool for loading and preprocessing video-text datasets for CogVideoX fine-tuning, provided by the CogVideo finetune package.
Description
This implementation provides two primary dataset classes -- T2VDatasetWithResize for text-to-video tasks and I2VDatasetWithResize for image-to-video tasks. Both classes handle video file loading, frame extraction, resizing to specified dimensions, and optional pre-encoding of video latents and text embeddings into cached .safetensors files. An auxiliary script extract_images.py is provided for extracting the first frame from videos for I2V training.
Usage
Use this implementation when setting up a data pipeline for CogVideoX fine-tuning. The dataset classes are instantiated with resolution parameters and a data root directory containing video files and caption metadata files.
Code Reference
Source Location
- finetune/datasets/t2v_dataset.py:L194-229 -- T2VDatasetWithResize
- finetune/datasets/i2v_dataset.py:L237-285 -- I2VDatasetWithResize
- finetune/scripts/extract_images.py:L1-61 -- First-frame extraction script
Signature
class T2VDatasetWithResize(BaseT2VDataset):
    def __init__(
        self,
        max_num_frames: int,
        height: int,
        width: int,
        *args,
        **kwargs
    ) -> None:
        super().__init__(*args, **kwargs)
        self.max_num_frames = max_num_frames
        self.height = height
        self.width = width

class I2VDatasetWithResize(BaseI2VDataset):
    def __init__(
        self,
        max_num_frames: int,
        height: int,
        width: int,
        *args,
        **kwargs
    ) -> None:
        super().__init__(*args, **kwargs)
        self.max_num_frames = max_num_frames
        self.height = height
        self.width = width
Import
from finetune.datasets import T2VDatasetWithResize, I2VDatasetWithResize
Key Parameters
| Parameter | Type | Description |
|---|---|---|
| max_num_frames | int | Maximum number of frames to extract from each video. Must satisfy (F - 1) % 8 == 0, where F is the frame count (e.g., 49). |
| height | int | Target height for resized video frames (e.g., 480). |
| width | int | Target width for resized video frames (e.g., 720). |
| data_root | str | Root directory containing video files and metadata. |
| caption_column | str | Path to file listing captions (e.g., prompts.txt). |
| video_column | str | Path to file listing video filenames (e.g., videos.txt). |
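The frame-count constraint on max_num_frames can be checked before a dataset is built. The following is a minimal sketch, not code from the CogVideo repository: the helper names `is_valid_frame_count` and `snap_frame_count` are hypothetical, and the stride of 8 is taken directly from the (F - 1) % 8 == 0 rule stated in the table.

```python
def is_valid_frame_count(num_frames: int, stride: int = 8) -> bool:
    """Return True when (num_frames - 1) is a non-negative multiple of stride."""
    return num_frames >= 1 and (num_frames - 1) % stride == 0


def snap_frame_count(num_frames: int, stride: int = 8) -> int:
    """Round down to the nearest frame count of the form stride * k + 1."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    return ((num_frames - 1) // stride) * stride + 1
```

Rounding down rather than up means the snapped value never asks for more frames than the clip contains.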
External Dependencies
- cv2 -- Video frame extraction
- decord -- Efficient video decoding
- torchvision.transforms -- Image transformations
- safetensors -- Cached latent storage
I/O Contract
Inputs
| Input | Format | Description |
|---|---|---|
| Video files | .mp4 files in data_root/ | Raw video files to be processed. |
| Caption file | Text file (prompts.txt) | One caption per line, corresponding to videos. |
| Video list file | Text file (videos.txt) | One video filename per line. |
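Because captions and videos are matched by line position, a mismatch in line counts silently misaligns the data unless it is checked. The sketch below shows one way to read and pair the two files using only the standard library. The function name `load_pairs` is hypothetical, and resolving each video filename relative to data_root is an assumption rather than confirmed behavior of the dataset classes.

```python
from pathlib import Path


def load_pairs(data_root, caption_file="prompts.txt", video_file="videos.txt"):
    """Pair line i of the video list with line i of the caption file."""
    root = Path(data_root)
    captions = [
        line.strip()
        for line in (root / caption_file).read_text(encoding="utf-8").splitlines()
    ]
    # Assumption: video entries are paths relative to data_root.
    videos = [
        root / line.strip()
        for line in (root / video_file).read_text(encoding="utf-8").splitlines()
    ]
    if len(captions) != len(videos):
        raise ValueError(
            f"{len(captions)} captions but {len(videos)} videos; "
            "the two files must have the same number of lines"
        )
    return list(zip(videos, captions))
```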
Outputs
| Output | Format | Description |
|---|---|---|
| Dataset item | dict with keys "encoded_video" (Tensor) and "prompt_embedding" (Tensor) | Pre-encoded video latents and text embeddings per sample. |
| Cached latents | .safetensors files in data_root/cache/ | Persisted VAE latents and T5 embeddings for reuse across epochs. |
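A cache that is reused across epochs needs a filename that is stable for a given input. This page does not state how the loader names its cache files, so the sketch below is only one hypothetical scheme: it hashes a key string (for example a caption or a video filename) with SHA-256 and uses the hex digest as the file stem. The function name `cache_path` and the choice of hash are assumptions.

```python
import hashlib
from pathlib import Path


def cache_path(cache_dir, key: str, suffix: str = ".safetensors") -> Path:
    """Map a key string to a deterministic cache file path (hypothetical scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}{suffix}"
```

With a scheme like this, a loader can test `cache_path(...).exists()` and skip re-encoding when the file is already present.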
Usage Examples
T2V Dataset Initialization
from finetune.datasets import T2VDatasetWithResize
dataset = T2VDatasetWithResize(
    max_num_frames=49,
    height=480,
    width=720,
    data_root="/path/to/data",
    caption_column="prompts.txt",
    video_column="videos.txt",
)
# Access a sample
sample = dataset[0]
encoded_video = sample["encoded_video"] # Tensor [C, F, H, W]
prompt_emb = sample["prompt_embedding"] # Tensor [seq_len, hidden_size]
I2V Dataset Initialization
from finetune.datasets import I2VDatasetWithResize
dataset = I2VDatasetWithResize(
    max_num_frames=49,
    height=480,
    width=720,
    data_root="/path/to/data",
    caption_column="prompts.txt",
    video_column="videos.txt",
)
First-Frame Extraction
python finetune/scripts/extract_images.py \
    --video_dir /path/to/videos \
    --output_dir /path/to/first_frames
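The internals of extract_images.py are not reproduced on this page. As a rough illustration of what first-frame extraction involves, the sketch below reads one frame with OpenCV and writes it as an image named after the video. The function names, the .png extension, and the output naming are assumptions, not the script's actual behavior.

```python
from pathlib import Path


def frame_output_path(video_path, output_dir, ext: str = ".png") -> Path:
    """Build the image path for a video: same stem, new directory and extension."""
    return Path(output_dir) / (Path(video_path).stem + ext)


def extract_first_frame(video_path, output_dir) -> Path:
    """Save the first decodable frame of video_path into output_dir."""
    import cv2  # imported lazily so the path helper works without OpenCV

    out_path = frame_output_path(video_path, output_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)

    cap = cv2.VideoCapture(str(video_path))
    try:
        ok, frame = cap.read()  # ok is False if no frame could be decoded
    finally:
        cap.release()

    if not ok:
        raise RuntimeError(f"could not read a frame from {video_path}")
    if not cv2.imwrite(str(out_path), frame):
        raise RuntimeError(f"could not write {out_path}")
    return out_path
```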
Related Pages
- Principle:Zai_org_CogVideo_Dataset_Preparation
- Environment:Zai_org_CogVideo_Diffusers_Finetuning_Environment
- Heuristic:Zai_org_CogVideo_Data_Preparation_Best_Practices
- Heuristic:Zai_org_CogVideo_Frame_Count_and_Resolution_Constraints
- Heuristic:Zai_org_CogVideo_Decord_Import_Order_Bug
- Implementation:Zai_org_CogVideo_Args_Parse_Args