Heuristic: zai-org CogVideo Data Preparation Best Practices
| Knowledge Sources | |
|---|---|
| Domains | Data_Preparation, Finetuning, Video_Generation |
| Last Updated | 2026-02-10 02:00 GMT |
Overview
Video data preparation guidelines: use videos with 25+ frames, prepare at least 50 videos for style fine-tuning, preprocess with center-crop + resize to preserve aspect ratio, import decord after torch, and always delete the latent cache after modifying training data.
Description
Proper data preparation is critical for CogVideoX fine-tuning quality. The training pipeline has several non-obvious requirements and behaviors: naive resize distorts aspect ratios, the latent caching system can serve stale data, decord has an import-order bug, and short videos are silently padded by repeating the last frame. These issues are not always visible as errors but significantly impact output quality.
Usage
Apply these practices before starting any fine-tuning run to ensure training-data quality, and re-check them whenever you modify training data between runs.
The Insight (Rule of Thumb)
Video quantity:
- Style fine-tuning: At least 50 videos with similar style (SAT README recommendation).
- Concept training: Videos with 25+ frames work best for training new concepts and styles.
Aspect ratio:
- Action: Always preprocess videos with center-crop + resize to maintain aspect ratio before training.
- Why: The default training code uses a naive resize, which distorts aspect ratios whenever a sample's resolution does not match the training resolution. The Gradio demo implements a proper center-crop-resize that can serve as a reference.
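The crop-then-resize geometry can be sketched as pure arithmetic. This is a minimal sketch, not the demo's actual code; the function name and return convention are illustrative:

```python
def center_crop_box(width, height, target_width=720, target_height=480):
    """Return the (left, top, right, bottom) crop box that, once resized
    to the target resolution, preserves the source aspect ratio."""
    # Scale so the source covers the target in both dimensions,
    # mirroring resize_factor = max(width_factor, height_factor).
    scale = max(target_width / width, target_height / height)
    # Crop region in source coordinates that maps exactly onto the target.
    crop_w = round(target_width / scale)
    crop_h = round(target_height / scale)
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h
```

For a 1920x1080 source targeting 720x480, the box is `(150, 0, 1770, 1080)`: the full height is kept and 150 px are trimmed from each side, so no distortion occurs.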
Latent caching:
- Action: Delete the `cache/` directory under your data root whenever you modify training data.
- Why: Latent encodings are cached to disk per-model and per-resolution. Stale caches cause the model to train on old data even after you update videos.
Import ordering:
- Action: Always import `decord` AFTER `torch`.
- Why: Importing decord before torch can cause segmentation faults on some systems (known decord issue).
Short video handling:
- Behavior: Videos shorter than `max_num_frames` are automatically padded by repeating the last frame.
- Implication: This means short clips can be used, but the model may learn to generate static endings.
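The padding behavior can be illustrated with a small sketch (illustrative only; the training code operates on decoded frame tensors, not Python lists):

```python
def pad_frames(frames, max_num_frames):
    """Pad a short clip to max_num_frames by repeating its last frame,
    mirroring how the pipeline handles videos shorter than the limit."""
    if len(frames) >= max_num_frames:
        return frames[:max_num_frames]
    # Repeat the final frame until the clip reaches the required length.
    return frames + [frames[-1]] * (max_num_frames - len(frames))
```

For example, a 2-frame clip padded to 5 frames becomes `["f0", "f1", "f1", "f1", "f1"]`, which is exactly the static-ending pattern the model can overfit to.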
I2V conditioning:
- Behavior: If no `--image_column` is specified for I2V training, the system automatically extracts the first frame from each video as the conditioning image.
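The fallback logic amounts to a simple selection rule. A hedged sketch (the function name and argument shapes are illustrative, not the trainer's actual interface):

```python
def conditioning_source(video_frames, image=None):
    """Pick the I2V conditioning image: the user-supplied image if one
    was provided via --image_column, else the video's first frame."""
    return image if image is not None else video_frames[0]
```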
Identifier tokens:
- Action: Use `--id_token` to specify an identifier token for Dreambooth-style training.
- Why: Produces better training results by associating a unique token with the target concept.
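The usual mechanism is to prepend the identifier token to each caption; a minimal sketch of that behavior (the exact formatting in the fine-tuning code may differ):

```python
def build_prompt(caption, id_token=None):
    """Prepend the Dreambooth-style identifier token, if any, so the
    model learns to associate it with the target concept."""
    if not id_token:
        return caption
    return f"{id_token} {caption.lstrip()}"
```

With `--id_token sks`, the caption "a cat wearing a hat" would train as "sks a cat wearing a hat".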
Reasoning
Aspect ratio distortion warning from `finetune/README.md:80-82`:
"For samples that don't match the training resolution, the code will directly resize them. This may cause aspect ratio distortion and affect training results."
Latent cache warning from `finetune/README.md:84-86`:
"If you modify the data after training, please delete the latent directory under the video directory to ensure the latest data is used."
Decord import order from `finetune/datasets/utils.py:10-12`:
```python
# Must import after torch because this can sometimes lead to a nasty
# segmentation fault, or stack smashing error
import decord  # isort:skip
```
Center-crop-resize implementation from `inference/gradio_composite_demo/app.py:115-130` (excerpt):
```python
def center_crop_resize(input_video_path, target_width=720, target_height=480):
    cap = cv2.VideoCapture(input_video_path)
    # ... calculates resize_factor = max(width_factor, height_factor)
    # ... applies center crop after resize to maintain aspect ratio
```