Heuristic:Zai org CogVideo Frame Count and Resolution Constraints
| Knowledge Sources | |
|---|---|
| Domains | Video_Generation, Configuration, Architecture_Constraint |
| Last Updated | 2026-02-10 02:00 GMT |
Overview
Frame count must follow the 8N+1 (CogVideoX) or 16N+1 (CogVideoX1.5) formula, and CogVideoX-5B is locked to a 480x720 resolution. Only I2V models support custom resolution.
Description
CogVideoX models have strict architectural constraints on frame count and spatial resolution. The frame count step derives from the VAE temporal compression ratio (4x) multiplied by the transformer temporal patch size (2x), so frame counts must be `8N+1` for CogVideoX; CogVideoX1.5 doubles the step, requiring `16N+1`. The +1 accounts for the reference/conditioning frame. Resolution constraints come from the fixed positional encodings used during pre-training. Violating either constraint causes tensor shape mismatches or significantly degraded output quality.
Usage
Apply these constraints before configuring any training or inference run to ensure valid frame count and resolution parameters. Check whenever users specify custom `--num_frames`, `--height`, or `--width` arguments.
The Insight (Rule of Thumb)
Frame Count:
- CogVideoX (2B/5B): Frames = 8N + 1, where N <= 6. Valid values: 9, 17, 25, 33, 41, 49 (default).
- CogVideoX1.5 (5B): Frames = 16N + 1, where N <= 10. Valid values: 17, 33, 49, 65, 81 (default), 97, 113, 129, 145, 161.
- SAT inference: The frame setting is the internal latent frame count, not the pixel frame count. CogVideoX-5B/2B: 9, 11, or 13. CogVideoX1.5-5B: 42 or 22.
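The pixel-space frame rules above can be checked before a run starts. The following is a minimal sketch, not code from the repository: the family labels and all function names are hypothetical, and only the step sizes (8, 16) and upper bounds (N <= 6, N <= 10) come from the list above.

```python
# Hypothetical helper; step sizes and bounds are taken from the rules listed above.
FRAME_RULES = {
    "cogvideox": (8, 6),       # frames = 8N + 1, N <= 6
    "cogvideox1.5": (16, 10),  # frames = 16N + 1, N <= 10
}


def valid_frame_counts(family):
    """Return every allowed frame count for a model family, in ascending order."""
    step, max_n = FRAME_RULES[family]
    return [step * n + 1 for n in range(1, max_n + 1)]


def is_valid_frame_count(family, frames):
    """True if `frames` satisfies the family's formula and upper bound."""
    return frames in valid_frame_counts(family)


def nearest_valid_frame_count(family, frames):
    """Pick the closest allowed value; ties resolve to the smaller count."""
    return min(valid_frame_counts(family), key=lambda v: (abs(v - frames), v))


if __name__ == "__main__":
    print(valid_frame_counts("cogvideox"))
    print(is_valid_frame_count("cogvideox1.5", 81))
    print(nearest_valid_frame_count("cogvideox", 50))
```

Snapping to the nearest valid value is one possible policy; rejecting the request outright, as the fine-tuning validator does, is the stricter alternative.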
Resolution:
- CogVideoX-2B: Fixed at 480 x 720 pixels.
- CogVideoX-5B: Fixed at 480 x 720 pixels. The fine-tuning argument validator raises a `ValueError` for any other size.
- CogVideoX1.5-5B: Default 768 x 1360 pixels.
- Only I2V models support custom resolution (the output adapts to the input image). T2V models are forced back to the default resolution.
FPS:
- CogVideoX: 8 fps (default).
- CogVideoX1.5: 16 fps (default).
Prompt length:
- CogVideoX: Maximum 226 tokens.
- CogVideoX1.5: Maximum 224 tokens.
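The per-model defaults listed so far (frame count, resolution, FPS, prompt length) can be gathered into one lookup table. This is a sketch for illustration only: the `ModelDefaults` class, the `DEFAULTS` mapping, its keys, and `default_duration_seconds` are hypothetical names, not identifiers from the CogVideo code base; the numbers are the ones stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelDefaults:
    height: int             # pixels
    width: int              # pixels
    num_frames: int         # pixel-space frame count
    fps: int                # frames per second
    max_prompt_tokens: int  # prompt length limit


# Keys are illustrative labels, not official checkpoint identifiers.
DEFAULTS = {
    "cogvideox-2b": ModelDefaults(480, 720, 49, 8, 226),
    "cogvideox-5b": ModelDefaults(480, 720, 49, 8, 226),
    "cogvideox1.5-5b": ModelDefaults(768, 1360, 81, 16, 224),
}


def default_duration_seconds(model):
    """Clip length implied by the default frame count and frame rate."""
    d = DEFAULTS[model]
    return d.num_frames / d.fps


if __name__ == "__main__":
    for name in DEFAULTS:
        print(name, DEFAULTS[name], default_duration_seconds(name))
```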
Short video handling:
- Videos shorter than the target frame count are padded by repeating the last frame.
Reasoning
The frame count formula `8N+1` comes from the model architecture: VAE temporal compression ratio of 4x combined with transformer temporal patch size of 2x gives `4 * 2 = 8`. The +1 is the conditioning frame that anchors the video.
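The arithmetic can be made concrete with a small sketch. It assumes, and this is an assumption rather than something stated on this page, that the +1 frame occupies its own latent and that each following group of four pixel frames compresses into one latent frame. Under that assumption the CogVideoX pixel frame counts 33, 41, and 49 map to the latent frame counts 9, 11, and 13 quoted for SAT inference. The function name is hypothetical.

```python
VAE_TEMPORAL_COMPRESSION = 4  # 4x temporal compression, as stated above


def latent_frame_count(pixel_frames):
    """Latent frames for a clip, assuming the first frame gets its own latent.

    Assumption (not confirmed by this page): latents = (frames - 1) / 4 + 1.
    """
    if pixel_frames < 1 or (pixel_frames - 1) % VAE_TEMPORAL_COMPRESSION != 0:
        raise ValueError("pixel_frames - 1 must be a non-negative multiple of 4")
    return (pixel_frames - 1) // VAE_TEMPORAL_COMPRESSION + 1


if __name__ == "__main__":
    for frames in (33, 41, 49):
        print(frames, "->", latent_frame_count(frames))
    # 33 -> 9
    # 41 -> 11
    # 49 -> 13
```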
Resolution constraints come from `finetune/schemas/args.py:151-157`:
    model_name = info.data.get("model_name", "")
    if model_name in ["cogvideox-5b-i2v", "cogvideox-5b-t2v"]:
        if (height, width) != (480, 720):
            raise ValueError(
                "For cogvideox-5b models, height must be 480 and width must be 720"
            )
Frame validation from `finetune/schemas/args.py:147-149`:
    if (frames - 1) % 8 != 0:
        raise ValueError("Number of frames - 1 must be a multiple of 8")
Last-frame padding from `finetune/datasets/utils.py:132-140`:
    if video_num_frames < max_num_frames:
        frames = video_reader.get_batch(list(range(video_num_frames)))
        last_frame = frames[-1:]
        num_repeats = max_num_frames - video_num_frames
        repeated_frames = last_frame.repeat(num_repeats, 1, 1, 1)
        frames = torch.cat([frames, repeated_frames], dim=0)
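The effect of the padding branch can be seen without tensors by looking only at which source frame ends up in each output slot. The sketch below is a dependency-free illustration with a hypothetical function name; it covers only the short-video case shown in the excerpt and makes no claim about how longer videos are sampled.

```python
def padded_frame_indices(video_num_frames, max_num_frames):
    """Source-frame index for each output slot when a short clip is padded.

    Mirrors the excerpt above: read every frame, then repeat the last one
    until the clip reaches max_num_frames.
    """
    if video_num_frames <= 0:
        raise ValueError("video must contain at least one frame")
    if video_num_frames > max_num_frames:
        # Longer clips take a different code path that this sketch does not model.
        raise ValueError("this sketch only covers clips no longer than the target")
    num_repeats = max_num_frames - video_num_frames
    return list(range(video_num_frames)) + [video_num_frames - 1] * num_repeats


if __name__ == "__main__":
    print(padded_frame_indices(5, 9))
    # [0, 1, 2, 3, 4, 4, 4, 4, 4]
```

Because the tail of a padded clip is a frozen image, very short source videos contribute little motion signal during training.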
Related Pages
- Implementation:Zai_org_CogVideo_Args_Parse_Args
- Implementation:Zai_org_CogVideo_T2V_I2V_Dataset_Loader
- Implementation:Zai_org_CogVideo_CogVideoXPipeline_Call
- Implementation:Zai_org_CogVideo_CogVideoXI2VPipeline_Call
- Principle:Zai_org_CogVideo_Training_Configuration
- Principle:Zai_org_CogVideo_Text_to_Video_Generation