Principle:Zai org CogVideo I2V Video Export
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Knowledge Sources | Repo (CogVideo), Paper (CogVideoX) |
| Domains | Video_Generation, Diffusion_Models, Image_Conditioning |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Technique for encoding the frames generated by the image-to-video (I2V) pipeline into a playable MP4 file.
Description
After the I2V pipeline generates video frames, the frames are encoded into an MP4 file. This step is functionally identical to text-to-video (T2V) export but is documented separately as part of the I2V workflow for completeness and traceability.
Export Process
The export process takes the list of PIL Image frames produced by the I2V pipeline and encodes them sequentially into an H.264-compressed MP4 video container. The specified frame rate (fps) is recorded with the video and determines the playback speed of the resulting file.
Frame Rate
CogVideoX models use a default frame rate of 16 fps. With the default 81 frames, this produces approximately 5 seconds of video (81 / 16 ≈ 5.06 s). The frame rate can be adjusted to produce slower or faster playback without changing the number of generated frames.
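The relationship between frame count, frame rate, and duration is plain arithmetic. As a minimal sketch (the helper name `clip_duration_seconds` is hypothetical and not part of the CogVideo repository):

```python
def clip_duration_seconds(num_frames: int, fps: float) -> float:
    """Return the playback duration of a clip in seconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# Defaults quoted above: 81 frames at 16 fps.
default_duration = clip_duration_seconds(81, 16)  # 5.0625 seconds

# The same 81 frames exported at 8 fps play for twice as long.
slow_duration = clip_duration_seconds(81, 8)  # 10.125 seconds
```

Only the duration changes between the two calls; the frame content is identical.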
Output Format
The output is a standard MP4 file with H.264 video encoding. This format is widely supported across video players, web browsers, and video editing software.
Usage
Use as the final step in the I2V pipeline after obtaining generated frames from the pipeline call. The export function takes the list of PIL Image frames and writes them to disk as an MP4 file.
Typical workflow:
- Generate frames via the I2V pipeline call.
- Access the frames via `output.frames[0]`.
- Export the frames to an MP4 file using `export_to_video`.
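The workflow above can be sketched as a small wrapper. This is an illustration under assumptions, not the repository's own code: the wrapper name `export_i2v_output` and its `export_fn` parameter are hypothetical, and `export_to_video` is assumed to be the utility in `diffusers.utils`. The dry run at the end uses stand-in objects, so no model or video encoder is involved.

```python
from types import SimpleNamespace


def export_i2v_output(output, output_path, fps=16, export_fn=None):
    """Write the first generated video in `output` to `output_path`.

    `output` is assumed to expose `.frames`, a list of videos in which
    each video is a list of PIL Image frames.
    """
    if export_fn is None:
        # Imported lazily so the sketch can be exercised without diffusers.
        from diffusers.utils import export_to_video
        export_fn = export_to_video

    frames = output.frames[0]  # first video in the batch
    if len(frames) == 0:
        raise ValueError("no frames to export")
    return export_fn(frames, output_path, fps=fps)


# Dry run with placeholders: a fake pipeline output and a recording exporter.
calls = []


def record_export(frames, path, fps):
    calls.append((len(frames), path, fps))
    return path


fake_output = SimpleNamespace(frames=[["frame0", "frame1", "frame2"]])
result = export_i2v_output(fake_output, "output.mp4", fps=16, export_fn=record_export)
```

In real use, `output` would be the object returned by the I2V pipeline call and `export_fn` would be left as `None`, so the frames are passed to `export_to_video` together with the output path and `fps`.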
Theoretical Basis
Sequential Frame Encoding
Video export follows the standard approach of sequential frame encoding into a compressed video container. Each PIL Image frame is converted to a raw pixel array and passed to an H.264 encoder, which compresses the sequence using:
- Intra-frame compression (I-frames): Individual frames are compressed using spatial redundancy within the frame.
- Inter-frame compression (P-frames and B-frames): Temporal redundancy between consecutive frames is exploited to achieve higher compression ratios.
The H.264 codec is chosen for its excellent balance of compression efficiency, decoding speed, and universal hardware support. The resulting MP4 container is a standardized format (ISO/IEC 14496-14) that encapsulates the compressed video stream with metadata including frame rate and resolution.
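To illustrate why exploiting temporal redundancy helps, the toy comparison below compresses a sequence of nearly identical frames two ways: each frame independently, and the first frame plus the differences between consecutive frames. It is a deliberately simplified sketch that uses `zlib` and XOR deltas on synthetic byte buffers. H.264 itself relies on motion-compensated prediction, transforms, and entropy coding, none of which are modelled here, and all function names are hypothetical.

```python
import random
import zlib


def make_frames(num_frames=8, frame_size=4096, changed_bytes=16, seed=0):
    """Build synthetic frames where each differs from the previous by a few bytes."""
    rng = random.Random(seed)
    frame = bytearray(rng.randrange(256) for _ in range(frame_size))
    frames = [bytes(frame)]
    for _ in range(num_frames - 1):
        for _ in range(changed_bytes):
            frame[rng.randrange(frame_size)] = rng.randrange(256)
        frames.append(bytes(frame))
    return frames


def intra_only_size(frames):
    """Total size when every frame is compressed on its own."""
    return sum(len(zlib.compress(f)) for f in frames)


def inter_size(frames):
    """Total size when only the first frame is stored whole, then XOR deltas."""
    total = len(zlib.compress(frames[0]))
    for prev, cur in zip(frames, frames[1:]):
        delta = bytes(a ^ b for a, b in zip(prev, cur))  # mostly zeros
        total += len(zlib.compress(delta))
    return total


frames = make_frames()
intra_total = intra_only_size(frames)
inter_total = inter_size(frames)
print(f"independent frames: {intra_total} bytes, first frame + deltas: {inter_total} bytes")
```

Because consecutive frames share almost all of their content, the delta buffers are mostly zeros and compress far better than the frames themselves, which is the same intuition behind inter-frame prediction.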
Frame Rate and Temporal Perception
The frame rate of 16 fps is selected to match the temporal resolution at which the CogVideoX model was trained. Using a different frame rate during export changes the perceived speed of motion but does not alter the content of the generated frames. Higher frame rates (e.g., 24 or 30 fps) would produce faster playback, while lower frame rates would produce slower playback.
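The effect of the export frame rate on perceived speed can be expressed as a ratio to the native rate. A minimal sketch, assuming the 16 fps figure quoted above (the names `NATIVE_FPS` and `playback_speed_factor` are hypothetical):

```python
NATIVE_FPS = 16  # frame rate quoted above for CogVideoX output


def playback_speed_factor(export_fps: float, native_fps: float = NATIVE_FPS) -> float:
    """Return how much faster (>1) or slower (<1) motion appears at export_fps."""
    if export_fps <= 0 or native_fps <= 0:
        raise ValueError("frame rates must be positive")
    return export_fps / native_fps


speed_at_24 = playback_speed_factor(24)  # 1.5x faster than native
speed_at_30 = playback_speed_factor(30)  # 1.875x faster than native
speed_at_8 = playback_speed_factor(8)    # 0.5x, i.e. half speed
```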