Principle:Zai_org_CogVideo_SAT_Inference_Configuration
| Attribute | Value |
|---|---|
| Principle Name | SAT Inference Configuration |
| Workflow | SAT Video Generation |
| Step | 1 of 5 |
| Type | Configuration |
| Repository | zai-org/CogVideo |
| Paper | CogVideoX |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Technique for configuring SAT inference parameters including sampling resolution, frame count, and prompt input method. SAT inference configuration extends the training configuration with inference-specific parameters, enabling precise control over video generation outputs without modifying the underlying model architecture.
Description
SAT inference configuration extends the training configuration with inference-specific parameters: sampling image size, frame count, FPS, input method (CLI or file), and image-to-video (I2V) flag. Configuration is loaded from YAML files and overridden via command-line arguments.
The configuration system operates in two layers:
- Base configuration is loaded from YAML files specified via the `--base` argument. These files define model architecture, conditioner settings, and sampler parameters.
- CLI overrides add inference-specific parameters such as `--sampling-image-size`, `--sampling-num-frames`, `--sampling-fps`, `--input-type`, `--input-file`, and `--image2video`.
This layered approach allows the same YAML config used during training to be reused for inference, with only generation-specific parameters added at invocation time.
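The two-layer merge can be sketched as follows. This is a minimal, dependency-free illustration of the layering idea, not the repository's actual loader; the dictionary keys and values are hypothetical stand-ins for fields parsed from a training YAML and from CLI flags:

```python
def layered_config(base_cfg: dict, cli_overrides: dict) -> dict:
    """Inference config = training YAML base, with CLI flags layered on top.

    Only explicitly set CLI values (non-None) override the base; every other
    field falls through to the training-time configuration unchanged.
    """
    merged = dict(base_cfg)
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged

# Base config as it might be parsed from a training YAML (illustrative values)
base = {
    "model": "cogvideox_2b",
    "sampling_num_frames": 13,
    "sampling_fps": 8,
}

# CLI overrides: flags the user did not pass default to None and are ignored
cli = {
    "sampling_num_frames": 32,
    "sampling_image_size": (768, 1360),
    "input_type": None,
}

cfg = layered_config(base, cli)
# cfg["sampling_num_frames"] is overridden to 32; cfg["sampling_fps"] stays 8
```

The key property is that an unset CLI flag leaves the training value untouched, which is what makes a training YAML safe to reuse at inference time.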
Usage
Use SAT Inference Configuration when preparing to run video generation with the SAT-based CogVideoX pipeline. The configuration step must precede model loading, prompt input, and sampling. Typical invocation:
```shell
python sample_video.py \
    --base configs/cogvideox_2b.yaml \
    --sampling-image-size 768 1360 \
    --sampling-num-frames 32 \
    --sampling-fps 8 \
    --input-type cli \
    --image2video
```
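For batch generation, the interactive CLI input can be swapped for file-based input using the `--input-type` and `--input-file` flags listed above; the prompt file name here is illustrative:

```shell
python sample_video.py \
    --base configs/cogvideox_2b.yaml \
    --sampling-image-size 768 1360 \
    --sampling-num-frames 32 \
    --sampling-fps 8 \
    --input-type txt \
    --input-file prompts.txt
```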
Theoretical Basis
Inference configuration separates generation parameters from model architecture, enabling the same trained model to generate at different resolutions and frame counts. This separation of concerns follows the principle of configuration over code: the model weights and architecture remain fixed, while output characteristics (resolution, duration, frame rate) are controlled entirely through external parameters.
Key design decisions include:
- Resolution independence: The diffusion model operates in latent space, so output resolution is controlled by the sampling image size divided by the VAE downsampling factor (typically 8).
- Temporal flexibility: Frame count determines the number of latent frames processed by the temporal attention layers, constrained only by GPU memory.
- Input abstraction: CLI and file-based input modes are interchangeable from the model's perspective, allowing both interactive and batch workflows.
Related Pages
- Implementation:Zai_org_CogVideo_SAT_Inference_Get_Args -- Implementation of argument parsing for SAT inference
- Zai_org_CogVideo_SAT_Model_Loading_for_Inference -- Next step: loading the model with parsed configuration
- Zai_org_CogVideo_SAT_Prompt_Input -- Prompt input controlled by the `--input-type` parameter
- Zai_org_CogVideo_Diffusion_Sampling -- Sampling step that consumes resolution and frame count parameters