Principle:Zai_org_CogVideo_SAT_Inference_Configuration
| Attribute | Value |
|---|---|
| Principle Name | SAT Inference Configuration |
| Workflow | SAT Video Generation |
| Step | 1 of 5 |
| Type | Configuration |
| Repository | zai-org/CogVideo |
| Paper | CogVideoX |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Technique for configuring SAT inference parameters including sampling resolution, frame count, and prompt input method. SAT inference configuration extends the training configuration with inference-specific parameters, enabling precise control over video generation outputs without modifying the underlying model architecture.
Description
SAT inference configuration extends the training configuration with inference-specific parameters: sampling image size, frame count, FPS, input method (CLI or file), and image-to-video (I2V) flag. Configuration is loaded from YAML files and overridden via command-line arguments.
The configuration system operates in two layers:
- Base configuration is loaded from YAML files specified via the `--base` argument. These files define model architecture, conditioner settings, and sampler parameters.
- CLI overrides add inference-specific parameters such as `--sampling-image-size`, `--sampling-num-frames`, `--sampling-fps`, `--input-type`, `--input-file`, and `--image2video`.
This layered approach allows the same YAML config used during training to be reused for inference, with only generation-specific parameters added at invocation time.
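The two-layer merge can be sketched as follows. This is a minimal, dependency-free illustration of the layering idea, not the repository's actual loader; the dictionary keys and values are hypothetical stand-ins for fields parsed from a training YAML and from CLI flags:

```python
def layered_config(base_cfg: dict, cli_overrides: dict) -> dict:
    """Inference config = training YAML base, with CLI flags layered on top.

    Only explicitly set CLI values (non-None) override the base; every other
    field falls through to the training-time configuration unchanged.
    """
    merged = dict(base_cfg)
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged

# Base config as it might be parsed from a training YAML (illustrative values)
base = {
    "model": "cogvideox_2b",
    "sampling_num_frames": 13,
    "sampling_fps": 8,
}

# CLI overrides: flags the user did not pass default to None and are ignored
cli = {
    "sampling_num_frames": 32,
    "sampling_image_size": (768, 1360),
    "input_type": None,
}

cfg = layered_config(base, cli)
# cfg["sampling_num_frames"] is overridden to 32; cfg["sampling_fps"] stays 8
```

The key property is that an unset CLI flag leaves the training value untouched, which is what makes a training YAML safe to reuse at inference time.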
Usage
Use SAT Inference Configuration when preparing to run video generation with the SAT-based CogVideoX pipeline. The configuration step must precede model loading, prompt input, and sampling. Typical invocation:
```shell
python sample_video.py \
    --base configs/cogvideox_2b.yaml \
    --sampling-image-size 768 1360 \
    --sampling-num-frames 32 \
    --sampling-fps 8 \
    --input-type cli \
    --image2video
```
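For batch generation, the interactive CLI input can be swapped for file-based input using the `--input-type` and `--input-file` flags listed above; the prompt file name here is illustrative:

```shell
python sample_video.py \
    --base configs/cogvideox_2b.yaml \
    --sampling-image-size 768 1360 \
    --sampling-num-frames 32 \
    --sampling-fps 8 \
    --input-type txt \
    --input-file prompts.txt
```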
Theoretical Basis
Inference configuration separates generation parameters from model architecture, enabling the same trained model to generate at different resolutions and frame counts. This separation of concerns follows the principle of configuration over code: the model weights and architecture remain fixed, while output characteristics (resolution, duration, frame rate) are controlled entirely through external parameters.
Key design decisions include:
- Resolution independence: The diffusion model operates in latent space, so output resolution is controlled by the sampling image size divided by the VAE downsampling factor (typically 8).
- Temporal flexibility: Frame count determines the number of latent frames processed by the temporal attention layers, constrained only by GPU memory.
- Input abstraction: CLI and file-based input modes are interchangeable from the model's perspective, allowing both interactive and batch workflows.
Related Pages
- Implementation:Zai_org_CogVideo_SAT_Inference_Get_Args -- Implementation of argument parsing for SAT inference
- Zai_org_CogVideo_SAT_Model_Loading_for_Inference -- Next step: loading the model with parsed configuration
- Zai_org_CogVideo_SAT_Prompt_Input -- Prompt input controlled by the `--input-type` parameter
- Zai_org_CogVideo_Diffusion_Sampling -- Sampling step that consumes resolution and frame count parameters