
Principle:Zai org CogVideo SAT Inference Configuration

From Leeroopedia


Principle Name: SAT Inference Configuration
Workflow: SAT Video Generation
Step: 1 of 5
Type: Configuration
Repository: zai-org/CogVideo
Paper: CogVideoX
Last Updated: 2026-02-10 00:00 GMT

Overview

A technique for configuring SAT inference parameters: sampling resolution, frame count, and prompt input method. SAT inference configuration extends the training configuration with inference-specific parameters, enabling precise control over video generation outputs without modifying the underlying model architecture.

Description

SAT inference configuration extends the training configuration with inference-specific parameters: sampling image size, frame count, FPS, input method (CLI or file), and image-to-video (I2V) flag. Configuration is loaded from YAML files and overridden via command-line arguments.

The configuration system operates in two layers:

  1. Base configuration is loaded from YAML files specified via the --base argument. These files define model architecture, conditioner settings, and sampler parameters.
  2. CLI overrides add inference-specific parameters such as --sampling-image-size, --sampling-num-frames, --sampling-fps, --input-type, --input-file, and --image2video.

This layered approach allows the same YAML config used during training to be reused for inference, with only generation-specific parameters added at invocation time.
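The two-layer loading described above can be sketched as follows. This is a simplified illustration rather than the actual SAT loader: it assumes PyYAML and the standard-library argparse, and the function name `load_inference_config` is hypothetical.

```python
import argparse
import yaml  # PyYAML, assumed available


def load_inference_config(base_yaml_text: str, cli_args: list) -> dict:
    """Merge a base training YAML with inference-only CLI overrides."""
    # Layer 1: base configuration (model architecture, conditioner, sampler),
    # normally read from the file passed via --base.
    config = yaml.safe_load(base_yaml_text)

    # Layer 2: inference-specific CLI overrides, mirroring the flags
    # listed above (argparse converts dashes to underscores in keys).
    parser = argparse.ArgumentParser()
    parser.add_argument("--sampling-image-size", nargs=2, type=int)
    parser.add_argument("--sampling-num-frames", type=int)
    parser.add_argument("--sampling-fps", type=int)
    parser.add_argument("--input-type")
    parser.add_argument("--input-file")
    parser.add_argument("--image2video", action="store_true")
    overrides = vars(parser.parse_args(cli_args))

    # Apply only the overrides that were actually given on the command line,
    # so untouched training settings pass through to inference unchanged.
    config.update({k: v for k, v in overrides.items() if v not in (None, False)})
    return config
```

The filter in the final step is the key design point: absent flags stay `None` (or `False` for the store-true flag) and are dropped, so the merged dictionary is the training config plus exactly the generation parameters the user supplied.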

Usage

Use SAT Inference Configuration when preparing to run video generation with the SAT-based CogVideoX pipeline. The configuration step must precede model loading, prompt input, and sampling. Typical invocation:

python sample_video.py \
    --base configs/cogvideox_2b.yaml \
    --sampling-image-size 768 1360 \
    --sampling-num-frames 32 \
    --sampling-fps 8 \
    --input-type cli \
    --image2video
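The same script also supports batch generation through the --input-type and --input-file flags mentioned above, reading prompts from a text file instead of the CLI. A sketch of such an invocation (the txt value and the prompts.txt file name are illustrative assumptions; consult the repository's CLI help for the exact accepted values):

```shell
# Batch mode: one prompt per line in prompts.txt
# ("txt" input-type value and file name are illustrative)
python sample_video.py \
    --base configs/cogvideox_2b.yaml \
    --sampling-num-frames 32 \
    --sampling-fps 8 \
    --input-type txt \
    --input-file prompts.txt
```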

Theoretical Basis

Inference configuration separates generation parameters from model architecture, enabling the same trained model to generate at different resolutions and frame counts. This separation of concerns follows the principle of configuration over code: the model weights and architecture remain fixed, while output characteristics (resolution, duration, frame rate) are controlled entirely through external parameters.

Key design decisions include:

  • Resolution independence: The diffusion model operates in latent space; latents are generated at the sampling image size divided by the VAE downsampling factor (typically 8), then decoded back up to the full output resolution.
  • Temporal flexibility: Frame count determines the number of latent frames processed by the temporal attention layers, constrained only by GPU memory.
  • Input abstraction: CLI and file-based input modes are interchangeable from the model's perspective, allowing both interactive and batch workflows.
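As a concrete instance of the resolution-independence point, the latent grid implied by a sampling configuration can be computed directly. This is a minimal sketch assuming a purely spatial 8x VAE factor; it ignores any temporal compression the real pipeline may apply, and the function name is hypothetical.

```python
def latent_shape(height: int, width: int, num_frames: int,
                 spatial_downsample: int = 8) -> tuple:
    """Latent-space shape implied by the sampling parameters.

    The diffusion model denoises latents, so the spatial grid is the
    sampling image size divided by the VAE downsampling factor
    (typically 8 for CogVideoX-style VAEs). Frame count maps directly
    to latent frames in this simplified sketch.
    """
    assert height % spatial_downsample == 0 and width % spatial_downsample == 0, \
        "sampling image size must be divisible by the VAE downsampling factor"
    return (num_frames, height // spatial_downsample, width // spatial_downsample)
```

For the invocation above (768x1360 at 32 frames), this yields a 96x170 latent grid per frame, which is what the temporal attention layers actually process.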
