Principle: CogVideoX Model Loading and LoRA Injection
| Principle Metadata | |
|---|---|
| Name | Model_Loading_and_LoRA_Injection |
| Category | Model_Architecture |
| Domains | Video_Generation, Fine_Tuning, Diffusion_Models |
| Knowledge Sources | CogVideo Repository, CogVideoX Paper, LoRA Paper |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Model Loading and LoRA Injection is a technique for loading pretrained video diffusion model components and injecting Low-Rank Adaptation (LoRA) adapters for parameter-efficient fine-tuning.
Description
Loading a CogVideoX model involves separately instantiating its sub-components (tokenizer, T5 text encoder, CogVideoX transformer, VAE, scheduler) from a pretrained checkpoint. LoRA injection then adds low-rank adapter matrices to specified attention modules (to_q, to_k, to_v, to_out) of the transformer, allowing fine-tuning with drastically fewer trainable parameters.
The loading process follows a specific order:
- Tokenizer: loaded from the pretrained model's tokenizer subdirectory using `AutoTokenizer`.
- Text Encoder: T5 encoder model loaded for computing text conditioning embeddings.
- Transformer: the core `CogVideoXTransformer3DModel` that performs the denoising diffusion process.
- VAE: `AutoencoderKLCogVideoX` for encoding videos to latent space and decoding back to pixel space.
- Scheduler: `CogVideoXDPMScheduler` for managing the noise schedule during training and inference.
After loading, LoRA adapters are injected into the transformer using PEFT's LoraConfig. Only the LoRA parameters are set to require gradients; all other model parameters remain frozen.
Usage
Use when fine-tuning CogVideoX models with limited GPU memory or when wanting to preserve the base model weights and create swappable adapters. LoRA fine-tuning is the recommended approach for most users as it requires significantly less VRAM than full fine-tuning and produces compact adapter files (~50-200 MB vs. multi-GB full checkpoints).
Theoretical Basis
LoRA (Low-Rank Adaptation) decomposes weight updates as:
- W' = W + BA

where B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), and the rank r is much smaller than min(d, k). This reduces the number of trainable parameters from d · k to (d + k) · r.
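As a worked example of the savings (the 3072-dimensional projection is illustrative, not a figure from the source):

```python
d = k = 3072      # illustrative square projection dimensions
r = 128           # LoRA rank

full_params = d * k            # trainable params for a full weight update
lora_params = (d + k) * r      # trainable params for B (d x r) and A (r x k)

print(full_params)                 # 9437184
print(lora_params)                 # 786432
print(lora_params / full_params)   # ~0.083, i.e. roughly 12x fewer parameters
```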
The lora_alpha scaling factor controls the magnitude of the adaptation. The effective scaling applied to the LoRA output is lora_alpha / rank. For CogVideoX:
- Default rank: r = 128
- Default alpha: lora_alpha = 64
- Effective scaling: 64 / 128 = 0.5
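The effective scaling shows up directly in the adapted forward pass. A minimal NumPy sketch (the tiny matrix sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 8, 2                  # toy dimensions for illustration
rank, lora_alpha = 128, 64         # CogVideoX defaults
scaling = lora_alpha / rank        # 0.5

W = rng.standard_normal((d, k))    # frozen base weight
A = rng.standard_normal((r, k))    # trainable, random init
B = np.zeros((d, r))               # trainable, zero init

x = rng.standard_normal(k)
# Adapted forward pass: base output plus scaled low-rank update.
y = W @ x + scaling * (B @ (A @ x))
```

Because B is initialized to zero, the adapter contributes nothing at the start of training, so fine-tuning begins exactly at the base model's behavior.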
The target modules are the attention projection layers in the CogVideoX transformer:
- to_q -- Query projection
- to_k -- Key projection
- to_v -- Value projection
- to_out.0 -- Output projection
These layers are chosen because attention projections are the primary mechanism for learning content-specific patterns, while other layers (feed-forward networks, normalization) capture more general structural information.
Components that are only needed during encoding (text encoder, VAE) are placed on the UNLOAD_LIST and offloaded from GPU memory after their latents have been pre-computed, freeing VRAM for the transformer during training.
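The offloading step might be sketched as follows; the UNLOAD_LIST name follows the source, but the helper function and device handling are assumptions.

```python
import torch

# Components only needed while pre-computing text embeddings and video latents.
UNLOAD_LIST = ["text_encoder", "vae"]

def unload_encoders(components: dict) -> None:
    """Move encode-only components to CPU once their outputs are cached,
    freeing GPU memory for the transformer during training."""
    for name in UNLOAD_LIST:
        module = components.get(name)
        if module is not None:
            module.to("cpu")
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # release the freed blocks back to the device
```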