Principle: Zai org CogVideoX Inference LoRA Loading
Overview
Technique for dynamically loading pre-trained LoRA adapter weights into an inference pipeline for customized video generation.
Description
At inference time, LoRA (Low-Rank Adaptation) adapters trained on custom datasets can be loaded into the base pipeline. The adapter weights are loaded from safetensors files and can be fused into the transformer for zero-overhead inference, or kept separate for dynamic switching between multiple adapters.
The LoRA loading workflow involves two key steps:
- Loading -- The adapter weights are read from a .safetensors file and registered as named adapters on the pipeline's transformer component
- Fusing -- The low-rank weight matrices are merged directly into the base model weights, eliminating any runtime overhead from the adaptation
When fusing is performed, the adapter weights are merged into the base model weights via the formula W' = W + scale * B @ A, where B and A are the low-rank matrices and scale controls the adaptation strength.
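The equivalence between the unfused path (base output plus adapter output) and the fused path (a single matmul with merged weights) can be illustrated with a toy example. This is a minimal NumPy sketch with made-up shapes, not the diffusers internals:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 8, 6, 2                  # full dimensions and low rank (toy sizes)
W = rng.standard_normal((d, k))    # pretrained weight matrix
B = rng.standard_normal((d, r))    # low-rank factor B, shape (d, r)
A = rng.standard_normal((r, k))    # low-rank factor A, shape (r, k)
scale = 0.8
x = rng.standard_normal(k)         # an input vector

# Unfused: base path plus the separate low-rank adapter path
y_unfused = W @ x + scale * (B @ (A @ x))

# Fused: merge the adapter into the weights once, then use a single matmul
W_fused = W + scale * B @ A
y_fused = W_fused @ x

assert np.allclose(y_unfused, y_fused)
```

Because W_fused has the same shape as W, it replaces the original weight in place, which is why fused inference carries no extra runtime cost.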
Usage
Use when generating videos with a fine-tuned CogVideoX model. This is an optional step -- skip if using the base model without fine-tuning.
Typical workflow:
- Load the base pipeline with CogVideoXPipeline.from_pretrained()
- Load LoRA weights with pipe.load_lora_weights()
- Fuse the weights with pipe.fuse_lora()
- Proceed with scheduler configuration and generation
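The steps above can be sketched with the diffusers API. This is an illustrative sketch, not a complete script: the LoRA directory and weight file name are placeholders, and generation arguments are trimmed.

```python
import torch
from diffusers import CogVideoXPipeline

# Load the base pipeline (the public CogVideoX-2b checkpoint is shown here)
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b", torch_dtype=torch.float16
)

# Register the adapter weights from a .safetensors file
# (the path and weight_name below are placeholders for your trained adapter)
pipe.load_lora_weights(
    "path/to/lora_dir", weight_name="pytorch_lora_weights.safetensors"
)

# Merge the adapter into the transformer; lora_scale controls adaptation strength
pipe.fuse_lora(lora_scale=1.0)

pipe.to("cuda")
video = pipe(prompt="a panda playing guitar", num_inference_steps=50).frames[0]
```

Skipping fuse_lora() keeps the adapter separate, which is slower per step but allows switching between multiple named adapters at runtime.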
Theoretical Basis
LoRA adapters add low-rank weight matrices to attention layers in the transformer. For a pretrained weight matrix W, the adapted weight is:
- W' = W + scale * B @ A
Where:
- B is a matrix of shape (d, r)
- A is a matrix of shape (r, k)
- r is the rank (much smaller than d and k)
- scale controls the adaptation strength (default 1.0)
Fusing the weights (W' = W + scale * B @ A) eliminates runtime overhead since the adapted weights replace the original weights directly. The scale parameter allows controlling how strongly the fine-tuned behavior influences generation.
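The role of scale can be checked numerically: at scale = 0 the fused weights equal the base weights, and larger values apply a proportionally larger update. A toy NumPy sketch with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 8, 6, 2
W = rng.standard_normal((d, k))
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, k))

def fuse(W, B, A, scale):
    """Merge a LoRA update into the base weights: W' = W + scale * B @ A."""
    return W + scale * B @ A

# scale = 0 disables the adapter entirely, recovering the base model
assert np.allclose(fuse(W, B, A, 0.0), W)

# Doubling scale doubles the size of the applied update
delta_half = fuse(W, B, A, 0.5) - W
delta_full = fuse(W, B, A, 1.0) - W
assert np.allclose(2 * delta_half, delta_full)
```

This linearity is what makes scale a practical knob for blending the fine-tuned behavior with the base model.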