
Principle:Zai org CogVideo Inference LoRA Loading

From Leeroopedia


Overview

Technique for dynamically loading pre-trained LoRA adapter weights into an inference pipeline for customized video generation.

Description

At inference time, LoRA (Low-Rank Adaptation) adapters trained on custom datasets can be loaded into the base pipeline. The adapter weights are loaded from safetensors files and can be fused into the transformer for zero-overhead inference, or kept separate for dynamic switching between multiple adapters.

The LoRA loading workflow involves two key steps:

  • Loading -- The adapter weights are read from a .safetensors file and registered as named adapters on the pipeline's transformer component
  • Fusing -- The low-rank weight matrices are merged directly into the base model weights, eliminating any runtime overhead from the adaptation

When fusing is performed, the adapter weights are permanently merged into the model via the formula W' = W + scale * B @ A, where B and A are the low-rank matrices and scale controls the adaptation strength.
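When the adapter is instead kept unfused, diffusers-style pipelines can register several named adapters and switch between them at runtime. A minimal sketch, assuming a pipeline object `pipe` with PEFT-backed LoRA support; the file names and adapter names are hypothetical:

```python
def activate_adapters(pipe):
    """Load two hypothetical LoRA files and switch between them without fusing."""
    # Register each .safetensors file under a distinct adapter name
    pipe.load_lora_weights("style_a.safetensors", adapter_name="style_a")
    pipe.load_lora_weights("style_b.safetensors", adapter_name="style_b")

    # Activate a single adapter...
    pipe.set_adapters(["style_a"])
    # ...or blend both, weighting each adapter's contribution
    pipe.set_adapters(["style_a", "style_b"], adapter_weights=[0.7, 0.3])
```

Because the low-rank matrices stay separate from the base weights here, each switch is cheap, at the cost of a small extra computation per forward pass.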

Usage

Use when generating videos with a fine-tuned CogVideoX model. This is an optional step -- skip if using the base model without fine-tuning.

Typical workflow:

  1. Load the base pipeline with CogVideoXPipeline.from_pretrained()
  2. Load LoRA weights with pipe.load_lora_weights()
  3. Fuse the weights with pipe.fuse_lora()
  4. Proceed with scheduler configuration and generation
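The four steps above can be sketched with the diffusers API. The model ID, LoRA filename, adapter name, and prompt below are illustrative assumptions, not values from this page:

```python
# Sketch of the CogVideoX LoRA inference workflow (diffusers-style API).
# Model ID, LoRA filename, adapter name, and prompt are illustrative assumptions.

def main():
    import torch
    from diffusers import CogVideoXPipeline

    # 1. Load the base pipeline
    pipe = CogVideoXPipeline.from_pretrained(
        "THUDM/CogVideoX-2b", torch_dtype=torch.bfloat16
    )

    # 2. Load the LoRA adapter weights from a .safetensors file
    pipe.load_lora_weights("my_lora.safetensors", adapter_name="custom")

    # 3. Fuse the low-rank updates into the base weights (W' = W + scale * B @ A)
    pipe.fuse_lora(lora_scale=1.0)

    # 4. Continue with scheduler configuration and generation
    video = pipe(prompt="a drone shot over a forest", num_frames=49).frames[0]
    return video

if __name__ == "__main__":
    main()  # requires a GPU and downloads the base checkpoint
```

Skipping steps 2 and 3 leaves the base model unmodified, matching the optional nature of LoRA loading noted above.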

Theoretical Basis

LoRA adapters add low-rank weight matrices to attention layers in the transformer. For a pretrained weight matrix W, the adapted weight is:

W' = W + scale * B @ A

Where:

  • W is the pretrained weight matrix of shape (d, k)
  • B is a matrix of shape (d, r)
  • A is a matrix of shape (r, k)
  • r is the rank (much smaller than d and k)
  • scale controls the adaptation strength (default 1.0)

Fusing the weights (W' = W + scale * B @ A) eliminates runtime overhead, since the adapted weights replace the original weights directly. The scale parameter controls how strongly the fine-tuned behavior influences generation.
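The equivalence between the unfused form (base output plus low-rank correction) and the fused form (a single merged matrix) can be checked numerically. A small NumPy sketch with arbitrary toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 8, 2  # toy dimensions; the rank r is much smaller than d and k
scale = 1.0

W = rng.standard_normal((d, k))  # pretrained weight matrix
B = rng.standard_normal((d, r))  # low-rank factors learned during fine-tuning
A = rng.standard_normal((r, k))
x = rng.standard_normal(k)       # an arbitrary input vector

# Unfused: base path plus a separate low-rank path each forward pass
y_unfused = W @ x + scale * (B @ (A @ x))

# Fused: merge once (W' = W + scale * B @ A), then a single matmul at inference
W_fused = W + scale * (B @ A)
y_fused = W_fused @ x

assert np.allclose(y_unfused, y_fused)
```

Since W_fused has the same shape as W, fusing changes neither the model architecture nor the per-layer cost of inference.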
