Principle: Zai org CogVideo LoRA Export and Inference
| Principle Metadata | |
|---|---|
| Name | LoRA_Export_and_Inference |
| Category | Inference |
| Domains | Video_Generation, Fine_Tuning, Diffusion_Models |
| Knowledge Sources | CogVideo Repository, CogVideoX Paper, LoRA Paper |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
LoRA Export and Inference is a technique for loading trained LoRA adapter weights into a base model pipeline and fusing them for efficient inference.
Description
After LoRA fine-tuning, adapter weights are saved separately from the base model as compact .safetensors files (typically 50-200 MB). For inference, these weights are loaded into the pipeline using the load_lora_weights method and can optionally be fused (merged) into the base transformer weights for faster execution without per-layer LoRA overhead.
The inference workflow consists of:
- Load the base pipeline: Instantiate the full CogVideoX pipeline from the pretrained model.
- Load LoRA weights: Attach the trained adapter weights to the pipeline's transformer.
- Optionally fuse weights: Merge LoRA matrices into the base weights for inference speed.
- Generate videos: Run the diffusion sampling loop to produce videos from text prompts.
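The four steps above can be sketched with the diffusers API. This is a minimal sketch, not a verbatim recipe from the repository: the LoRA path, prompt, and sampling parameters are placeholders, and it assumes a CUDA device with diffusers and torch installed.

```python
def generate_with_lora(lora_path, prompt, output_path="output.mp4"):
    # Imports kept inside the function so the sketch reads standalone;
    # requires `torch`, `diffusers`, and a CUDA-capable GPU.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    # 1. Load the base pipeline from the pretrained model.
    pipe = CogVideoXPipeline.from_pretrained(
        "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
    ).to("cuda")

    # 2. Attach the trained adapter weights (lora_path is a placeholder).
    pipe.load_lora_weights(lora_path)

    # 3. Optionally fuse the LoRA matrices into the base weights.
    pipe.fuse_lora(lora_scale=1.0)

    # 4. Run the diffusion sampling loop and export the frames.
    video = pipe(prompt=prompt, num_inference_steps=50,
                 guidance_scale=6.0).frames[0]
    export_to_video(video, output_path, fps=8)
    return output_path
```

For dynamic adapter switching, skip the fuse_lora call (or undo it with unfuse_lora) so adapters can be swapped without reloading the base weights.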
Multiple adapters can also be loaded simultaneously and combined with different scaling factors, enabling compositional adaptation (e.g., combining a style adapter with a subject adapter).
Usage
Use after completing LoRA fine-tuning to generate videos with the adapted model. Fusing is recommended for production inference as it eliminates the per-layer LoRA computation overhead. Unfused mode is preferred when dynamically switching between multiple adapters or when experimenting with different adapter scaling weights.
Theoretical Basis
LoRA Fusion
LoRA fusion merges the low-rank adaptation back into the base weights:
- W' = W + alpha * B * A
where:
- W is the original pretrained weight matrix
- B and A are the low-rank adapter matrices
- alpha is the lora_scale parameter controlling fusion strength (default 1.0)
After fusion, the adapted model has the same computational cost as the base model during inference, since the LoRA matrices have been absorbed into the weight matrices. The lora_scale parameter provides continuous control over the adaptation strength:
- lora_scale=0.0: No adaptation (base model behavior)
- lora_scale=0.5: Half-strength adaptation
- lora_scale=1.0: Full adaptation (default)
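The equivalence between fused and unfused inference can be verified numerically. The sketch below uses toy dimensions and a single linear layer; all shapes and names are illustrative, not taken from the CogVideoX code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: one linear layer with a rank-4 LoRA adapter.
d_out, d_in, rank = 8, 6, 4
W = rng.standard_normal((d_out, d_in))   # pretrained weight matrix
B = rng.standard_normal((d_out, rank))   # LoRA "up" matrix
A = rng.standard_normal((rank, d_in))    # LoRA "down" matrix
x = rng.standard_normal(d_in)            # an input activation

def unfused_forward(x, lora_scale=1.0):
    # Unfused path: base matmul plus the low-rank correction each call.
    return W @ x + lora_scale * (B @ (A @ x))

def fuse(lora_scale=1.0):
    # W' = W + alpha * B @ A: the adapter is absorbed into the weights.
    return W + lora_scale * (B @ A)

# Fused inference costs exactly one matmul, like the base model,
# yet produces the same output as the unfused path.
W_fused = fuse(lora_scale=1.0)
assert np.allclose(W_fused @ x, unfused_forward(x, lora_scale=1.0))

# lora_scale=0.0 recovers base model behavior.
assert np.allclose(fuse(lora_scale=0.0) @ x, W @ x)
```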
Multi-Adapter Composition
Multiple LoRA adapters can be composed additively:
- W' = W + sum_i(alpha_i * B_i * A_i)
This enables combining specialized adapters trained for different aspects (e.g., one for style, one for subject matter). The set_adapters method allows setting per-adapter weights for fine-grained control over the composition.
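The additive composition formula can be sketched the same way. The adapter names and per-adapter weights below are hypothetical; they mirror what set_adapters does per layer inside the pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in = 8, 6
W = rng.standard_normal((d_out, d_in))   # pretrained weight matrix

# Two hypothetical adapters, e.g. one for style (rank 4)
# and one for subject matter (rank 2).
adapters = {
    "style":   (rng.standard_normal((d_out, 4)), rng.standard_normal((4, d_in))),
    "subject": (rng.standard_normal((d_out, 2)), rng.standard_normal((2, d_in))),
}

def compose(weights):
    # W' = W + sum_i(alpha_i * B_i @ A_i)
    W_prime = W.copy()
    for name, alpha in weights.items():
        B, A = adapters[name]
        W_prime += alpha * (B @ A)
    return W_prime

# Full-strength style combined with half-strength subject.
W_mixed = compose({"style": 1.0, "subject": 0.5})
```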
Inference Pipeline
The CogVideoX inference pipeline uses the DPM scheduler for iterative denoising. Starting from pure Gaussian noise, the pipeline:
- Encodes the text prompt via the T5 encoder.
- Iteratively denoises the video latents over a configured number of steps.
- Decodes the final latents back to pixel space via the VAE decoder.
- Exports the frames as a video file.
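The control flow of these four stages can be sketched schematically. The functions below are placeholders standing in for the T5 encoder, the transformer plus DPM scheduler step, and the VAE decoder; shapes and the "denoising" update are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def encode_prompt(prompt):
    # Placeholder for the T5 text encoder.
    return rng.standard_normal(16)

def dpm_step(latents, text_emb, t):
    # Placeholder for one transformer + DPM scheduler update.
    return 0.9 * latents

def vae_decode(latents):
    # Placeholder for the VAE decoder: latents -> pixel-space frames.
    return latents.reshape(4, 2, 2)

def sample(prompt, num_inference_steps=50):
    text_emb = encode_prompt(prompt)
    latents = rng.standard_normal(16)        # start from pure Gaussian noise
    for t in reversed(range(num_inference_steps)):
        latents = dpm_step(latents, text_emb, t)
    return vae_decode(latents)               # frames, ready for video export

frames = sample("a panda riding a bicycle", num_inference_steps=10)
```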