Principle: Zai org CogVideo LoRA Export and Inference
| Principle Metadata | |
|---|---|
| Name | LoRA_Export_and_Inference |
| Category | Inference |
| Domains | Video_Generation, Fine_Tuning, Diffusion_Models |
| Knowledge Sources | CogVideo Repository, CogVideoX Paper, LoRA Paper |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
LoRA Export and Inference is a technique for loading trained LoRA adapter weights into a base model pipeline and fusing them for efficient inference.
Description
After LoRA fine-tuning, adapter weights are saved separately from the base model as compact .safetensors files (typically 50-200 MB). For inference, these weights are loaded into the pipeline using the load_lora_weights method and can optionally be fused (merged) into the base transformer weights for faster execution without per-layer LoRA overhead.
The inference workflow consists of:
- Load the base pipeline: Instantiate the full CogVideoX pipeline from the pretrained model.
- Load LoRA weights: Attach the trained adapter weights to the pipeline's transformer.
- Optionally fuse weights: Merge LoRA matrices into the base weights for inference speed.
- Generate videos: Run the diffusion sampling loop to produce videos from text prompts.
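The four steps above can be sketched with the diffusers API. This is a minimal sketch, not a verbatim recipe from the repository: the LoRA path, prompt, and sampling parameters are placeholders, and it assumes a CUDA device with diffusers and torch installed.

```python
def generate_with_lora(lora_path, prompt, output_path="output.mp4"):
    # Imports kept inside the function so the sketch reads standalone;
    # requires `torch`, `diffusers`, and a CUDA-capable GPU.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    # 1. Load the base pipeline from the pretrained model.
    pipe = CogVideoXPipeline.from_pretrained(
        "THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16
    ).to("cuda")

    # 2. Attach the trained adapter weights (lora_path is a placeholder).
    pipe.load_lora_weights(lora_path)

    # 3. Optionally fuse the LoRA matrices into the base weights.
    pipe.fuse_lora(lora_scale=1.0)

    # 4. Run the diffusion sampling loop and export the frames.
    video = pipe(prompt=prompt, num_inference_steps=50,
                 guidance_scale=6.0).frames[0]
    export_to_video(video, output_path, fps=8)
    return output_path
```

For dynamic adapter switching, skip the fuse_lora call (or undo it with unfuse_lora) so adapters can be swapped without reloading the base weights.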
Multiple adapters can also be loaded simultaneously and combined with different scaling factors, enabling compositional adaptation (e.g., combining a style adapter with a subject adapter).
Usage
Use after completing LoRA fine-tuning to generate videos with the adapted model. Fusing is recommended for production inference as it eliminates the per-layer LoRA computation overhead. Unfused mode is preferred when dynamically switching between multiple adapters or when experimenting with different adapter scaling weights.
Theoretical Basis
LoRA Fusion
LoRA fusion merges the low-rank adaptation back into the base weights:
- W' = W + alpha * B * A
where:
- W is the original pretrained weight matrix
- B and A are the low-rank adapter matrices
- alpha is the lora_scale parameter controlling fusion strength (default 1.0)
After fusion, the adapted model has the same computational cost as the base model during inference, since the LoRA matrices have been absorbed into the weight matrices. The lora_scale parameter provides continuous control over the adaptation strength:
- lora_scale=0.0: No adaptation (base model behavior)
- lora_scale=0.5: Half-strength adaptation
- lora_scale=1.0: Full adaptation (default)
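The equivalence between fused and unfused inference can be verified numerically. The sketch below uses toy dimensions and a single linear layer; all shapes and names are illustrative, not taken from the CogVideoX code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: one linear layer with a rank-4 LoRA adapter.
d_out, d_in, rank = 8, 6, 4
W = rng.standard_normal((d_out, d_in))   # pretrained weight matrix
B = rng.standard_normal((d_out, rank))   # LoRA "up" matrix
A = rng.standard_normal((rank, d_in))    # LoRA "down" matrix
x = rng.standard_normal(d_in)            # an input activation

def unfused_forward(x, lora_scale=1.0):
    # Unfused path: base matmul plus the low-rank correction each call.
    return W @ x + lora_scale * (B @ (A @ x))

def fuse(lora_scale=1.0):
    # W' = W + alpha * B @ A: the adapter is absorbed into the weights.
    return W + lora_scale * (B @ A)

# Fused inference costs exactly one matmul, like the base model,
# yet produces the same output as the unfused path.
W_fused = fuse(lora_scale=1.0)
assert np.allclose(W_fused @ x, unfused_forward(x, lora_scale=1.0))

# lora_scale=0.0 recovers base model behavior.
assert np.allclose(fuse(lora_scale=0.0) @ x, W @ x)
```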
Multi-Adapter Composition
Multiple LoRA adapters can be composed additively:
- W' = W + sum_i(alpha_i * B_i * A_i)
This enables combining specialized adapters trained for different aspects (e.g., one for style, one for subject matter). The set_adapters method allows setting per-adapter weights for fine-grained control over the composition.
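The additive composition formula can be sketched the same way. The adapter names and per-adapter weights below are hypothetical; they mirror what set_adapters does per layer inside the pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in = 8, 6
W = rng.standard_normal((d_out, d_in))   # pretrained weight matrix

# Two hypothetical adapters, e.g. one for style (rank 4)
# and one for subject matter (rank 2).
adapters = {
    "style":   (rng.standard_normal((d_out, 4)), rng.standard_normal((4, d_in))),
    "subject": (rng.standard_normal((d_out, 2)), rng.standard_normal((2, d_in))),
}

def compose(weights):
    # W' = W + sum_i(alpha_i * B_i @ A_i)
    W_prime = W.copy()
    for name, alpha in weights.items():
        B, A = adapters[name]
        W_prime += alpha * (B @ A)
    return W_prime

# Full-strength style combined with half-strength subject.
W_mixed = compose({"style": 1.0, "subject": 0.5})
```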
Inference Pipeline
The CogVideoX inference pipeline uses the DPM scheduler for iterative denoising. Starting from pure Gaussian noise, the pipeline:
- Encodes the text prompt via the T5 encoder.
- Iteratively denoises the video latents over a configured number of steps.
- Decodes the final latents back to pixel space via the VAE decoder.
- Exports the frames as a video file.
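The control flow of these four stages can be sketched schematically. The functions below are placeholders standing in for the T5 encoder, the transformer plus DPM scheduler step, and the VAE decoder; shapes and the "denoising" update are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def encode_prompt(prompt):
    # Placeholder for the T5 text encoder.
    return rng.standard_normal(16)

def dpm_step(latents, text_emb, t):
    # Placeholder for one transformer + DPM scheduler update.
    return 0.9 * latents

def vae_decode(latents):
    # Placeholder for the VAE decoder: latents -> pixel-space frames.
    return latents.reshape(4, 2, 2)

def sample(prompt, num_inference_steps=50):
    text_emb = encode_prompt(prompt)
    latents = rng.standard_normal(16)        # start from pure Gaussian noise
    for t in reversed(range(num_inference_steps)):
        latents = dpm_step(latents, text_emb, t)
    return vae_decode(latents)               # frames, ready for video export

frames = sample("a panda riding a bicycle", num_inference_steps=10)
```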