Workflow:Zai org CogVideo Diffusers LoRA Finetuning
| Knowledge Sources | |
|---|---|
| Domains | Video_Generation, Fine_Tuning, LoRA |
| Last Updated | 2026-02-10 12:00 GMT |
Overview
End-to-end process for parameter-efficient fine-tuning (LoRA) of CogVideoX text-to-video and image-to-video models using the HuggingFace Diffusers framework with DeepSpeed or DDP distributed training.
Description
This workflow covers the complete procedure for adapting pre-trained CogVideoX video generation models to custom domains or styles using Low-Rank Adaptation (LoRA). It uses the Diffusers-based fine-tuning pipeline with Pydantic-validated configuration, HuggingFace Accelerate for distributed training orchestration, and optional DeepSpeed ZeRO Stage 2/3 for memory-efficient training. The pipeline supports all CogVideoX model variants (2B, 5B, 1.5-5B) for both text-to-video (T2V) and image-to-video (I2V) tasks. With LoRA, training can run on consumer GPUs with as little as 16GB of VRAM, whereas full supervised fine-tuning (SFT) requires significantly more memory.
Usage
Execute this workflow when you have a collection of video clips (with corresponding text captions) and need to adapt a CogVideoX model to generate videos in a specific style, domain, or subject. This is the recommended fine-tuning approach when GPU memory is limited (under 80GB VRAM) or when you want to preserve the base model's general capabilities while adding specialized behavior.
Execution Steps
Step 1: Dataset Preparation
Organize training data into the expected directory structure. Create a data root directory containing video files and two text index files: one listing video file paths (one per line) and one listing corresponding text captions (one per line). For I2V tasks, also extract first frames from each video using the provided extraction script. Videos should match the target resolution specified during training (e.g., 81x768x1360 for CogVideoX1.5, 49x480x720 for CogVideoX).
Key considerations:
- Video frame count must follow the 8N+1 rule (e.g., 49 or 81 frames)
- Captions and video paths must be aligned line-by-line
- For I2V, first frames are extracted using the provided `extract_images.py` script
- Supported video formats include MP4
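The alignment and frame-count rules above can be checked before launching a run. The following is a minimal sketch, not the pipeline's own loader: the index file names `videos.txt` and `prompts.txt` and the helper names are assumptions chosen for illustration.

```python
from pathlib import Path


def is_valid_frame_count(num_frames: int) -> bool:
    """Return True when num_frames has the form 8N + 1 for an integer N >= 0."""
    return num_frames >= 1 and (num_frames - 1) % 8 == 0


def read_index(path: Path) -> list[str]:
    """Read an index file with one entry per line, keeping line order."""
    return [line.strip() for line in path.read_text(encoding="utf-8").splitlines()]


def check_dataset(
    data_root: Path,
    video_index: str = "videos.txt",     # hypothetical file name
    caption_index: str = "prompts.txt",  # hypothetical file name
) -> list[tuple[Path, str]]:
    """Pair each video path with the caption on the same line number."""
    videos = read_index(data_root / video_index)
    captions = read_index(data_root / caption_index)
    if len(videos) != len(captions):
        raise ValueError(
            f"{len(videos)} video paths but {len(captions)} captions; "
            "the two index files must align line by line"
        )
    missing = [v for v in videos if not (data_root / v).is_file()]
    if missing:
        raise FileNotFoundError(f"video files not found under data root: {missing}")
    return [(data_root / v, c) for v, c in zip(videos, captions)]
```

Calling `check_dataset(Path("/path/to/data_root"))` returns the (video, caption) pairs or raises with a message that names the mismatch. Frame counts read from the videos by any decoder can be passed to `is_valid_frame_count`.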
Step 2: Configuration
Define training parameters through command-line arguments or a shell launch script. Configuration covers five categories: model settings (model path, model name, training type), output settings (output directory, logging backend), data settings (data root, resolution), training hyperparameters (epochs, batch size, learning rate, mixed precision), and checkpointing/validation settings. The Pydantic-based Args schema validates all parameters before training begins.
Key considerations:
- Select the correct model_name for your variant (e.g., "cogvideox1.5-t2v", "cogvideox-t2v", "cogvideox-i2v")
- Set training_type to "lora" for parameter-efficient fine-tuning
- Only CogVideoX-2B supports fp16; all others require bf16 mixed precision
- Resolution must match the model variant's expected input dimensions
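To illustrate the kind of up-front validation described above, here is a dependency-free sketch that uses a standard-library dataclass as a stand-in for the Pydantic-based Args schema. The class name `TrainArgs`, its field names, the allowed-value sets, and the rule that a 2B checkpoint is recognized by "2b" in the model path are all assumptions made for illustration; the real schema may differ.

```python
from dataclasses import dataclass

ALLOWED_TRAINING_TYPES = {"lora", "sft"}  # assumed set of training types
ALLOWED_PRECISIONS = {"bf16", "fp16"}


@dataclass(frozen=True)
class TrainArgs:
    """Hypothetical stand-in for the pipeline's validated argument schema."""

    model_path: str
    model_name: str
    training_type: str
    mixed_precision: str
    train_resolution: tuple[int, int, int]  # (frames, height, width)

    def __post_init__(self) -> None:
        if self.training_type not in ALLOWED_TRAINING_TYPES:
            raise ValueError(f"unknown training_type: {self.training_type!r}")
        if self.mixed_precision not in ALLOWED_PRECISIONS:
            raise ValueError(f"unknown mixed_precision: {self.mixed_precision!r}")
        frames, _height, _width = self.train_resolution
        if (frames - 1) % 8 != 0:
            raise ValueError(f"frame count {frames} does not follow the 8N+1 rule")
        # Assumption: the 2B variant can be spotted from the checkpoint path.
        is_2b = "2b" in self.model_path.lower()
        if self.mixed_precision == "fp16" and not is_2b:
            raise ValueError("fp16 is only supported for CogVideoX-2B; use bf16")
```

For example, `TrainArgs("THUDM/CogVideoX1.5-5B", "cogvideox1.5-t2v", "lora", "bf16", (81, 768, 1360))` constructs successfully, while the same call with `"fp16"` fails before any model weights are loaded.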
Step 3: Model Loading and LoRA Injection
Load the pre-trained CogVideoX pipeline components: the 3D VAE (AutoencoderKLCogVideoX), the T5-XXL text encoder, and the CogVideoX transformer. Freeze all base model parameters, then inject low-rank adapter matrices into the transformer's attention and feedforward layers using the PEFT library. Only the LoRA adapter weights (typically less than 1% of total parameters) will be trained.
Key considerations:
- The VAE and text encoder remain frozen throughout training
- LoRA rank and alpha are configurable (default rank 128)
- Target modules for LoRA injection are specified in the configuration
- The model is loaded with the specified precision (bf16 or fp16)
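The idea behind the adapter injection can be shown without any deep learning library. The sketch below is a toy pure-Python layer, not the PEFT implementation: it keeps a frozen weight matrix W and adds a trainable low-rank update scaled by alpha / rank, with B initialized to zero as in the usual LoRA setup so the adapted layer initially reproduces the base layer. The class and helper names are hypothetical.

```python
import random


def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Toy linear layer computing y = W x + (alpha / rank) * B (A x)."""

    def __init__(self, weight: list[list[float]], rank: int, alpha: float, seed: int = 0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen base weight, never updated
        self.scale = alpha / rank
        # A starts with small random values, B with zeros, so the update is zero at init.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x: list[float]) -> list[float]:
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self) -> int:
        """Count adapter parameters: rank * d_in for A plus d_out * rank for B."""
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)
```

Only `A` and `B` would receive gradients during training. The share of trainable parameters in a real model depends on the rank and on which modules are targeted.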
Step 4: Distributed Training Setup
Initialize the distributed training environment using HuggingFace Accelerate. This handles DDP (DistributedDataParallel) or DeepSpeed ZeRO Stage 2/3 configuration, gradient accumulation, mixed precision training, and multi-GPU coordination. The Accelerator wraps the model, optimizer, and data loader for seamless distributed execution.
Key considerations:
- DDP is used for standard multi-GPU training via `accelerate launch`
- DeepSpeed ZeRO 2/3 enables training larger models with limited VRAM
- Gradient accumulation increases the effective batch size (per-device batch size × accumulation steps × number of processes) without increasing per-step memory
- NCCL timeout should be set appropriately for large models (default 1800s)
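The relationship between per-device batch size, gradient accumulation, and GPU count is simple arithmetic. A small sketch follows; the function names are hypothetical, and the step count assumes the last partial batch is kept, which depends on the data loader settings in a real run.

```python
import math


def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_processes: int) -> int:
    """Samples contributing to each optimizer step across all processes."""
    return per_device_batch * grad_accum_steps * num_processes


def optimizer_steps_per_epoch(
    num_samples: int, per_device_batch: int, grad_accum_steps: int, num_processes: int
) -> int:
    """Approximate optimizer steps per epoch, assuming partial batches are kept."""
    batches_per_process = math.ceil(num_samples / (per_device_batch * num_processes))
    return math.ceil(batches_per_process / grad_accum_steps)
```

For instance, a per-device batch size of 1 with 4 accumulation steps on 8 GPUs behaves like a batch of 32 for the optimizer, and a 100-clip dataset then yields roughly 4 optimizer steps per epoch.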
Step 5: Training Loop
Execute the training loop over the configured number of epochs. For each batch: encode videos to latent space using the frozen VAE, encode text captions using the frozen T5 encoder, sample random noise and timesteps, predict the noise using the transformer with LoRA adapters, and compute the diffusion training loss. Gradients are accumulated and optimizer steps are taken according to the configured schedule.
Key considerations:
- VAE encoding is done with tiling and slicing for memory efficiency
- Text encoding is cached where possible to avoid redundant computation
- The training loss is the standard diffusion denoising objective
- Learning rate follows a configurable schedule (cosine, constant, etc.)
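The per-batch steps above (sample noise and a timestep, noise the latents, predict, compute the loss) can be sketched on plain lists. This toy version follows the noise-prediction form described in this step; the actual trainer operates on video latents and may parameterize the target differently. The names `add_noise`, `mse`, `training_step`, and the `alpha_bars` schedule are assumptions for illustration.

```python
import math
import random
from typing import Callable


def add_noise(x0: list[float], noise: list[float], alpha_bar: float) -> list[float]:
    """Forward diffusion: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, noise)]


def mse(pred: list[float], target: list[float]) -> float:
    """Mean squared error between a prediction and its target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def training_step(
    x0: list[float],
    alpha_bars: list[float],
    predict_noise: Callable[[list[float], int], list[float]],
    rng: random.Random,
) -> float:
    """One denoising step: random timestep, random noise, loss on the predicted noise."""
    t = rng.randrange(len(alpha_bars))
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = add_noise(x0, noise, alpha_bars[t])
    return mse(predict_noise(x_t, t), noise)
```

In the real loop, `x0` would be the VAE latents of a video batch and `predict_noise` the LoRA-adapted transformer conditioned on the text embeddings; the returned loss is what gets backpropagated into the adapter weights.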
Step 6: Checkpointing and Validation
Save model checkpoints at configured intervals. Checkpoints include only the LoRA adapter weights (not the full model), keeping storage requirements minimal. Optionally run validation at checkpoint intervals by generating sample videos from validation prompts and logging them to TensorBoard or Weights & Biases.
Key considerations:
- Checkpoint limit controls the maximum number of saved checkpoints (oldest are deleted)
- Training can be resumed from any saved checkpoint
- Validation generates actual video samples to visually assess quality
- LoRA weights are saved in safetensors format
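The checkpoint limit amounts to a rotation policy: keep the newest N checkpoints and delete the rest. Below is a minimal sketch of such a policy, assuming checkpoints are directories named `checkpoint-<step>` under the output directory. That naming scheme and the function names are assumptions, not a description of the trainer's actual layout.

```python
import shutil
from pathlib import Path


def checkpoint_step(path: Path) -> int:
    """Extract the training step from a directory named 'checkpoint-<step>'."""
    return int(path.name.split("-")[-1])


def prune_checkpoints(output_dir: Path, limit: int) -> list[Path]:
    """Delete the oldest checkpoints so that at most `limit` remain."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    # Sort numerically by step; a plain string sort would place 1000 before 200.
    checkpoints = sorted(
        (p for p in output_dir.glob("checkpoint-*") if p.is_dir()),
        key=checkpoint_step,
    )
    stale = checkpoints[:-limit]
    for path in stale:
        shutil.rmtree(path)
    return stale
```

Calling `prune_checkpoints(Path("outputs"), limit=2)` after each save would keep only the two most recent checkpoints on disk.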
Step 7: Export and Inference
After training completes, the LoRA adapter weights can be loaded into the base CogVideoX pipeline for inference. Load the base model, apply the LoRA weights using `load_lora_weights`, optionally fuse the LoRA layers into the base model for faster inference, and generate videos using the standard Diffusers pipeline.
Key considerations:
- LoRA weights are small and portable (typically a few hundred MB)
- Weights can be fused into the base model for faster inference
- The same inference pipeline supports T2V, I2V, and V2V generation
- CPU offloading and VAE tiling can be used to reduce inference memory
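The inference steps above might look roughly like the following for a text-to-video model. This is a sketch under assumptions rather than the project's own inference script: the function name, its parameters, and the sampling settings (50 steps, guidance scale 6.0, 8 fps) are illustrative, and the heavy libraries are imported inside the function so that the definition can be loaded on a machine without a GPU stack.

```python
def generate_with_lora(
    base_model: str,
    lora_dir: str,
    prompt: str,
    output_path: str = "output.mp4",
    fuse: bool = True,
) -> str:
    """Load a base CogVideoX pipeline, apply LoRA weights, and write one video."""
    # Imported lazily: these require torch and diffusers to be installed.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    pipe = CogVideoXPipeline.from_pretrained(base_model, torch_dtype=torch.bfloat16)
    pipe.load_lora_weights(lora_dir)
    if fuse:
        # Merge the adapter into the base weights for faster inference.
        pipe.fuse_lora(lora_scale=1.0)

    # Memory savers: offload idle submodules to CPU and decode the VAE in tiles.
    pipe.enable_model_cpu_offload()
    pipe.vae.enable_tiling()
    pipe.vae.enable_slicing()

    # Illustrative sampling settings; tune them for the model variant in use.
    frames = pipe(prompt=prompt, num_inference_steps=50, guidance_scale=6.0).frames[0]
    export_to_video(frames, output_path, fps=8)
    return output_path
```

Image-to-video and video-to-video generation would use the corresponding pipeline classes with an extra image or video input. The bf16 dtype here should be swapped for fp16 when the base model is CogVideoX-2B.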