Implementation:Zai_org_CogVideo_CogVLM2_Model_Loading
| Attribute | Value |
|---|---|
| Implementation Name | CogVLM2 Model Loading |
| Workflow | Video Captioning |
| Step | 2 of 5 |
| Type | Wrapper Doc |
| Source File | tools/caption/video_caption.py:L60-69 |
| Repository | zai-org/CogVideo |
| External Dependencies | transformers (AutoModelForCausalLM, AutoTokenizer) |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implementation of CogVLM2 model and tokenizer loading for the video captioning pipeline. The model is loaded from the HuggingFace Hub or a local path with appropriate precision and set to evaluation mode.
Description
The model loading code:
- Loads the tokenizer using `AutoTokenizer.from_pretrained` with `trust_remote_code=True`
- Loads the model using `AutoModelForCausalLM.from_pretrained` with the selected torch dtype
- Sets the model to eval mode with `.eval()`
- Moves the model to the target device with `.to(DEVICE)`
The model path defaults to "THUDM/cogvlm2-llama3-caption", a CogVLM2 variant specifically fine-tuned for video captioning.
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_PATH = "THUDM/cogvlm2-llama3-caption"
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
DEVICE = "cuda"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=TORCH_TYPE, trust_remote_code=True
).eval().to(DEVICE)
```
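The `TORCH_TYPE` one-liner assumes a CUDA GPU is present. A defensive variant can factor the choice into a helper; this is a sketch, and the CPU/float32 fallback branch is an assumption added here, not part of the source script:

```python
import torch

def select_device_and_dtype():
    """Pick the inference device and precision.

    Mirrors the bf16-if-supported-else-fp16 rule from the captioning
    script; the CPU/float32 branch is an added assumption for machines
    without CUDA, where half precision is often slow or unsupported.
    """
    if torch.cuda.is_available():
        dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
        return "cuda", dtype
    return "cpu", torch.float32

DEVICE, TORCH_TYPE = select_device_and_dtype()
```

Calling `torch.cuda.is_bf16_supported()` on a machine without CUDA can fail, which is why the availability check comes first.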
Code Reference
Source Location
| File | Lines | Description |
|---|---|---|
| tools/caption/video_caption.py | L60-69 | Model and tokenizer loading |
Signature
```python
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,              # "THUDM/cogvlm2-llama3-caption"
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,  # bfloat16 or float16
    trust_remote_code=True
).eval().to(DEVICE)
```
Import
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
```
I/O Contract
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| `MODEL_PATH` | `str` | `"THUDM/cogvlm2-llama3-caption"` | HuggingFace model ID or local path to CogVLM2 weights |
| `TORCH_TYPE` | `torch.dtype` | Auto-detected | `torch.bfloat16` if supported, else `torch.float16` |
| `DEVICE` | `str` | `"cuda"` | Target device for model inference |
| `trust_remote_code` | `bool` | `True` | Required for CogVLM2 custom model code |
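These inputs could be surfaced as command-line flags rather than module-level constants. The sketch below is hypothetical: the `build_arg_parser` helper and flag names are not taken from video_caption.py, which may hard-code these values instead.

```python
import argparse

def build_arg_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI exposing the documented inputs with their defaults.
    p = argparse.ArgumentParser(description="Load the CogVLM2 caption model")
    p.add_argument("--model_path", default="THUDM/cogvlm2-llama3-caption",
                   help="HuggingFace model ID or local path to CogVLM2 weights")
    p.add_argument("--device", default="cuda",
                   help="Target device for model inference")
    return p
```

Running with no flags reproduces the defaults from the table above.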
Outputs
| Output | Type | Description |
|---|---|---|
| `tokenizer` | `PreTrainedTokenizer` (via `AutoTokenizer`) | Loaded Llama3-based tokenizer for text encoding/decoding |
| `model` | `PreTrainedModel` (via `AutoModelForCausalLM`) | Loaded CogVLM2 model in eval mode on the target device |
Usage Examples
Example 1: Standard loading
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_PATH = "THUDM/cogvlm2-llama3-caption"
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
DEVICE = "cuda"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=TORCH_TYPE, trust_remote_code=True
).eval().to(DEVICE)
```
Example 2: Loading from local path
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_PATH = "/models/cogvlm2-llama3-caption"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().to("cuda")
```
Example 3: Loading with 4-bit quantization
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

MODEL_PATH = "THUDM/cogvlm2-llama3-caption"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# bitsandbytes places quantized weights on the GPU during loading,
# so no trailing .to(DEVICE) call is needed here.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization_config=quant_config,
).eval()
```
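To see why 4-bit loading helps, a back-of-envelope weight-memory estimate; the 12B parameter count below is illustrative only (check the model card for the real size), and quantization metadata overhead is ignored:

```python
GIB = 1024 ** 3

def weight_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GiB, ignoring quantization overhead."""
    return n_params * bytes_per_param / GIB

# Illustrative parameter count, not the exact CogVLM2 size.
n_params = 12e9
bf16_gib = weight_gib(n_params, 2.0)  # bfloat16: 2 bytes per parameter
int4_gib = weight_gib(n_params, 0.5)  # 4-bit: half a byte per parameter
```

Whatever the true parameter count, 4-bit weights take a quarter of the bfloat16 footprint, which is the point of Example 3.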
Related Pages
- Principle:Zai_org_CogVideo_Caption_Model_Loading -- Principle governing caption model loading
- Environment:Zai_org_CogVideo_Video_Captioning_Environment
- Heuristic:Zai_org_CogVideo_BF16_FP16_Precision_Selection
- Zai_org_CogVideo_Captioning_Requirements_Install -- Previous step: environment setup
- Zai_org_CogVideo_Caption_Load_Video -- Next step: loading video frames for captioning
- Zai_org_CogVideo_CogVLM2_Predict -- Prediction step using the loaded model