Implementation:Zai_org_CogVideo_CogVLM2_Model_Loading
| Attribute | Value |
|---|---|
| Implementation Name | CogVLM2 Model Loading |
| Workflow | Video Captioning |
| Step | 2 of 5 |
| Type | Wrapper Doc |
| Source File | tools/caption/video_caption.py:L60-69 |
| Repository | zai-org/CogVideo |
| External Dependencies | transformers (AutoModelForCausalLM, AutoTokenizer) |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Implementation of CogVLM2 model and tokenizer loading for the video captioning pipeline. The model is loaded from the HuggingFace Hub or a local path with appropriate precision and set to evaluation mode.
Description
The model loading code:
- Loads the tokenizer using `AutoTokenizer.from_pretrained` with `trust_remote_code=True`
- Loads the model using `AutoModelForCausalLM.from_pretrained` with the selected torch dtype
- Sets the model to eval mode with `.eval()`
- Moves the model to the target device with `.to(DEVICE)`
The model path defaults to "THUDM/cogvlm2-llama3-caption", a CogVLM2 variant specifically fine-tuned for video captioning.
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_PATH = "THUDM/cogvlm2-llama3-caption"
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
DEVICE = "cuda"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=TORCH_TYPE, trust_remote_code=True
).eval().to(DEVICE)
```
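The `TORCH_TYPE` one-liner assumes a CUDA GPU is present. A defensive variant can factor the choice into a helper; this is a sketch, and the CPU/float32 fallback branch is an assumption added here, not part of the source script:

```python
import torch

def select_device_and_dtype():
    """Pick the inference device and precision.

    Mirrors the bf16-if-supported-else-fp16 rule from the captioning
    script; the CPU/float32 branch is an added assumption for machines
    without CUDA, where half precision is often slow or unsupported.
    """
    if torch.cuda.is_available():
        dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
        return "cuda", dtype
    return "cpu", torch.float32

DEVICE, TORCH_TYPE = select_device_and_dtype()
```

Calling `torch.cuda.is_bf16_supported()` on a machine without CUDA can fail, which is why the availability check comes first.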
Code Reference
Source Location
| File | Lines | Description |
|---|---|---|
| tools/caption/video_caption.py | L60-69 | Model and tokenizer loading |
Signature
```python
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,              # "THUDM/cogvlm2-llama3-caption"
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,  # bfloat16 or float16
    trust_remote_code=True
).eval().to(DEVICE)
```
Import
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
```
I/O Contract
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| `MODEL_PATH` | `str` | `"THUDM/cogvlm2-llama3-caption"` | HuggingFace model ID or local path to CogVLM2 weights |
| `TORCH_TYPE` | `torch.dtype` | Auto-detected | `torch.bfloat16` if supported, else `torch.float16` |
| `DEVICE` | `str` | `"cuda"` | Target device for model inference |
| `trust_remote_code` | `bool` | `True` | Required for CogVLM2 custom model code |
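These inputs could be surfaced as command-line flags rather than module-level constants. The sketch below is hypothetical: the `build_arg_parser` helper and flag names are not taken from video_caption.py, which may hard-code these values instead.

```python
import argparse

def build_arg_parser() -> argparse.ArgumentParser:
    # Hypothetical CLI exposing the documented inputs with their defaults.
    p = argparse.ArgumentParser(description="Load the CogVLM2 caption model")
    p.add_argument("--model_path", default="THUDM/cogvlm2-llama3-caption",
                   help="HuggingFace model ID or local path to CogVLM2 weights")
    p.add_argument("--device", default="cuda",
                   help="Target device for model inference")
    return p
```

Running with no flags reproduces the defaults from the table above.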
Outputs
| Output | Type | Description |
|---|---|---|
| `tokenizer` | `PreTrainedTokenizer` (via `AutoTokenizer`) | Loaded Llama3-based tokenizer for text encoding/decoding |
| `model` | `PreTrainedModel` (via `AutoModelForCausalLM`) | Loaded CogVLM2 model in eval mode on the target device |
Usage Examples
Example 1: Standard loading
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_PATH = "THUDM/cogvlm2-llama3-caption"
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
DEVICE = "cuda"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=TORCH_TYPE, trust_remote_code=True
).eval().to(DEVICE)
```
Example 2: Loading from local path
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_PATH = "/models/cogvlm2-llama3-caption"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().to("cuda")
```
Example 3: Loading with 4-bit quantization
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

MODEL_PATH = "THUDM/cogvlm2-llama3-caption"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# bitsandbytes places quantized weights on the GPU during loading,
# so no trailing .to(DEVICE) call is needed here.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization_config=quant_config,
).eval()
```
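To see why 4-bit loading helps, a back-of-envelope weight-memory estimate; the 12B parameter count below is illustrative only (check the model card for the real size), and quantization metadata overhead is ignored:

```python
GIB = 1024 ** 3

def weight_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GiB, ignoring quantization overhead."""
    return n_params * bytes_per_param / GIB

# Illustrative parameter count, not the exact CogVLM2 size.
n_params = 12e9
bf16_gib = weight_gib(n_params, 2.0)  # bfloat16: 2 bytes per parameter
int4_gib = weight_gib(n_params, 0.5)  # 4-bit: half a byte per parameter
```

Whatever the true parameter count, 4-bit weights take a quarter of the bfloat16 footprint, which is the point of Example 3.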
Related Pages
- Principle:Zai_org_CogVideo_Caption_Model_Loading -- Principle governing caption model loading
- Environment:Zai_org_CogVideo_Video_Captioning_Environment
- Heuristic:Zai_org_CogVideo_BF16_FP16_Precision_Selection
- Zai_org_CogVideo_Captioning_Requirements_Install -- Previous step: environment setup
- Zai_org_CogVideo_Caption_Load_Video -- Next step: loading video frames for captioning
- Zai_org_CogVideo_CogVLM2_Predict -- Prediction step using the loaded model