
Implementation:Zai org CogVideo CogVLM2 Model Loading

From Leeroopedia


Attribute Value
Implementation Name CogVLM2 Model Loading
Workflow Video Captioning
Step 2 of 5
Type Wrapper Doc
Source File tools/caption/video_caption.py:L60-69
Repository zai-org/CogVideo
External Dependencies transformers (AutoModelForCausalLM, AutoTokenizer)
Last Updated 2026-02-10 00:00 GMT

Overview

Loads the CogVLM2 model and tokenizer for the video captioning pipeline. The model is loaded from the Hugging Face Hub or a local path with the appropriate precision, then set to evaluation mode.

Description

The model loading code:

  1. Loads the tokenizer using AutoTokenizer.from_pretrained with trust_remote_code=True
  2. Loads the model using AutoModelForCausalLM.from_pretrained with the selected torch dtype
  3. Sets the model to eval mode with .eval()
  4. Moves the model to the target device with .to(DEVICE)

The model path defaults to "THUDM/cogvlm2-llama3-caption", a CogVLM2 variant specifically fine-tuned for video captioning.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_PATH = "THUDM/cogvlm2-llama3-caption"
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
DEVICE = "cuda"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=TORCH_TYPE, trust_remote_code=True
).eval().to(DEVICE)
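The snippet above assumes a CUDA device is present. On machines without a GPU (or without bf16 support) the same selection logic can be made defensive; the following is an illustrative variant, not code from the repository:

```python
import torch

# Defensive device/dtype selection (illustrative; the repository assumes CUDA).
if torch.cuda.is_available():
    DEVICE = "cuda"
    # Prefer bfloat16 on GPUs that support it, otherwise fall back to float16.
    TORCH_TYPE = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
else:
    DEVICE = "cpu"
    TORCH_TYPE = torch.float32  # half precision is poorly supported on most CPUs

print(DEVICE, TORCH_TYPE)
```

The resulting DEVICE and TORCH_TYPE can be passed to the loading calls shown above unchanged; note that CPU inference with a model of this size will be very slow.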

Code Reference

Source Location

File Lines Description
tools/caption/video_caption.py L60-69 Model and tokenizer loading

Signature

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_PATH,  # "THUDM/cogvlm2-llama3-caption"
    trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,  # bfloat16 or float16
    trust_remote_code=True
).eval().to(DEVICE)

Import

from transformers import AutoModelForCausalLM, AutoTokenizer

I/O Contract

Inputs

Parameter Type Default Description
MODEL_PATH str "THUDM/cogvlm2-llama3-caption" HuggingFace model ID or local path to CogVLM2 weights
TORCH_TYPE torch.dtype Auto-detected torch.bfloat16 if supported, else torch.float16
DEVICE str "cuda" Target device for model inference
trust_remote_code bool True Required for CogVLM2 custom model code

Outputs

Output Type Description
tokenizer AutoTokenizer Loaded Llama3-based tokenizer for text encoding/decoding
model AutoModelForCausalLM Loaded CogVLM2 model in eval mode on target device

Usage Examples

Example 1: Standard loading

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_PATH = "THUDM/cogvlm2-llama3-caption"
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
DEVICE = "cuda"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=TORCH_TYPE, trust_remote_code=True
).eval().to(DEVICE)

Example 2: Loading from local path

MODEL_PATH = "/models/cogvlm2-llama3-caption"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().to("cuda")

Example 3: Loading with 4-bit quantization

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL_PATH = "THUDM/cogvlm2-llama3-caption"
TORCH_TYPE = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
    quantization_config=quant_config,
).eval()  # bitsandbytes places quantized weights on the GPU; do not call .to(DEVICE)
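As a rough sanity check on why 4-bit loading helps, weight memory scales linearly with bits per parameter. The parameter count below is a placeholder for illustration, not the measured CogVLM2 size (check the model card):

```python
def approx_weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Weight-only footprint in GiB; ignores activations, KV cache, and quantization overhead."""
    return num_params * bits_per_param / 8 / 2**30

# Hypothetical 8-billion-parameter model (illustrative only).
N = 8e9
print(f"bf16: {approx_weight_memory_gib(N, 16):.1f} GiB")  # 16-bit weights
print(f"int4: {approx_weight_memory_gib(N, 4):.1f} GiB")   # 4-bit weights, ~4x smaller
```

In practice the savings are somewhat less than 4x because some layers (e.g. embeddings and norms) typically remain in higher precision.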
