Implementation:NVIDIA NeMo Curator QwenVL

Knowledge Sources	NVIDIA NeMo Curator
Domains	Machine Learning, Computer Vision, Video Captioning
Last Updated	2026-02-14 00:00 GMT

Overview

Wraps the Qwen2.5-VL-7B-Instruct vision-language model for video captioning via vLLM, with support for optional two-stage caption generation.

Description

The QwenVL class implements ModelInterface and provides a multimodal inference wrapper around the Qwen/Qwen2.5-VL-7B-Instruct model (revision cc59489) using the vLLM inference engine.

On setup(), it initializes a vLLM LLM instance configured for multimodal input (one video per prompt), with a maximum model length of 32768 tokens and 85% GPU memory utilization. FP8 quantization is optional, preprocessing can be delegated to the model (via model_does_preprocess), and the multimodal processor cache (mm_processor_cache_gb) can be disabled. Sampling parameters are set conservatively, which keeps decoding close to greedy: temperature=0.1, top_p=0.001, repetition_penalty=1.05.
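The settings listed above can be collected as plain keyword arguments. The following is a minimal sketch, not the class's actual code: build_engine_kwargs and build_sampling_kwargs are hypothetical helpers, the key names follow vLLM's engine and sampling arguments, the model path and the preprocessing options are left out, and mapping a disabled cache to 0 GB is an assumption.

```python
from typing import Any


def build_engine_kwargs(fp8: bool = True, disable_mmcache: bool = False) -> dict[str, Any]:
    """Collect engine settings matching the values described on this page.

    Hypothetical helper: how QwenVL assembles these internally is an assumption.
    """
    return {
        "limit_mm_per_prompt": {"video": 1},      # one video per prompt
        "quantization": "fp8" if fp8 else None,   # optional FP8
        "max_model_len": 32768,
        "max_num_batched_tokens": 32768,
        "gpu_memory_utilization": 0.85,
        # Assumption: 0 GB turns the multimodal processor cache off.
        "mm_processor_cache_gb": 0 if disable_mmcache else 4,
    }


def build_sampling_kwargs(max_output_tokens: int = 512) -> dict[str, Any]:
    """Collect the conservative sampling settings described on this page."""
    return {
        "temperature": 0.1,
        "top_p": 0.001,
        "repetition_penalty": 1.05,
        "max_tokens": max_output_tokens,
    }
```

Keeping the values in dictionaries makes it easy to inspect or log the configuration before any GPU memory is allocated.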

The generate method processes batches of video inputs using grouping.split_by_chunk_size for memory-efficient batching. It supports an optional two-stage captioning workflow:

Stage 1: Generates an initial caption from the video input
Stage 2: Appends the stage 1 caption to the stage2_prompt text, substitutes the combined text into the original prompt (via regex pattern matching), and re-generates for an enhanced, more detailed caption
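The batching step mentioned above can be illustrated with a small chunking generator. This is a sketch only: chunked is a hypothetical stand-in for grouping.split_by_chunk_size, whose exact signature and behavior are not shown on this page.

```python
from collections.abc import Iterable, Iterator
from itertools import islice
from typing import TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], chunk_size: int) -> Iterator[list[T]]:
    """Yield consecutive lists of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(items)
    # islice pulls the next chunk_size items; an empty list ends the loop.
    while chunk := list(islice(iterator, chunk_size)):
        yield chunk


# Five inputs with a batch size of 2 give batches of 2, 2, and 1.
batches = list(chunked(range(5), 2))
```

Processing one chunk at a time bounds how many videos are handed to the engine in a single call, which is what keeps memory use predictable.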

The model variant system maps string keys to HuggingFace model IDs (currently only "qwen" maps to Qwen/Qwen2.5-VL-7B-Instruct). If vLLM is not installed, dummy classes are provided for type compatibility, and setup() raises an ImportError at runtime.
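A variant lookup and an availability check along these lines can be sketched as follows. The names resolve_model_id and require_vllm are hypothetical, raising ValueError for an unknown variant is an assumption, and the check uses importlib.util.find_spec rather than the dummy-class fallback described above.

```python
import importlib.util

# Variant key -> HuggingFace model ID (only "qwen" is documented).
_QWEN_VARIANTS = {"qwen": "Qwen/Qwen2.5-VL-7B-Instruct"}


def resolve_model_id(model_variant: str) -> str:
    """Map a variant key to its model ID, failing loudly on unknown keys."""
    try:
        return _QWEN_VARIANTS[model_variant]
    except KeyError:
        supported = ", ".join(sorted(_QWEN_VARIANTS))
        raise ValueError(
            f"Unknown model variant {model_variant!r}; supported: {supported}"
        ) from None


def require_vllm() -> None:
    """Raise ImportError if vLLM cannot be found, without importing it."""
    if importlib.util.find_spec("vllm") is None:
        raise ImportError("vLLM is required for QwenVL but is not installed")
```

Deferring the failure to setup time, as the page describes, lets modules that only reference the class for typing import cleanly on machines without vLLM.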

Usage

Use QwenVL as the primary model for video captioning in the NeMo Curator data curation pipeline. It generates natural language descriptions of video content, with the two-stage approach enabling richer, more detailed captions suitable for training data annotation.

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/models/qwen_vl.py
Lines: 1-157

Signature

class QwenVL(ModelInterface):
    def __init__(
        self,
        model_dir: str,
        model_variant: str,
        caption_batch_size: int,
        fp8: bool = True,
        max_output_tokens: int = 512,
        model_does_preprocess: bool = False,
        disable_mmcache: bool = False,
        stage2_prompt_text: str | None = None,
        verbose: bool = False,
    ): ...
    @property
    def model_id_names(self) -> list[str]: ...
    def setup(self) -> None: ...
    def generate(
        self,
        videos: list[dict[str, Any]],
        generate_stage2_caption: bool = False,
        batch_size: int = 16,
    ) -> list[str]: ...
    @classmethod
    def download_weights_on_node(cls, model_dir: str) -> None: ...

Import

from nemo_curator.models.qwen_vl import QwenVL

I/O Contract

Inputs (Constructor)

Name	Type	Required	Description
model_dir	str	Yes	Path to the directory where model weights are stored or will be downloaded
model_variant	str	Yes	Model variant key (currently only "qwen" is supported)
caption_batch_size	int	Yes	Batch size for caption generation
fp8	bool	No	Whether to use FP8 quantization (default: True)
max_output_tokens	int	No	Maximum number of tokens to generate per input (default: 512)
model_does_preprocess	bool	No	Whether the model handles its own preprocessing (default: False)
disable_mmcache	bool	No	Whether to disable the multimodal processor cache (default: False)
stage2_prompt_text	str or None	No	Prompt text for the second captioning stage; the stage 1 caption is appended to it and the result replaces the "user_prompt" placeholder in the original prompt (default: None)
verbose	bool	No	Enable verbose logging (default: False)

Inputs (generate)

Name	Type	Required	Description
videos	list[dict[str, Any]]	Yes	List of video input dictionaries formatted for vLLM (with "prompt" and "multi_modal_data" keys)
generate_stage2_caption	bool	No	Whether to perform two-stage caption generation (default: False)
batch_size	int	No	Number of videos to process per batch (default: 16)
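Because generate expects each input dictionary to carry specific keys, a caller may want to check inputs before starting a long run. The sketch below is a hypothetical helper, not part of QwenVL; the "video" entry under "multi_modal_data" is taken from the usage example on this page.

```python
from typing import Any

REQUIRED_KEYS = ("prompt", "multi_modal_data")


def check_video_inputs(videos: list[dict[str, Any]]) -> None:
    """Raise KeyError if any input lacks the keys described in the table above."""
    for index, item in enumerate(videos):
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            raise KeyError(f"videos[{index}] is missing keys: {missing}")
        if "video" not in item["multi_modal_data"]:
            raise KeyError(f"videos[{index}]['multi_modal_data'] has no 'video' entry")
```

Failing early on a malformed entry is cheaper than discovering it after the model has already processed several batches.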

Outputs

Name	Type	Description
captions	list[str]	List of generated caption strings, one per input video

Model Configuration

Parameter	Value
Model ID	Qwen/Qwen2.5-VL-7B-Instruct
Revision	cc59489
Temperature	0.1
top_p	0.001
Repetition penalty	1.05
Max model length	32768
GPU memory utilization	0.85
MM processor cache	4 GB (unless disabled)
Max batched tokens	32768
Quantization	Optional FP8 (default: enabled)
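To see why temperature=0.1 combined with top_p=0.001 behaves almost like greedy decoding, consider a textbook version of temperature scaling and nucleus (top-p) filtering. This is an illustrative sketch with made-up logits; vLLM's own sampler may differ in detail.

```python
import math


def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs: list[float], top_p: float) -> list[int]:
    """Indices of the smallest high-probability set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


logits = [2.0, 1.5, 0.5]  # hypothetical scores for three candidate tokens
conservative = nucleus(apply_temperature(logits, 0.1), 0.001)
permissive = nucleus(apply_temperature(logits, 1.0), 0.9)
```

With the configured values only the single most likely token survives the filter (conservative is [0]), whereas a temperature of 1.0 with top_p=0.9 keeps all three candidates in this example.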

Two-Stage Captioning

When generate_stage2_caption=True and a stage2_prompt_text is provided:

The model first generates a caption from the video input (stage 1)
The stage 1 caption is appended to the stage2_prompt_text
The combined text replaces the user_prompt placeholder in the original prompt using a regex pattern (.*)(user_prompt)(.*)
The model generates again with the enriched prompt (stage 2)
Only the stage 2 output is returned

This approach enables the model to first understand the video content, then produce a more detailed and structured caption in the second pass.
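The steps above can be sketched with a stub in place of the model. Everything here is illustrative: rewrite_prompt and two_stage_captions are hypothetical names, generate_fn stands in for the vLLM call, and the re.DOTALL flag and the function-based replacement are assumptions rather than details confirmed by the source.

```python
import re
from typing import Any, Callable

# Pattern quoted on this page; DOTALL (an assumption) lets it span newlines.
PATTERN = re.compile(r"(.*)(user_prompt)(.*)", flags=re.DOTALL)

Inputs = list[dict[str, Any]]


def rewrite_prompt(original_prompt: str, stage2_prompt_text: str, stage1_caption: str) -> str:
    """Replace the user_prompt placeholder with stage 2 text plus the stage 1 caption."""
    combined = stage2_prompt_text + stage1_caption
    # A function replacement avoids backslash-escape issues in caption text.
    return PATTERN.sub(lambda m: m.group(1) + combined + m.group(3), original_prompt, count=1)


def two_stage_captions(
    generate_fn: Callable[[Inputs], list[str]],
    videos: Inputs,
    stage2_prompt_text: str,
) -> list[str]:
    """Run stage 1, rewrite each prompt, run stage 2, and return only stage 2 output."""
    stage1 = generate_fn(videos)
    stage2_inputs = [
        {**video, "prompt": rewrite_prompt(video["prompt"], stage2_prompt_text, caption)}
        for video, caption in zip(videos, stage1)
    ]
    return generate_fn(stage2_inputs)
```

For example, rewrite_prompt("A user_prompt B", "Refine: ", "a cat") yields "A Refine: a cat B". The video payload is carried over unchanged, so the second pass sees both the footage and the first-pass caption.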

Usage Examples

Basic Usage

from nemo_curator.models.qwen_vl import QwenVL

# Download weights first
QwenVL.download_weights_on_node("/path/to/models")

# Initialize and setup
model = QwenVL(
    model_dir="/path/to/models",
    model_variant="qwen",
    caption_batch_size=8,
    fp8=True,
    max_output_tokens=512,
)
model.setup()

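# Generate captions for video inputs.
# formatted_prompt and video_tensor are placeholders for a formatted prompt
# string and decoded video frames prepared upstream.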
video_inputs = [
    {"prompt": formatted_prompt, "multi_modal_data": {"video": video_tensor}},
]
captions = model.generate(video_inputs)
print(captions[0])

Two-Stage Caption Generation

from nemo_curator.models.qwen_vl import QwenVL

model = QwenVL(
    model_dir="/path/to/models",
    model_variant="qwen",
    caption_batch_size=8,
    stage2_prompt_text="Based on this initial description, provide a detailed caption: ",
)
model.setup()

# video_inputs has the same structure as in Basic Usage.
# Only the stage 2 captions are returned.
captions = model.generate(
    video_inputs,
    generate_stage2_caption=True,
    batch_size=4,
)

Related Pages

Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
NVIDIA_NeMo_Curator_ModelInterface -- Base class that QwenVL implements
NVIDIA_NeMo_Curator_QwenLM -- Text-only LM that pairs with QwenVL for caption enhancement
NVIDIA_NeMo_Curator_PromptFormatter -- Formats prompts for QwenVL input
