Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:NVIDIA NeMo Curator QwenVL

From Leeroopedia
Knowledge Sources
Domains Machine Learning, Computer Vision, Video Captioning
Last Updated 2026-02-14 00:00 GMT

Overview

Wraps the Qwen2.5-VL-7B-Instruct vision-language model for video captioning via vLLM, with support for optional two-stage caption generation.

Description

The QwenVL class implements ModelInterface and provides a multimodal inference wrapper around the Qwen/Qwen2.5-VL-7B-Instruct model (revision cc59489) using the vLLM inference engine.

On setup(), it initializes a vLLM LLM instance configured for multimodal input (1 video per prompt), with optional FP8 quantization, configurable preprocessing delegation (via model_does_preprocess), optional multimedia processor cache (mm_processor_cache_gb), and a maximum model length of 32768 tokens with 85% GPU memory utilization. Sampling parameters are set conservatively: temperature=0.1, top_p=0.001, repetition_penalty=1.05.

The generate method processes batches of video inputs using grouping.split_by_chunk_size for memory-efficient batching. It supports an optional two-stage captioning workflow:

  1. Stage 1: Generates an initial caption from the video input
  2. Stage 2: Substitutes the stage 1 caption into a stage2_prompt template (via regex pattern matching) and re-generates for an enhanced, more detailed caption

The model variant system maps string keys to HuggingFace model IDs (currently only "qwen" maps to Qwen/Qwen2.5-VL-7B-Instruct). If vLLM is not installed, dummy classes are provided for type compatibility, and setup() raises an ImportError at runtime.

Usage

Use QwenVL as the primary model for video captioning in the NeMo Curator data curation pipeline. It generates natural language descriptions of video content, with the two-stage approach enabling richer, more detailed captions suitable for training data annotation.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/models/qwen_vl.py
  • Lines: 1-157

Signature

class QwenVL(ModelInterface):
    def __init__(
        self,
        model_dir: str,
        model_variant: str,
        caption_batch_size: int,
        fp8: bool = True,
        max_output_tokens: int = 512,
        model_does_preprocess: bool = False,
        disable_mmcache: bool = False,
        stage2_prompt_text: str | None = None,
        verbose: bool = False,
    ): ...
    @property
    def model_id_names(self) -> list[str]: ...
    def setup(self) -> None: ...
    def generate(
        self,
        videos: list[dict[str, Any]],
        generate_stage2_caption: bool = False,
        batch_size: int = 16,
    ) -> list[str]: ...
    @classmethod
    def download_weights_on_node(cls, model_dir: str) -> None: ...

Import

from nemo_curator.models.qwen_vl import QwenVL

I/O Contract

Inputs (Constructor)

Name Type Required Description
model_dir str Yes Path to the directory where model weights are stored or will be downloaded
model_variant str Yes Model variant key (currently only "qwen" is supported)
caption_batch_size int Yes Batch size for caption generation
fp8 bool No Whether to use FP8 quantization (default: True)
max_output_tokens int No Maximum number of tokens to generate per input (default: 512)
model_does_preprocess bool No Whether the model handles its own preprocessing (default: False)
disable_mmcache bool No Whether to disable the multimedia processor cache (default: False)
stage2_prompt_text str or None No Template text for two-stage captioning; contains "user_prompt" placeholder (default: None)
verbose bool No Enable verbose logging (default: False)

Inputs (generate)

Name Type Required Description
videos list[dict[str, Any]] Yes List of video input dictionaries formatted for vLLM (with "prompt" and "multi_modal_data" keys)
generate_stage2_caption bool No Whether to perform two-stage caption generation (default: False)
batch_size int No Number of videos to process per batch (default: 16)

Outputs

Name Type Description
captions list[str] List of generated caption strings, one per input video

Model Configuration

Parameter Value
Model ID Qwen/Qwen2.5-VL-7B-Instruct
Revision cc59489
Temperature 0.1
top_p 0.001
Repetition penalty 1.05
Max model length 32768
GPU memory utilization 0.85
MM processor cache 4 GB (unless disabled)
Max batched tokens 32768
Quantization Optional FP8 (default: enabled)

Two-Stage Captioning

When generate_stage2_caption=True and a stage2_prompt_text is provided:

  1. The model first generates a caption from the video input (stage 1)
  2. The stage 1 caption is appended to the stage2_prompt_text
  3. The combined text replaces the user_prompt placeholder in the original prompt using a regex pattern (.*)(user_prompt)(.*)
  4. The model generates again with the enriched prompt (stage 2)
  5. Only the stage 2 output is returned

This approach enables the model to first understand the video content, then produce a more detailed and structured caption in the second pass.

Usage Examples

Basic Usage

from nemo_curator.models.qwen_vl import QwenVL

# Download weights first
QwenVL.download_weights_on_node("/path/to/models")

# Initialize and setup
model = QwenVL(
    model_dir="/path/to/models",
    model_variant="qwen",
    caption_batch_size=8,
    fp8=True,
    max_output_tokens=512,
)
model.setup()

# Generate captions for video inputs
video_inputs = [
    {"prompt": formatted_prompt, "multi_modal_data": {"video": video_tensor}},
]
captions = model.generate(video_inputs)
print(captions[0])

Two-Stage Caption Generation

from nemo_curator.models.qwen_vl import QwenVL

model = QwenVL(
    model_dir="/path/to/models",
    model_variant="qwen",
    caption_batch_size=8,
    stage2_prompt_text="Based on this initial description, provide a detailed caption: ",
)
model.setup()

captions = model.generate(
    video_inputs,
    generate_stage2_caption=True,
    batch_size=4,
)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment