Implementation:NVIDIA NeMo Curator QwenVL
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Computer Vision, Video Captioning |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Wraps the Qwen2.5-VL-7B-Instruct vision-language model for video captioning via vLLM, with support for optional two-stage caption generation.
Description
The QwenVL class implements ModelInterface and provides a multimodal inference wrapper around the Qwen/Qwen2.5-VL-7B-Instruct model (revision cc59489) using the vLLM inference engine.
On setup(), it initializes a vLLM LLM instance configured for multimodal input (1 video per prompt), with optional FP8 quantization, configurable preprocessing delegation (via model_does_preprocess), optional multimedia processor cache (mm_processor_cache_gb), and a maximum model length of 32768 tokens with 85% GPU memory utilization. Sampling parameters are set conservatively: temperature=0.1, top_p=0.001, repetition_penalty=1.05.
The generate method processes batches of video inputs using grouping.split_by_chunk_size for memory-efficient batching. It supports an optional two-stage captioning workflow:
- Stage 1: Generates an initial caption from the video input
- Stage 2: Substitutes the stage 1 caption into a stage2_prompt template (via regex pattern matching) and re-generates for an enhanced, more detailed caption
The model variant system maps string keys to HuggingFace model IDs (currently only "qwen" maps to Qwen/Qwen2.5-VL-7B-Instruct). If vLLM is not installed, dummy classes are provided for type compatibility, and setup() raises an ImportError at runtime.
Usage
Use QwenVL as the primary model for video captioning in the NeMo Curator data curation pipeline. It generates natural language descriptions of video content, with the two-stage approach enabling richer, more detailed captions suitable for training data annotation.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/models/qwen_vl.py
- Lines: 1-157
Signature
class QwenVL(ModelInterface):
def __init__(
self,
model_dir: str,
model_variant: str,
caption_batch_size: int,
fp8: bool = True,
max_output_tokens: int = 512,
model_does_preprocess: bool = False,
disable_mmcache: bool = False,
stage2_prompt_text: str | None = None,
verbose: bool = False,
): ...
@property
def model_id_names(self) -> list[str]: ...
def setup(self) -> None: ...
def generate(
self,
videos: list[dict[str, Any]],
generate_stage2_caption: bool = False,
batch_size: int = 16,
) -> list[str]: ...
@classmethod
def download_weights_on_node(cls, model_dir: str) -> None: ...
Import
from nemo_curator.models.qwen_vl import QwenVL
I/O Contract
Inputs (Constructor)
| Name | Type | Required | Description |
|---|---|---|---|
| model_dir | str | Yes | Path to the directory where model weights are stored or will be downloaded |
| model_variant | str | Yes | Model variant key (currently only "qwen" is supported) |
| caption_batch_size | int | Yes | Batch size for caption generation |
| fp8 | bool | No | Whether to use FP8 quantization (default: True) |
| max_output_tokens | int | No | Maximum number of tokens to generate per input (default: 512) |
| model_does_preprocess | bool | No | Whether the model handles its own preprocessing (default: False) |
| disable_mmcache | bool | No | Whether to disable the multimedia processor cache (default: False) |
| stage2_prompt_text | str or None | No | Template text for two-stage captioning; contains "user_prompt" placeholder (default: None) |
| verbose | bool | No | Enable verbose logging (default: False) |
Inputs (generate)
| Name | Type | Required | Description |
|---|---|---|---|
| videos | list[dict[str, Any]] | Yes | List of video input dictionaries formatted for vLLM (with "prompt" and "multi_modal_data" keys) |
| generate_stage2_caption | bool | No | Whether to perform two-stage caption generation (default: False) |
| batch_size | int | No | Number of videos to process per batch (default: 16) |
Outputs
| Name | Type | Description |
|---|---|---|
| captions | list[str] | List of generated caption strings, one per input video |
Model Configuration
| Parameter | Value |
|---|---|
| Model ID | Qwen/Qwen2.5-VL-7B-Instruct |
| Revision | cc59489 |
| Temperature | 0.1 |
| top_p | 0.001 |
| Repetition penalty | 1.05 |
| Max model length | 32768 |
| GPU memory utilization | 0.85 |
| MM processor cache | 4 GB (unless disabled) |
| Max batched tokens | 32768 |
| Quantization | Optional FP8 (default: enabled) |
Two-Stage Captioning
When generate_stage2_caption=True and a stage2_prompt_text is provided:
- The model first generates a caption from the video input (stage 1)
- The stage 1 caption is appended to the stage2_prompt_text
- The combined text replaces the user_prompt placeholder in the original prompt using a regex pattern (.*)(user_prompt)(.*)
- The model generates again with the enriched prompt (stage 2)
- Only the stage 2 output is returned
This approach enables the model to first understand the video content, then produce a more detailed and structured caption in the second pass.
Usage Examples
Basic Usage
from nemo_curator.models.qwen_vl import QwenVL
# Download weights first
QwenVL.download_weights_on_node("/path/to/models")
# Initialize and setup
model = QwenVL(
model_dir="/path/to/models",
model_variant="qwen",
caption_batch_size=8,
fp8=True,
max_output_tokens=512,
)
model.setup()
# Generate captions for video inputs
video_inputs = [
{"prompt": formatted_prompt, "multi_modal_data": {"video": video_tensor}},
]
captions = model.generate(video_inputs)
print(captions[0])
Two-Stage Caption Generation
from nemo_curator.models.qwen_vl import QwenVL
model = QwenVL(
model_dir="/path/to/models",
model_variant="qwen",
caption_batch_size=8,
stage2_prompt_text="Based on this initial description, provide a detailed caption: ",
)
model.setup()
captions = model.generate(
video_inputs,
generate_stage2_caption=True,
batch_size=4,
)
Related Pages
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- NVIDIA_NeMo_Curator_ModelInterface -- Base class that QwenVL implements
- NVIDIA_NeMo_Curator_QwenLM -- Text-only LM that pairs with QwenVL for caption enhancement
- NVIDIA_NeMo_Curator_PromptFormatter -- Formats prompts for QwenVL input