Implementation:NVIDIA NeMo Curator CaptionGenerationStage
| Knowledge Sources | |
|---|---|
| Domains | Data_Curation, Video_Processing, Multimodal_AI |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Concrete tool, provided by NeMo Curator, for generating video captions with a vision-language model.
Description
The CaptionGenerationStage runs prepared video windows through a vision-language model (Qwen-VL) to generate descriptive captions. It supports FP8 quantization to reduce memory use, a configurable caption batch size, and a limit on the number of output tokens. It works in conjunction with CaptionPreparationStage (frame windowing) and CaptionEnhancementStage (LLM refinement).
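The role of caption_batch_size can be pictured as chunking the prepared windows into fixed-size groups, one group per model call. The sketch below is only an illustration of that idea, not the stage's actual code: `batched` is a hypothetical helper, and the windows are stand-in strings rather than real window objects.

```python
from collections.abc import Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# Stand-in for 40 prepared windows; the real stage operates on window objects.
windows = [f"window_{i}" for i in range(40)]

# With the default caption_batch_size of 16, the last batch is a partial one.
batch_sizes = [len(batch) for batch in batched(windows, batch_size=16)]
print(batch_sizes)  # [16, 16, 8]
```

A larger batch size means fewer model calls but more memory held per call, which is why it is exposed as a tunable parameter.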
Usage
Place this stage after CaptionPreparationStage and before CaptionEnhancementStage in a video captioning pipeline.
Code Reference
Source Location
- Repository: NeMo Curator
- File: nemo_curator/stages/video/caption/caption_generation.py
- Lines: L28-128
Signature
@dataclass
class CaptionGenerationStage(ProcessingStage[VideoTask, VideoTask]):
    model_dir: str = "models/qwen"
    model_variant: str = "qwen"
    caption_batch_size: int = 16
    fp8: bool = False
    max_output_tokens: int = 512
    model_does_preprocess: bool = False
    disable_mmcache: bool = False
    verbose: bool = False
    generate_stage2_caption: bool = False
    stage2_prompt_text: str | None = None
    name: str = "caption_generation"
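To give a rough sense of why the fp8 flag reduces memory, the back-of-the-envelope estimate below compares weight storage at 2 bytes per parameter (a 16-bit baseline, which is an assumption here) with 1 byte per parameter for FP8. The 7-billion parameter count is a hypothetical figure chosen for illustration, not a statement about the model this stage loads, and activations and caches are ignored.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


NUM_PARAMS = 7e9  # assumed model size, for illustration only

fp16_gib = weight_memory_gib(NUM_PARAMS, bytes_per_param=2)  # assumed 16-bit baseline
fp8_gib = weight_memory_gib(NUM_PARAMS, bytes_per_param=1)   # FP8: one byte per weight

print(f"16-bit weights: {fp16_gib:.1f} GiB")  # 13.0 GiB
print(f"FP8 weights:    {fp8_gib:.1f} GiB")   # 6.5 GiB
```

Under these assumptions the weights take about half the space, which leaves more room for larger caption batches on the same GPU.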
Import
from nemo_curator.stages.video.caption.caption_generation import CaptionGenerationStage
from nemo_curator.stages.video.caption.caption_preparation import CaptionPreparationStage
from nemo_curator.stages.video.caption.caption_enhancement import CaptionEnhancementStage
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| task | VideoTask | Yes | Video whose clips carry the windows prepared by CaptionPreparationStage |
Outputs
| Name | Type | Description |
|---|---|---|
| task | VideoTask | Video with clip.windows[].caption populated |
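The contract above can be sketched with stand-in types: every window starts without a caption and has one after the stage runs. The `Window`, `Clip`, and `Video` classes and the `fake_caption_stage` function below are hypothetical simplifications written for this example; the real NeMo Curator classes carry more fields, and the actual type of the caption field may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Window:  # hypothetical stand-in for a prepared window
    start_frame: int
    end_frame: int
    caption: Optional[str] = None  # assumed type; empty until captioning runs


@dataclass
class Clip:  # hypothetical stand-in for a clip
    windows: list[Window] = field(default_factory=list)


@dataclass
class Video:  # hypothetical stand-in for the payload of a VideoTask
    clips: list[Clip] = field(default_factory=list)


def fake_caption_stage(video: Video) -> Video:
    """Fill every window's caption with a placeholder instead of calling a model."""
    for clip in video.clips:
        for window in clip.windows:
            window.caption = f"placeholder for frames {window.start_frame}-{window.end_frame}"
    return video


def all_windows_captioned(video: Video) -> bool:
    """Check the output side of the contract: no window is left without a caption."""
    return all(w.caption is not None for c in video.clips for w in c.windows)


video = Video(clips=[Clip(windows=[Window(0, 255), Window(256, 511)])])
before = all_windows_captioned(video)
after = all_windows_captioned(fake_caption_stage(video))
print(before, after)  # False True
```

A check like `all_windows_captioned` can serve as a quick sanity test between captioning and enhancement.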
Usage Examples
from nemo_curator.stages.video.caption.caption_preparation import CaptionPreparationStage
from nemo_curator.stages.video.caption.caption_generation import CaptionGenerationStage
from nemo_curator.stages.video.caption.caption_enhancement import CaptionEnhancementStage
# 1. Prepare frames for captioning
prep = CaptionPreparationStage(model_variant="qwen", sampling_fps=2.0, window_size=256)
# 2. Generate captions with VL model
gen = CaptionGenerationStage(model_dir="models/qwen", caption_batch_size=16, fp8=False)
# 3. Enhance captions with LLM
enhance = CaptionEnhancementStage(model_dir="models/qwen", model_batch_size=128)
Related Pages
Implements Principle