Implementation:NVIDIA NeMo Curator CaptionGenerationStage

From Leeroopedia
Knowledge Sources
Domains: Data_Curation, Video_Processing, Multimodal_AI
Last Updated: 2026-02-14 17:00 GMT

Overview

Concrete tool, provided by NeMo Curator, for generating video captions with a vision-language model.

Description

The CaptionGenerationStage runs prepared video windows through a vision-language model (Qwen-VL) to generate descriptive captions. It supports FP8 quantization to reduce memory use (fp8), a configurable batch size (caption_batch_size), and a cap on generated tokens (max_output_tokens). It works together with CaptionPreparationStage (frame windowing) and CaptionEnhancementStage (LLM refinement).
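To give a rough sense of why FP8 reduces memory, the sketch below estimates weight storage only. It assumes a hypothetical 7-billion-parameter model (this page does not state the model size) and ignores activations, caches, and any layers that a real FP8 setup might keep in higher precision.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return num_params * bits_per_param / 8 / 1e9


# Hypothetical parameter count, chosen only for illustration.
NUM_PARAMS = 7_000_000_000

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit weights
fp8_gb = weight_memory_gb(NUM_PARAMS, 8)    # 8-bit (FP8) weights
print(f"16-bit weights: {bf16_gb:.1f} GB, FP8 weights: {fp8_gb:.1f} GB")
```

Under these assumptions, halving the bits per weight halves weight storage; actual savings depend on the model and the runtime.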

Usage

Place this stage after CaptionPreparationStage and before CaptionEnhancementStage in a video captioning pipeline.

Code Reference

Source Location

  • Repository: NeMo Curator
  • File: nemo_curator/stages/video/caption/caption_generation.py
  • Lines: L28-128

Signature

@dataclass
class CaptionGenerationStage(ProcessingStage[VideoTask, VideoTask]):
    model_dir: str = "models/qwen"
    model_variant: str = "qwen"
    caption_batch_size: int = 16
    fp8: bool = False
    max_output_tokens: int = 512
    model_does_preprocess: bool = False
    disable_mmcache: bool = False
    verbose: bool = False
    generate_stage2_caption: bool = False
    stage2_prompt_text: str | None = None
    name: str = "caption_generation"

Import

from nemo_curator.stages.video.caption.caption_generation import CaptionGenerationStage
from nemo_curator.stages.video.caption.caption_preparation import CaptionPreparationStage
from nemo_curator.stages.video.caption.caption_enhancement import CaptionEnhancementStage

I/O Contract

Inputs

Name | Type | Required | Description
task | VideoTask | Yes | Video whose clips have windows prepared by CaptionPreparationStage

Outputs

Name | Type | Description
task | VideoTask | Video with clip.windows[].caption populated
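The I/O contract can be pictured with a small, self-contained sketch. The `Window` and `Clip` classes, the `caption_clips` helper, and the `fake_vlm` function are simplified stand-ins invented for illustration; they are not the NeMo Curator classes, and whether the real stage batches windows across clips in this way is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


# Simplified stand-ins for illustration; not the NeMo Curator classes.
@dataclass
class Window:
    frames: List[str]              # placeholder for prepared model inputs
    caption: Optional[str] = None  # filled in by the generation step


@dataclass
class Clip:
    windows: List[Window] = field(default_factory=list)


def caption_clips(
    clips: List[Clip],
    caption_fn: Callable[[List[List[str]]], List[str]],
    batch_size: int = 16,
) -> None:
    """Gather windows from all clips, caption them in batches, write results back."""
    windows = [w for clip in clips for w in clip.windows]
    for start in range(0, len(windows), batch_size):
        batch = windows[start:start + batch_size]
        captions = caption_fn([w.frames for w in batch])
        for window, caption in zip(batch, captions):
            window.caption = caption


# Dummy captioner standing in for the vision-language model.
def fake_vlm(batch: List[List[str]]) -> List[str]:
    return [f"caption for {len(frames)} frames" for frames in batch]


clips = [
    Clip([Window(["f0", "f1"]), Window(["f2"])]),
    Clip([Window(["f3", "f4", "f5"])]),
]
caption_clips(clips, fake_vlm, batch_size=2)
print([w.caption for c in clips for w in c.windows])
```

The point of the sketch is the shape of the contract: windows arrive with prepared inputs and no caption, and leave with `caption` set, while the batch size only controls how many windows reach the model per call.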

Usage Examples

from nemo_curator.stages.video.caption.caption_preparation import CaptionPreparationStage
from nemo_curator.stages.video.caption.caption_generation import CaptionGenerationStage
from nemo_curator.stages.video.caption.caption_enhancement import CaptionEnhancementStage

# 1. Prepare frames for captioning
prep = CaptionPreparationStage(model_variant="qwen", sampling_fps=2.0, window_size=256)

# 2. Generate captions with VL model
gen = CaptionGenerationStage(model_dir="models/qwen", caption_batch_size=16, fp8=False)

# 3. Enhance captions with LLM
enhance = CaptionEnhancementStage(model_dir="models/qwen", model_batch_size=128)

Related Pages

Implements Principle
