
Implementation:Datajuicer Data juicer VideoCaptioningFromVLMMapper

Revision as of 12:23, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Datajuicer_Data_juicer_VideoCaptioningFromVLMMapper.md)
Knowledge Sources
Domains: Video Processing, Caption Generation, Vision-Language Models
Last Updated: 2026-02-14 16:00 GMT

Overview

Generates video captions using a Vision-Language Model (VLM) that natively accepts video inputs, such as Qwen3-VL, enabling end-to-end video understanding without manual frame extraction.

Description

VideoCaptioningFromVLMMapper is a batched operator that processes video samples to generate captions using modern VLMs that understand video natively rather than through frame-level proxies. Key features include:

  • Native Video Input -- Sends video files directly to the VLM (default: Qwen/Qwen3-VL-8B-Instruct), relying on the model's built-in video understanding capabilities
  • vLLM Acceleration -- Optional vLLM backend for faster inference, with automatic tensor parallelism across the available GPUs
  • Multiple Caption Candidates -- Generates a configurable number of caption candidates per video (caption_num parameter)
  • Flexible Retention Strategies:
    • random_any -- Retains one randomly chosen caption
    • similar_one_simhash -- Retains the caption most similar to the original text using SimHash distance
    • all -- Retains all generated captions
  • Configurable Prompts -- Supports per-sample prompts via prompt_key, global prompts via prompt, or the default prompt ("Describe the input video in 1-2 sentences.")
  • Original Sample Preservation -- Optionally keeps the original sample alongside generated captions
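The three retention strategies listed above can be illustrated with a small, self-contained sketch. This is not the operator's code: `select_captions`, `simhash`, and `hamming` are hypothetical helpers, and the toy SimHash here (whitespace tokens hashed with MD5) is only an assumption standing in for whatever SimHash implementation Data-Juicer actually uses.

```python
import hashlib
import random
from typing import List, Optional

BITS = 64  # fingerprint width of the toy SimHash below


def simhash(text: str) -> int:
    """Toy SimHash over whitespace tokens (illustrative, not Data-Juicer's)."""
    weights = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A bit is set in the fingerprint when its accumulated weight is positive.
    return sum(1 << i for i in range(BITS) if weights[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def select_captions(
    candidates: List[str],
    original_text: str,
    mode: str = "random_any",
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Hypothetical helper mirroring the keep_candidate_mode options."""
    if not candidates:
        return []
    if mode == "all":
        return list(candidates)
    if mode == "random_any":
        return [(rng or random).choice(candidates)]
    if mode == "similar_one_simhash":
        target = simhash(original_text)
        # Smallest Hamming distance means most similar fingerprint.
        return [min(candidates, key=lambda c: hamming(simhash(c), target))]
    raise ValueError(f"unknown keep_candidate_mode: {mode!r}")
```

With `caption_num=1` all three modes return the same single caption, so the choice only matters when several candidates are generated per video.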

The operator processes text chunks marked with special tokens (SpecialTokens.video for video references, SpecialTokens.eoc for chunk ends), generating captions for each video reference and inserting them into the output text. Depending on enable_vllm, inference runs on either the Hugging Face Transformers backend or the vLLM backend.
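The chunk handling described above might look roughly like the following sketch. The token strings are assumptions about the values of SpecialTokens.video and SpecialTokens.eoc, `rebuild_text` is a hypothetical helper, and the exact layout of the operator's output text may differ; the captions are passed in rather than generated so that the example runs without a model.

```python
from typing import List

VIDEO_TOKEN = "<__dj__video>"  # assumed value of SpecialTokens.video
EOC_TOKEN = "<|__dj__eoc|>"    # assumed value of SpecialTokens.eoc


def rebuild_text(text: str, captions: List[str]) -> str:
    """Replace each chunk's text with captions, one per video reference.

    Chunks that reference no video are skipped in this sketch.
    """
    pos = 0  # index of the next unused caption
    new_chunks = []
    for chunk in text.split(EOC_TOKEN):
        n_videos = chunk.count(VIDEO_TOKEN)
        if n_videos == 0:
            continue
        if pos + n_videos > len(captions):
            raise ValueError("fewer captions than video references")
        chunk_captions = captions[pos:pos + n_videos]
        pos += n_videos
        # Keep the video tokens in front so the caption stays tied to its videos.
        new_chunks.append(VIDEO_TOKEN * n_videos + " " + " ".join(chunk_captions))
    return "".join(chunk + EOC_TOKEN for chunk in new_chunks)
```

For a text with one single-video chunk and one two-video chunk, three captions are consumed in order and each rebuilt chunk ends with the end-of-chunk token.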

The operator requires CUDA acceleration and allocates 70 GB of memory by default.

Usage

Use this operator to generate video captions for datasets that need temporal understanding. It is the most advanced video captioning approach in the framework, producing higher-quality temporal descriptions than frame-level approaches.

Code Reference

Source Location

  • Repository: Datajuicer_Data_juicer
  • File: data_juicer/ops/mapper/video_captioning_from_vlm_mapper.py
  • Lines: 1-409

Signature

class VideoCaptioningFromVLMMapper(Mapper):
    _accelerator = 'cuda'
    _batched_op = True

    def __init__(
        self,
        hf_model: str = 'Qwen/Qwen3-VL-8B-Instruct',
        enable_vllm: bool = False,
        caption_num: PositiveInt = 1,
        keep_candidate_mode: str = 'random_any',
        keep_original_sample: bool = True,
        prompt: Optional[str] = None,
        prompt_key: Optional[str] = None,
        model_params: Dict = None,
        sampling_params: Dict = None,
        *args, **kwargs,
    ):

Import

from data_juicer.ops.mapper.video_captioning_from_vlm_mapper import VideoCaptioningFromVLMMapper

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| hf_model | str | No | VLM model name on Hugging Face. Default: "Qwen/Qwen3-VL-8B-Instruct" |
| enable_vllm | bool | No | Use vLLM for model loading/inference. Default: False |
| caption_num | PositiveInt | No | Number of candidate captions to generate per video. Default: 1 |
| keep_candidate_mode | str | No | Retention strategy: "random_any", "similar_one_simhash", or "all". Default: "random_any" |
| keep_original_sample | bool | No | Whether to keep the original sample alongside generated captions. Default: True |
| prompt | str | No | Global prompt to guide generation. Default: None (uses DEFAULT_PROMPT) |
| prompt_key | str | No | Field name for per-sample prompts. Default: None |
| model_params | Dict | No | Parameters for model initialization |
| sampling_params | Dict | No | Extra parameters for model inference (temperature, top_p, etc.) |
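How prompt, prompt_key, and the default prompt interact can be sketched as follows. The precedence shown (per-sample prompt first, then the global prompt, then the default) is an assumption based on the parameter descriptions, not a confirmed reading of the source, and `resolve_prompt` is a hypothetical helper.

```python
from typing import Any, Dict, Optional

# Default prompt text as quoted in the Description section.
DEFAULT_PROMPT = "Describe the input video in 1-2 sentences."


def resolve_prompt(
    sample: Dict[str, Any],
    prompt: Optional[str] = None,
    prompt_key: Optional[str] = None,
) -> str:
    """Pick the prompt for one sample (assumed precedence, see lead-in)."""
    # 1. Per-sample prompt, if a key is configured and the field is non-empty.
    if prompt_key and sample.get(prompt_key):
        return sample[prompt_key]
    # 2. Global prompt passed to the operator.
    if prompt:
        return prompt
    # 3. Built-in default.
    return DEFAULT_PROMPT
```

For example, `resolve_prompt({"my_prompt": "List the objects shown."}, prompt_key="my_prompt")` returns the per-sample text, while a sample without that field falls back to the global or default prompt.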

Outputs

| Name | Type | Description |
|------|------|-------------|
| samples | Dict | Column-oriented dictionary containing original samples (if kept) and generated caption samples with video special tokens |
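A column-oriented batch stores one list per field rather than one dictionary per sample. The sketch below shows how generated caption samples could be appended to such a batch, with or without the originals. The helper names are hypothetical, and the `text` and `videos` field names in the usage note are assumptions about the dataset schema.

```python
from typing import Any, Dict, List

Batch = Dict[str, List[Any]]
Row = Dict[str, Any]


def columns_to_rows(batch: Batch) -> List[Row]:
    """Turn {'field': [v0, v1, ...]} into [{'field': v0}, {'field': v1}, ...]."""
    n = len(next(iter(batch.values()), []))
    return [{key: batch[key][i] for key in batch} for i in range(n)]


def rows_to_columns(rows: List[Row], keys: List[str]) -> Batch:
    """Inverse of columns_to_rows for a fixed set of field names."""
    return {key: [row[key] for row in rows] for key in keys}


def merge_generated(
    batch: Batch,
    generated_rows: List[Row],
    keep_original_sample: bool = True,
) -> Batch:
    """Append generated caption samples, optionally keeping the originals."""
    rows = columns_to_rows(batch) if keep_original_sample else []
    rows.extend(generated_rows)
    return rows_to_columns(rows, keys=list(batch.keys()))
```

Given a batch `{"text": ["t0"], "videos": [["v0.mp4"]]}` and one generated row for the same video, the merged batch holds two entries per field when the original is kept and one entry per field when it is dropped.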

Usage Examples

from data_juicer.ops.mapper.video_captioning_from_vlm_mapper import VideoCaptioningFromVLMMapper

# Basic usage with vLLM acceleration
mapper = VideoCaptioningFromVLMMapper(
    hf_model='Qwen/Qwen3-VL-8B-Instruct',
    enable_vllm=True,
    caption_num=3,
    keep_candidate_mode='random_any',
    keep_original_sample=True,
)

# With a custom prompt and the Hugging Face Transformers backend
mapper = VideoCaptioningFromVLMMapper(
    hf_model='Qwen/Qwen3-VL-8B-Instruct',
    enable_vllm=False,
    prompt="Describe the actions and events in this video in detail.",
    caption_num=1,
    sampling_params={"temperature": 0.7, "top_p": 0.95},
)
