Implementation:Datajuicer Data juicer VideoCaptioningFromVLMMapper

Knowledge Sources	Datajuicer_Data_juicer
Domains	Video Processing, Caption Generation, Vision-Language Models
Last Updated	2026-02-14 16:00 GMT

Overview

Generates video captions using a Vision-Language Model (VLM) that natively accepts video inputs, such as Qwen3-VL, enabling end-to-end video understanding without manual frame extraction.

Description

VideoCaptioningFromVLMMapper is a batched operator that processes video samples to generate captions using modern VLMs that understand video natively rather than through frame-level proxies. Key features include:

Native Video Input -- Sends video files directly to the VLM (default: Qwen/Qwen3-VL-8B-Instruct), leveraging the model's built-in video understanding capabilities
vLLM Acceleration -- Optional vLLM backend for faster inference with automatic tensor parallelism across available GPUs
Multiple Caption Candidates -- Generates a configurable number of caption candidates per video (caption_num parameter)
Flexible Retention Strategies:
- random_any -- Retains one randomly chosen caption
- similar_one_simhash -- Retains the caption most similar to the original text using SimHash distance
- all -- Retains all generated captions
Configurable Prompts -- Supports per-sample prompts via prompt_key, global prompts via prompt, or the default prompt ("Describe the input video in 1-2 sentences.")
Original Sample Preservation -- Optionally keeps the original sample alongside generated captions
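
The three retention strategies can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the helper names (simhash, hamming, select_candidates) are hypothetical, and the toy SimHash over whitespace tokens is a stand-in, not the SimHash implementation the operator actually uses.

```python
import hashlib
import random
from typing import List, Optional


def simhash(text: str, bits: int = 64) -> int:
    """Toy SimHash over whitespace tokens (illustrative stand-in only)."""
    weights = [0] * bits
    for token in text.lower().split():
        # Use the first 8 bytes of an MD5 digest as a 64-bit token hash.
        digest = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (digest >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def select_candidates(
    original: str,
    candidates: List[str],
    mode: str = "random_any",
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Hypothetical helper mirroring the documented keep_candidate_mode values."""
    if not candidates:
        return []
    if mode == "all":
        return list(candidates)
    if mode == "random_any":
        return [(rng or random.Random()).choice(candidates)]
    if mode == "similar_one_simhash":
        ref = simhash(original)
        # Smallest Hamming distance to the original text wins.
        return [min(candidates, key=lambda c: hamming(ref, simhash(c)))]
    raise ValueError(f"unknown keep_candidate_mode: {mode!r}")
```

Note that with caption_num=1 all three modes return the same single caption; the choice only matters when several candidates are generated.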

The operator processes the text chunk by chunk, using the special tokens SpecialTokens.video and SpecialTokens.eoc to locate video references and chunk boundaries. It generates captions for each video reference and inserts them into the output text. Inference runs on either the HuggingFace Transformers backend or the vLLM backend, depending on the enable_vllm setting.
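
The chunk-wise flow described above can be sketched as follows. This is not the operator's code: the token strings are assumed values for SpecialTokens.video and SpecialTokens.eoc, caption_text is a hypothetical helper, and caption_fn stands in for the actual VLM call. The exact layout of the operator's output text may differ.

```python
from typing import Callable, List

# Assumed literal values of SpecialTokens.video and SpecialTokens.eoc.
VIDEO_TOKEN = "<__dj__video>"
EOC_TOKEN = "<|__dj__eoc|>"


def caption_text(
    text: str,
    video_paths: List[str],
    caption_fn: Callable[[str], str],
) -> str:
    """Rebuild the text so each chunk holds its video tokens plus new captions."""
    new_chunks = []
    offset = 0  # index of the next unused entry in video_paths
    for chunk in text.split(EOC_TOKEN):
        num_videos = chunk.count(VIDEO_TOKEN)
        if num_videos == 0:
            continue  # chunks without a video reference produce no caption
        chunk_videos = video_paths[offset:offset + num_videos]
        offset += num_videos
        captions = " ".join(caption_fn(path) for path in chunk_videos)
        new_chunks.append(f"{VIDEO_TOKEN * num_videos} {captions} {EOC_TOKEN}")
    return "".join(new_chunks)


# Usage with a stub in place of the model:
sample_text = f"{VIDEO_TOKEN} old text {EOC_TOKEN}"
print(caption_text(sample_text, ["clip.mp4"], lambda p: f"caption of {p}"))
```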

Requires CUDA acceleration and allocates 70GB memory by default.

Usage

Use this operator to generate video captions for datasets that need temporal understanding. Because the model consumes the video directly rather than a set of sampled frames, it is intended to capture motion and event order that frame-level captioning approaches can miss.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/mapper/video_captioning_from_vlm_mapper.py
Lines: 1-409

Signature

class VideoCaptioningFromVLMMapper(Mapper):
    _accelerator = 'cuda'
    _batched_op = True

    def __init__(
        self,
        hf_model: str = 'Qwen/Qwen3-VL-8B-Instruct',
        enable_vllm: bool = False,
        caption_num: PositiveInt = 1,
        keep_candidate_mode: str = 'random_any',
        keep_original_sample: bool = True,
        prompt: Optional[str] = None,
        prompt_key: Optional[str] = None,
        model_params: Dict = None,
        sampling_params: Dict = None,
        *args, **kwargs,
    ):

Import

from data_juicer.ops.mapper.video_captioning_from_vlm_mapper import VideoCaptioningFromVLMMapper

I/O Contract

Inputs

Name	Type	Required	Description
hf_model	str	No	VLM model name on HuggingFace. Default: "Qwen/Qwen3-VL-8B-Instruct"
enable_vllm	bool	No	Use vLLM for model loading and inference. Default: False
caption_num	PositiveInt	No	Number of candidate captions to generate per video. Default: 1
keep_candidate_mode	str	No	Retention strategy: "random_any", "similar_one_simhash", or "all". Default: "random_any"
keep_original_sample	bool	No	Whether to keep original sample alongside generated captions. Default: True
prompt	str	No	Global prompt to guide generation. Default: None (uses DEFAULT_PROMPT)
prompt_key	str	No	Field name for per-sample prompts. Default: None
model_params	Dict	No	Parameters for model initialization. Default: None
sampling_params	Dict	No	Extra parameters for model inference (temperature, top_p, etc.). Default: None
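
How the prompt and prompt_key inputs interact can be sketched as below. The precedence shown (per-sample prompt first, then the global prompt, then the default) is an assumption based on the parameter descriptions, and resolve_prompt is a hypothetical helper, not part of the operator's API.

```python
from typing import Any, Dict, Optional

# Default prompt text as quoted in the Description section.
DEFAULT_PROMPT = "Describe the input video in 1-2 sentences."


def resolve_prompt(
    sample: Dict[str, Any],
    prompt: Optional[str] = None,
    prompt_key: Optional[str] = None,
) -> str:
    """Pick the prompt for one sample (assumed precedence, for illustration)."""
    # 1. A non-empty per-sample prompt stored under prompt_key.
    if prompt_key and sample.get(prompt_key):
        return sample[prompt_key]
    # 2. The global prompt passed to the operator.
    if prompt:
        return prompt
    # 3. The built-in default.
    return DEFAULT_PROMPT
```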

Outputs

Name	Type	Description
samples	Dict	Column-oriented dictionary containing original samples (if kept) and generated caption samples with video special tokens
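
Because the operator is batched, input and output are dictionaries of columns rather than lists of rows. The sketch below shows one way such a batch could grow when generated samples are added next to the originals. The helper names and the generate callback are hypothetical, and the row ordering is an assumption.

```python
from typing import Any, Callable, Dict, List

Batch = Dict[str, List[Any]]
Row = Dict[str, Any]


def columns_to_rows(batch: Batch) -> List[Row]:
    """Turn {"text": [...], "videos": [...]} into a list of per-sample dicts."""
    keys = list(batch)
    size = len(batch[keys[0]]) if keys else 0
    return [{k: batch[k][i] for k in keys} for i in range(size)]


def rows_to_columns(rows: List[Row], keys: List[str]) -> Batch:
    """Inverse of columns_to_rows for a fixed set of keys."""
    return {k: [row[k] for row in rows] for k in keys}


def expand_batch(
    batch: Batch,
    generate: Callable[[Row], List[Row]],
    keep_original_sample: bool = True,
) -> Batch:
    """Append generated rows after each original row (or replace it)."""
    out_rows: List[Row] = []
    for row in columns_to_rows(batch):
        if keep_original_sample:
            out_rows.append(row)
        out_rows.extend(generate(row))
    return rows_to_columns(out_rows, list(batch))
```

With keep_original_sample=True and one retained caption per sample, the output batch in this sketch is twice the size of the input batch.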

Usage Examples

from data_juicer.ops.mapper.video_captioning_from_vlm_mapper import VideoCaptioningFromVLMMapper

# Basic usage with vLLM acceleration
mapper = VideoCaptioningFromVLMMapper(
    hf_model='Qwen/Qwen3-VL-8B-Instruct',
    enable_vllm=True,
    caption_num=3,
    keep_candidate_mode='random_any',
    keep_original_sample=True,
)

# With custom prompt and HuggingFace backend
mapper = VideoCaptioningFromVLMMapper(
    hf_model='Qwen/Qwen3-VL-8B-Instruct',
    enable_vllm=False,
    prompt="Describe the actions and events in this video in detail.",
    caption_num=1,
    sampling_params={"temperature": 0.7, "top_p": 0.95},
)
