Implementation:Datajuicer Data juicer VideoCaptioningFromVLMMapper
| Knowledge Sources | |
|---|---|
| Domains | Video Processing, Caption Generation, Vision-Language Models |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Generates video captions using a Vision-Language Model (VLM) that natively accepts video inputs, such as Qwen3-VL, enabling end-to-end video understanding without manual frame extraction.
Description
VideoCaptioningFromVLMMapper is a batched operator that processes video samples to generate captions using modern VLMs that understand video natively rather than through frame-level proxies. Key features include:
- Native Video Input -- Sends video files directly to the VLM (default: Qwen/Qwen3-VL-8B-Instruct), leveraging the model's built-in video understanding capabilities
- vLLM Acceleration -- Optional vLLM backend support for faster inference with automatic tensor parallelism across available GPUs
- Multiple Caption Candidates -- Generates a configurable number of caption candidates per video (caption_num parameter)
- Flexible Retention Strategies (keep_candidate_mode parameter):
  - random_any -- Retains one randomly chosen caption
  - similar_one_simhash -- Retains the caption most similar to the original text using SimHash distance
  - all -- Retains all generated captions
- Configurable Prompts -- Supports per-sample prompts via prompt_key, global prompts via prompt, or the default prompt ("Describe the input video in 1-2 sentences.")
- Original Sample Preservation -- Optionally keeps the original sample alongside generated captions
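The three retention strategies can be illustrated with a minimal, standalone sketch. It is not the operator's code: `simhash`, `hamming`, and `select_captions` are hypothetical helpers, and the toy SimHash here (MD5 over whitespace tokens, 64 bits) is an assumption that only approximates whatever SimHash implementation the operator uses.

```python
import hashlib
import random


def simhash(text: str, bits: int = 64) -> int:
    """Toy SimHash over whitespace tokens (illustrative, not Data-Juicer's)."""
    weights = [0] * bits
    for token in text.lower().split():
        # Use the first 8 bytes of MD5 as a stable 64-bit token hash.
        digest = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (digest >> i) & 1 else -1
    # A bit is set in the fingerprint when its accumulated weight is positive.
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def select_captions(original_text, candidates, mode="random_any", rng=None):
    """Pick which candidate captions to keep (hypothetical helper)."""
    if not candidates:
        return []
    if mode == "all":
        return list(candidates)
    if mode == "random_any":
        return [(rng or random).choice(candidates)]
    if mode == "similar_one_simhash":
        target = simhash(original_text)
        # Smallest Hamming distance means most similar fingerprint.
        return [min(candidates, key=lambda c: hamming(simhash(c), target))]
    raise ValueError(f"unknown keep_candidate_mode: {mode}")
```

Passing a seeded `random.Random` as `rng` makes the `random_any` choice reproducible, which is convenient when comparing runs.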
The operator processes text chunks marked by special tokens (SpecialTokens.video, SpecialTokens.eoc), generates a caption for each video reference, and inserts the captions into the output text. Inference runs on either the HuggingFace Transformers backend or the vLLM backend, depending on enable_vllm.
The operator requires CUDA acceleration and allocates 70 GB of memory by default.
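As a rough illustration of the chunk handling described above, the sketch below splits a text on the end-of-chunk token, counts the video tokens in each chunk, and pairs each one with a caption. The token strings are assumed values for `SpecialTokens.video` and `SpecialTokens.eoc`, `rebuild_text` is a hypothetical helper, and the exact output layout of the real operator may differ.

```python
VIDEO_TOKEN = "<__dj__video>"  # assumed value of SpecialTokens.video
EOC_TOKEN = "<|__dj__eoc|>"    # assumed value of SpecialTokens.eoc


def rebuild_text(text, captions):
    """Replace each chunk's text with captions, one per video token.

    `captions` holds one caption per video token, in order of appearance.
    Chunks without any video token are dropped in this sketch.
    """
    chunks_out = []
    cursor = 0
    for chunk in text.split(EOC_TOKEN):
        n_videos = chunk.count(VIDEO_TOKEN)
        if n_videos == 0:
            continue
        chunk_captions = captions[cursor:cursor + n_videos]
        if len(chunk_captions) != n_videos:
            raise ValueError("fewer captions than video tokens")
        cursor += n_videos
        # Keep the video tokens in front, followed by the generated captions.
        chunks_out.append(VIDEO_TOKEN * n_videos + " " + " ".join(chunk_captions))
    return f" {EOC_TOKEN} ".join(chunks_out)
```

Keeping the video tokens at the front of each chunk preserves the alignment between the text and the sample's list of video paths.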
Usage
Use this operator to generate captions for video datasets where temporal understanding matters. Because the VLM receives the video itself rather than sampled frames, it is intended to produce higher-quality temporal descriptions than the framework's frame-level captioning approaches.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/video_captioning_from_vlm_mapper.py
- Lines: 1-409
Signature
class VideoCaptioningFromVLMMapper(Mapper):
    _accelerator = 'cuda'
    _batched_op = True

    def __init__(
        self,
        hf_model: str = 'Qwen/Qwen3-VL-8B-Instruct',
        enable_vllm: bool = False,
        caption_num: PositiveInt = 1,
        keep_candidate_mode: str = 'random_any',
        keep_original_sample: bool = True,
        prompt: Optional[str] = None,
        prompt_key: Optional[str] = None,
        model_params: Dict = None,
        sampling_params: Dict = None,
        *args, **kwargs,
    ):
Import
from data_juicer.ops.mapper.video_captioning_from_vlm_mapper import VideoCaptioningFromVLMMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hf_model | str | No | VLM model name on HuggingFace. Default: "Qwen/Qwen3-VL-8B-Instruct" |
| enable_vllm | bool | No | Use vLLM for model loading and inference. Default: False |
| caption_num | PositiveInt | No | Number of candidate captions to generate per video. Default: 1 |
| keep_candidate_mode | str | No | Retention strategy: "random_any", "similar_one_simhash", or "all". Default: "random_any" |
| keep_original_sample | bool | No | Whether to keep original sample alongside generated captions. Default: True |
| prompt | str | No | Global prompt to guide generation. Default: None (uses DEFAULT_PROMPT) |
| prompt_key | str | No | Field name for per-sample prompts. Default: None |
| model_params | Dict | No | Parameters for model initialization. Default: None |
| sampling_params | Dict | No | Extra parameters for model inference (temperature, top_p, etc.). Default: None |
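The interaction between prompt_key, prompt, and the default prompt can be sketched as a small fallback chain. This is an assumption-laden illustration: `resolve_prompt` is a hypothetical helper, and the precedence shown (per-sample prompt first, then the global prompt, then the default) is assumed rather than confirmed by the source.

```python
DEFAULT_PROMPT = "Describe the input video in 1-2 sentences."


def resolve_prompt(sample, prompt=None, prompt_key=None):
    """Choose the prompt for one sample (assumed precedence, see lead-in)."""
    if prompt_key is not None:
        per_sample = sample.get(prompt_key)
        # Only use the per-sample value when it is a non-empty string.
        if isinstance(per_sample, str) and per_sample.strip():
            return per_sample
    if prompt:
        return prompt
    return DEFAULT_PROMPT
```

For example, `resolve_prompt({"my_prompt": "List the actions."}, prompt="Summarize.", prompt_key="my_prompt")` returns the per-sample prompt, while a sample without that field falls back to `"Summarize."`.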
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Column-oriented dictionary containing original samples (if kept) and generated caption samples with video special tokens |
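To make "column-oriented" concrete, the sketch below assembles a batch from original samples and their generated caption samples, then pivots the rows into a dictionary of lists. The helper name `collect_batch` and the field names `text` and `videos` are assumptions chosen for illustration; the real operator's sample schema may contain more fields.

```python
def collect_batch(originals, generated, keep_original_sample=True):
    """Merge originals with generated samples into a dict of columns.

    `generated[i]` is the list of new caption samples derived from
    `originals[i]`. All rows are assumed to share the same keys.
    """
    rows = []
    for original, new_rows in zip(originals, generated):
        if keep_original_sample:
            rows.append(original)
        rows.extend(new_rows)
    keys = list(rows[0].keys()) if rows else []
    # Pivot: one list per field, with rows kept in order.
    return {key: [row[key] for row in rows] for key in keys}
```

With one original sample and two generated captions, the returned dictionary has three entries per column when keep_original_sample is True and two otherwise.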
Usage Examples
from data_juicer.ops.mapper.video_captioning_from_vlm_mapper import VideoCaptioningFromVLMMapper

# Basic usage with vLLM acceleration: generate three candidates, keep one at random
mapper = VideoCaptioningFromVLMMapper(
    hf_model='Qwen/Qwen3-VL-8B-Instruct',
    enable_vllm=True,
    caption_num=3,
    keep_candidate_mode='random_any',
    keep_original_sample=True,
)

# With a custom prompt and the HuggingFace Transformers backend
mapper = VideoCaptioningFromVLMMapper(
    hf_model='Qwen/Qwen3-VL-8B-Instruct',
    enable_vllm=False,
    prompt="Describe the actions and events in this video in detail.",
    caption_num=1,
    sampling_params={"temperature": 0.7, "top_p": 0.95},
)