Implementation:Datajuicer Data juicer VideoCaptioningFromAudioMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A concrete tool, provided by Data-Juicer, for generating video captions from the videos' audio streams.
Description
VideoCaptioningFromAudioMapper is a mapper operator that generates text captions for videos from their audio streams using the Qwen-Audio model. For each video, it extracts the audio streams, runs them through the Qwen-Audio HuggingFace model with a transcription/captioning prompt, and strips special tokens from the model output with a regex. The generated captions are then inserted into the sample text. Optionally, the original sample is kept alongside the captioned version.
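The special-token cleanup step can be sketched as follows. This is a minimal illustration and not the operator's actual code: the pattern `<\|.*?\|>`, the helper name `strip_special_tokens`, and the sample model output are all assumptions, chosen because Qwen-Audio style control tokens take the form `<|...|>`.

```python
import re

# Assumed pattern: matches control tokens of the form <|...|>, e.g. <|caption|>.
SPECIAL_TOKEN_PATTERN = re.compile(r"<\|.*?\|>")


def strip_special_tokens(raw_output: str) -> str:
    """Remove <|...|>-style tokens and trim whitespace (hypothetical helper)."""
    return SPECIAL_TOKEN_PATTERN.sub("", raw_output).strip()


# Made-up model output used only for illustration.
raw = "<|startoftranscription|><|en|><|caption|>A man is speaking.<|endoftext|>"
print(strip_special_tokens(raw))  # A man is speaking.
```

The non-greedy `.*?` matters here: a greedy `.*` would match from the first `<|` to the last `|>` and delete the caption text along with the tokens.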
Usage
Use this operator when video understanding depends on information carried by the audio channel. It is particularly valuable for videos where visual content alone is insufficient, such as narrated content or dialogue-heavy scenes.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/video_captioning_from_audio_mapper.py
Signature
@OPERATORS.register_module("video_captioning_from_audio_mapper")
class VideoCaptioningFromAudioMapper(Mapper):
    def __init__(self, keep_original_sample: bool = True, *args, **kwargs):
Import
from data_juicer.ops.mapper.video_captioning_from_audio_mapper import VideoCaptioningFromAudioMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| keep_original_sample | bool | No | Whether to keep the original sample alongside the captioned version (default: True) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with audio-derived captions inserted into text |
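The effect of `keep_original_sample` on the output can be sketched with a model-free stand-in. Everything here is an assumption made for illustration: the function `caption_sample`, the stub `fake_caption`, and the field names `text` and `videos` are hypothetical, and the sketch simply overwrites the text with the captions, whereas the real operator inserts them into the existing sample text.

```python
import copy
from typing import Callable, Dict, List


def caption_sample(
    sample: Dict,
    caption_fn: Callable[[str], str],
    keep_original_sample: bool = True,
) -> List[Dict]:
    """Return the captioned sample, optionally preceded by the original.

    `caption_fn` is a hypothetical stand-in for the Qwen-Audio call; it maps
    a video path to a caption string.
    """
    # Work on a copy so the original sample is never mutated.
    captioned = copy.deepcopy(sample)
    captions = [caption_fn(path) for path in sample.get("videos", [])]
    # Simplification: replace the text with the captions.
    captioned["text"] = " ".join(captions)
    return [sample, captioned] if keep_original_sample else [captioned]


# Stub captioner so the sketch runs without any model.
def fake_caption(path: str) -> str:
    return f"caption for {path}"


sample = {"text": "original text", "videos": ["a.mp4"]}
both = caption_sample(sample, fake_caption, keep_original_sample=True)
only_new = caption_sample(sample, fake_caption, keep_original_sample=False)
print(len(both), len(only_new))  # 2 1
```

With the default `True`, each input sample yields two output samples, so the dataset grows. Setting it to `False` keeps the sample count unchanged but discards the original text.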
Usage Examples
process:
  - video_captioning_from_audio_mapper:
      keep_original_sample: true