Implementation:Datajuicer Data juicer VideoCaptioningFromAudioMapper

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Processing, Mapping
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for generating video captions from a video's audio streams.

Description

VideoCaptioningFromAudioMapper is a mapper operator that generates text captions for videos from their audio streams using the Qwen-Audio model. For each video, it extracts the audio streams, passes them to the Qwen-Audio HuggingFace model with a transcription/captioning prompt, and strips special tokens from the model output with a regular expression. The generated captions are then inserted into the sample text. When keep_original_sample is enabled, the original sample is kept alongside the captioned version.
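
The token-stripping step can be sketched as follows. This is a minimal illustration, assuming the model wraps its special tokens in `<|...|>` delimiters. The pattern, the helper name `strip_special_tokens`, and the sample string are assumptions made for this example and are not taken from the operator's source.

```python
import re

# Assumed pattern: special tokens look like <|name|>. The quantifier is
# non-greedy so that adjacent tokens are matched one at a time.
SPECIAL_TOKEN_PATTERN = re.compile(r"<\|.*?\|>")


def strip_special_tokens(raw_output: str) -> str:
    """Remove <|...|>-style tokens and trim surrounding whitespace."""
    return SPECIAL_TOKEN_PATTERN.sub("", raw_output).strip()


if __name__ == "__main__":
    # Made-up model output, for illustration only.
    raw = "<|startoftranscription|><|caption|>a man is speaking<|endoftext|>"
    print(strip_special_tokens(raw))
```

A non-greedy match matters here: a greedy `.*` would span from the first `<|` to the last `|>` and delete the caption along with the tokens.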

Usage

Use this operator when video understanding depends on information carried in the audio channel. It is particularly valuable where visual content alone is insufficient, such as narrated content or dialogue-heavy scenes.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/mapper/video_captioning_from_audio_mapper.py

Signature

@OPERATORS.register_module("video_captioning_from_audio_mapper")
class VideoCaptioningFromAudioMapper(Mapper):
    def __init__(self, keep_original_sample: bool = True, *args, **kwargs):

Import

from data_juicer.ops.mapper.video_captioning_from_audio_mapper import VideoCaptioningFromAudioMapper

I/O Contract

Inputs

Name	Type	Required	Description
keep_original_sample	bool	No	Whether to keep the original sample alongside the captioned version (default: True)

Outputs

Name	Type	Description
samples	Dict	Transformed samples with audio-derived captions inserted into text
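
The effect of keep_original_sample on the output can be illustrated with a simplified sketch. It assumes a sample is a flat dict with a `text` field and that the caption simply replaces that field. The helper `expand_sample` and the field layout are hypothetical, and the sketch does not reproduce the operator's actual batched implementation.

```python
from typing import Dict, List


def expand_sample(
    sample: Dict[str, str],
    caption: str,
    keep_original_sample: bool = True,
) -> List[Dict[str, str]]:
    """Return the captioned sample, preceded by the original if requested."""
    captioned = dict(sample)      # shallow copy so the original is not mutated
    captioned["text"] = caption   # assumption: the caption replaces the text field
    if keep_original_sample:
        return [sample, captioned]
    return [captioned]


if __name__ == "__main__":
    original = {"text": "old text", "videos": "clip.mp4"}
    for out in expand_sample(original, "a woman narrates a recipe"):
        print(out)
```

Under these assumptions, one input sample yields two output samples when keep_original_sample is true and one when it is false, so the dataset can grow after this operator runs.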

Usage Examples

process:
  - video_captioning_from_audio_mapper:
      keep_original_sample: true

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
