Implementation:Datajuicer Data juicer Multimodal Utils

Knowledge Sources	Datajuicer_Data_juicer
Domains	Multimodal Processing, Image Processing, Video Processing, Audio Processing
Last Updated	2026-02-14 16:00 GMT

Overview

The mm_utils module provides functions for loading, processing, and manipulating image, audio, and video data within dataset samples, and manages the special tokens that mark multimodal content in text.

Description

The mm_utils module is the largest and most heavily used utility module in the framework and is the backbone of its multimodal data handling. Its main sections are:

Special Tokens:

SpecialTokens -- Class (built on the _MetaSpecialTokens metaclass) that manages the configurable placeholder tokens for image, audio, video, and end-of-content markers. Each token can be overridden through an environment variable whose name starts with DJ_SPECIAL_TOKEN_.
get_special_tokens, remove_special_tokens, remove_non_special_tokens -- Token query and text manipulation functions.
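
The metaclass approach described above can be illustrated with a minimal, standalone sketch. It is not the library's implementation: the class and function names are hypothetical, the exact environment variable suffixes (for example DJ_SPECIAL_TOKEN_IMAGE) are an assumption based on the prefix mentioned above, and only the image default is confirmed by this page; the other defaults are assumed to follow a similar pattern.

```python
import os


class _MetaTokensSketch(type):
    """Resolve token attributes at access time so environment overrides apply."""

    # Only the image default appears on this page; the others are assumptions.
    _defaults = {
        "image": "<__dj__image>",
        "audio": "<__dj__audio>",
        "video": "<__dj__video>",
        "eoc": "<|__dj__eoc|>",
    }

    def __getattr__(cls, name):
        # Called only when normal class attribute lookup fails.
        if name not in _MetaTokensSketch._defaults:
            raise AttributeError(name)
        # Hypothetical variable naming: DJ_SPECIAL_TOKEN_<UPPERCASE NAME>.
        env_name = "DJ_SPECIAL_TOKEN_" + name.upper()
        return os.environ.get(env_name, _MetaTokensSketch._defaults[name])


class TokensSketch(metaclass=_MetaTokensSketch):
    """Stand-in for SpecialTokens; not the library class."""


def strip_tokens_sketch(text):
    """Remove every known placeholder token from text and trim whitespace."""
    for name in _MetaTokensSketch._defaults:
        text = text.replace(getattr(TokensSketch, name), "")
    return text.strip()
```

Because the lookup happens on every attribute access, changing the environment variable takes effect without re-importing the module. Whether the real SpecialTokens class resolves values lazily in this way is not stated on this page.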

Context-Aware Data Loading:

load_data_with_context -- Unified loading function that supports caching loaded data in sample context fields, avoiding redundant file reads across multiple operators.
load_mm_bytes_from_sample -- Loads multimodal data from in-memory bytes stored in dataset samples.
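
The caching idea behind load_data_with_context can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: the context field name `__context__`, the function name, and the fake loader are all hypothetical, and the mm_bytes_key and sample_idx parameters are omitted.

```python
CONTEXT_KEY = "__context__"  # hypothetical field name, for illustration only


def load_with_context_sketch(sample, use_context, keys, load_func):
    """Load each key once, reusing items cached in the sample when enabled."""
    data = {}
    cache = sample.setdefault(CONTEXT_KEY, {}) if use_context else {}
    for key in keys:
        if key in data:
            continue  # duplicate key within this call
        if key in cache:
            data[key] = cache[key]  # cache hit: no file read
        else:
            data[key] = load_func(key)
            if use_context:
                cache[key] = data[key]  # remember for later operators
    return sample, data


calls = []


def fake_loader(path):
    """Stand-in for a real loader; records which paths were actually read."""
    calls.append(path)
    return f"data:{path}"


sample = {"images": ["a.jpg", "b.jpg"]}
sample, first = load_with_context_sketch(sample, True, sample["images"], fake_loader)
sample, second = load_with_context_sketch(sample, True, sample["images"], fake_loader)
print(calls)  # ['a.jpg', 'b.jpg'] (the second call was served from the cache)
```

Returning the sample alongside the loaded data mirrors the tuple return type in the signature below, so a caller can pass the updated sample on to the next operator.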

Image Functions:

load_image / load_images -- Load PIL images from paths or bytes, converting to RGB.
load_image_byte / load_file_byte -- Load raw bytes from files.
image_path_to_base64 / image_byte_to_base64 -- Base64 encoding utilities.
pil_to_opencv -- PIL to OpenCV conversion (RGB to BGR).
detect_faces -- Face detection using OpenCV CascadeClassifier.
calculate_resized_dimensions -- Smart dimension calculation with target size, max length, and divisibility constraints.
iou -- Intersection-over-Union calculation for bounding boxes.
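
As a reference for the iou entry, the standard Intersection-over-Union computation is sketched below. It assumes boxes are given as (x1, y1, x2, y2) corner coordinates; the box format and the function name are assumptions, and the library function may handle degenerate boxes differently.

```python
def iou_sketch(box_a, box_b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap width and height, clamped to zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


print(iou_sketch((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7, about 0.1429
```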

Audio Functions:

load_audio / load_audios -- Load audio using HuggingFace Audio feature decoder with optional sampling rate.

Video Functions:

load_video / load_videos -- Load video containers using PyAV.
get_video_duration -- Get video duration in seconds.
get_decoded_frames_from_video -- Decode all frames from a video stream.
cut_video_by_seconds -- Cut video segments by start/end timestamps with audio support.
process_each_frame -- Apply a transformation function to each video frame.
extract_key_frames / extract_key_frames_by_seconds -- Extract keyframes from video.
extract_video_frames_uniformly / extract_video_frames_uniformly_by_seconds -- Uniform frame sampling.
extract_audio_from_video -- Extract audio streams with optional time slicing.
close_video -- Safe video container cleanup to avoid memory leaks.
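
Uniform frame sampling reduces to choosing evenly spaced timestamps over the video duration. The sketch below shows one plausible rule, assuming the first and last frames are included and a single requested frame is taken from the midpoint; the function name is hypothetical and the library's exact selection rule may differ.

```python
def uniform_timestamps_sketch(duration, frame_num):
    """Return frame_num timestamps in seconds, evenly spread over [0, duration]."""
    if frame_num < 1:
        raise ValueError("frame_num must be a positive integer")
    if frame_num == 1:
        return [duration / 2]  # single frame: take the midpoint
    step = duration / (frame_num - 1)
    return [i * step for i in range(frame_num)]


print(uniform_timestamps_sketch(10.0, 5))  # [0.0, 2.5, 5.0, 7.5, 10.0]
```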

Utility Functions:

size_to_bytes -- Parse human-readable sizes (e.g., "10GB") to bytes.
timecode_string_to_seconds -- Parse "HH:MM:SS.fff" timecodes to float seconds.
parse_string_to_roi -- Parse region-of-interest strings to coordinate tuples.
insert_texts_after_placeholders -- Insert text after placeholder tokens in strings.
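
The two parsing helpers can be approximated with the standalone sketch below. It assumes binary (1024-based) units, which matches the "10GB" result shown in the usage examples on this page, and a strict "HH:MM:SS.fff" timecode layout; the function names and the set of accepted unit spellings are assumptions rather than the library's behavior.

```python
import re

# Bit shifts for binary units: 1 KB = 2**10 bytes, 1 MB = 2**20 bytes, and so on.
_UNITS = {"B": 0, "KB": 10, "MB": 20, "GB": 30, "TB": 40}


def size_to_bytes_sketch(size):
    """Parse strings such as "10GB" or "512 kb" using binary units."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]*)\s*", size)
    if match is None:
        raise ValueError(f"Cannot parse size: {size!r}")
    number = int(match.group(1))
    unit = match.group(2).upper() or "B"  # a bare number means bytes
    if unit not in _UNITS:
        raise ValueError(f"Unknown unit: {unit!r}")
    return number << _UNITS[unit]


def timecode_to_seconds_sketch(timecode):
    """Convert "HH:MM:SS.fff" to seconds as a float."""
    hours, minutes, seconds = timecode.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


print(size_to_bytes_sketch("10GB"))              # 10737418240
print(timecode_to_seconds_sketch("01:02:03.5"))  # 3723.5
```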

Usage

Use this module whenever working with multimodal data in operators. It provides the standard loading, transformation, and manipulation functions for all image, audio, and video operations throughout the framework.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/utils/mm_utils.py

Signature

class SpecialTokens(metaclass=_MetaSpecialTokens):
    image: ClassVar[str]
    audio: ClassVar[str]
    video: ClassVar[str]
    eoc: ClassVar[str]

def load_data_with_context(sample, context, loaded_data_keys,
                           load_func, mm_bytes_key=None,
                           sample_idx=None) -> tuple: ...
def load_image(path_or_bytes) -> PIL.Image: ...
def load_audio(path, sampling_rate=None) -> tuple: ...
def load_video(path, mode="r") -> av.container: ...
def cut_video_by_seconds(input_video, output_video, start_seconds,
                         end_seconds=None, video_stream_index=0): ...
def extract_key_frames(input_video, video_stream_index=0) -> list: ...
def extract_video_frames_uniformly(input_video, frame_num) -> list: ...
def extract_audio_from_video(input_video, output_audio=None,
                             start_seconds=0, end_seconds=None,
                             stream_indexes=None) -> tuple: ...
def size_to_bytes(size: str) -> int: ...
def close_video(container): ...

Import

from data_juicer.utils.mm_utils import (
    SpecialTokens, load_image, load_video, load_audio,
    cut_video_by_seconds, extract_key_frames, size_to_bytes,
    load_data_with_context, close_video
)

I/O Contract

Inputs

Name	Type	Required	Description
path_or_bytes	Union[str, bytes]	Yes	File path or raw bytes to load image/audio/video from.
sample	dict	Yes	Dataset sample containing multimodal data keys and optional context.
context	bool	Yes	Whether context-based caching is enabled.
loaded_data_keys	list	Yes	List of data paths/keys to load.
load_func	callable	Yes	Function to use for loading data items.
input_video	Union[str, av.container]	Yes	Video path or container for video operations.
frame_num	PositiveInt	Yes	Number of frames to extract uniformly.

Outputs

Name	Type	Description
image	PIL.Image	Loaded PIL image in RGB mode.
audio_data	tuple	Tuple of (audio_array, sampling_rate).
container	av.container	PyAV video container.
frames	list	List of extracted video frames.
audio_data_list	tuple	Tuple of (audio_arrays, sampling_rates, stream_indexes).

Usage Examples

from data_juicer.utils.mm_utils import (
    SpecialTokens, load_image, load_video, extract_key_frames,
    cut_video_by_seconds, size_to_bytes, close_video
)

# Special tokens
print(SpecialTokens.image)  # "<__dj__image>"

# Load and process an image
img = load_image("/data/photos/example.jpg")
print(img.size)  # (width, height)

# Load video and extract key frames; close the container even if extraction fails
container = load_video("/data/videos/sample.mp4")
try:
    key_frames = extract_key_frames(container)
    print(f"Found {len(key_frames)} key frames")
finally:
    close_video(container)

# Cut a video segment
cut_video_by_seconds("/data/input.mp4", "/data/output.mp4",
                     start_seconds=5.0, end_seconds=15.0)

# Parse a file size string (units are binary: 1 GB = 1024**3 bytes)
bytes_val = size_to_bytes("10GB")  # 10737418240

Related Pages

Datajuicer_Data_juicer_Video_Utils -- Extended video reader implementations with multiple backends
Datajuicer_Data_juicer_File_Utils -- File utility functions used for path manipulation
Datajuicer_Data_juicer_Model_Utils -- Model loading for multimodal model-based operators
