Implementation:Datajuicer Data juicer Multimodal Utils

From Leeroopedia
Knowledge Sources
Domains: Multimodal Processing, Image Processing, Video Processing, Audio Processing
Last Updated: 2026-02-14 16:00 GMT

Overview

Comprehensive multimodal utility module providing functions for loading, processing, and manipulating images, audio, and video data within dataset samples, along with special token management for multimodal content.

Description

The mm_utils module is the largest and most heavily used utility module in the framework and serves as the backbone for multimodal data handling. Its functions fall into the following groups:

Special Tokens:

  • SpecialTokens -- Class (built on the _MetaSpecialTokens metaclass) that manages the placeholder tokens for image, audio, video, and end-of-content markers. Each token can be overridden through environment variables that use the DJ_SPECIAL_TOKEN_* prefix.
  • get_special_tokens, remove_special_tokens, remove_non_special_tokens -- Token query and text manipulation functions.
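
The idea of tokens that are resolved from environment variables can be sketched as follows. This is a minimal illustration, not the module's implementation: the names DemoTokens, _MetaDemoTokens, and remove_demo_tokens are hypothetical, the exact variable name DJ_SPECIAL_TOKEN_IMAGE is an assumption based on the prefix above, and every default except the image token is a made-up placeholder.

```python
import os

# Illustrative defaults. Only the image value appears on this page;
# the other three are placeholders invented for this sketch.
_DEFAULTS = {
    "image": "<__dj__image>",
    "audio": "<demo_audio>",
    "video": "<demo_video>",
    "eoc": "<demo_eoc>",
}


class _MetaDemoTokens(type):
    """Hypothetical metaclass that resolves token names at access time."""

    def __getattr__(cls, name):
        # Called only when normal class attribute lookup fails.
        if name in _DEFAULTS:
            env_key = f"DJ_SPECIAL_TOKEN_{name.upper()}"  # assumed naming
            return os.environ.get(env_key, _DEFAULTS[name])
        raise AttributeError(name)


class DemoTokens(metaclass=_MetaDemoTokens):
    """Stand-in for a token registry; not the real SpecialTokens class."""


def remove_demo_tokens(text):
    """Strip every currently configured token from a text string."""
    for name in _DEFAULTS:
        text = text.replace(getattr(DemoTokens, name), "")
    return text.strip()
```

Resolving the value on each attribute access means a change to the environment takes effect without re-importing anything, which is one plausible reason to route the lookup through a metaclass.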

Context-Aware Data Loading:

  • load_data_with_context -- Unified loading function that supports caching loaded data in sample context fields, avoiding redundant file reads across multiple operators.
  • load_mm_bytes_from_sample -- Loads multimodal data from in-memory bytes stored in dataset samples.
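
The caching behaviour described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the real load_data_with_context: the function name load_with_cache and the "__context__" field name are hypothetical, and the real function takes additional parameters (mm_bytes_key, sample_idx) that are not modelled here.

```python
def load_with_cache(sample, use_context, keys, load_func,
                    context_field="__context__"):
    """Load each key once, reusing values cached inside the sample.

    `context_field` is a hypothetical name for the cache slot.
    Returns the (possibly updated) sample and a dict of loaded data.
    """
    cache = sample.setdefault(context_field, {}) if use_context else None
    data = {}
    for key in keys:
        if cache is not None and key in cache:
            # A previous operator already loaded this item.
            data[key] = cache[key]
            continue
        data[key] = load_func(key)
        if cache is not None:
            cache[key] = data[key]
    return sample, data
```

With caching enabled, a second operator that asks for the same keys reads them from the sample instead of calling load_func again; with caching disabled, the sample is left untouched and every call reloads.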

Image Functions:

  • load_image / load_images -- Load PIL images from paths or bytes, converting to RGB.
  • load_image_byte / load_file_byte -- Load raw bytes from files.
  • image_path_to_base64 / image_byte_to_base64 -- Base64 encoding utilities.
  • pil_to_opencv -- PIL to OpenCV conversion (RGB to BGR).
  • detect_faces -- Face detection using OpenCV CascadeClassifier.
  • calculate_resized_dimensions -- Smart dimension calculation with target size, max length, and divisibility constraints.
  • iou -- Intersection-over-Union calculation for bounding boxes.
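
As an illustration of the Intersection-over-Union calculation listed above, here is a standalone sketch. It assumes boxes are given as (x1, y1, x2, y2) corner coordinates; the box format expected by the module's own iou function is not stated on this page, and box_iou is a hypothetical name.

```python
def box_iou(box_a, box_b):
    """Intersection-over-Union for two axis-aligned boxes.

    Boxes are assumed to be (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return intersection / union if union > 0 else 0.0
```

Two identical boxes give 1.0, and boxes that do not overlap give 0.0.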

Audio Functions:

  • load_audio / load_audios -- Load audio using HuggingFace Audio feature decoder with optional sampling rate.

Video Functions:

  • load_video / load_videos -- Load video containers using PyAV.
  • get_video_duration -- Get video duration in seconds.
  • get_decoded_frames_from_video -- Decode all frames from a video stream.
  • cut_video_by_seconds -- Cut video segments by start/end timestamps with audio support.
  • process_each_frame -- Apply a transformation function to each video frame.
  • extract_key_frames / extract_key_frames_by_seconds -- Extract keyframes from video.
  • extract_video_frames_uniformly / extract_video_frames_uniformly_by_seconds -- Uniform frame sampling.
  • extract_audio_from_video -- Extract audio streams with optional time slicing.
  • close_video -- Safe video container cleanup to avoid memory leaks.
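
Uniform frame sampling comes down to choosing evenly spaced frame indices. The sketch below shows one way to pick them; it is an assumption about the general technique, not a description of how extract_video_frames_uniformly selects frames, and uniform_frame_indices is a hypothetical helper.

```python
def uniform_frame_indices(total_frames, frame_num):
    """Pick `frame_num` evenly spaced indices from `total_frames` frames.

    One possible policy: include the first and last frame when more than
    one frame is requested, and take the middle frame when only one is.
    """
    if total_frames <= 0 or frame_num <= 0:
        return []
    if frame_num >= total_frames:
        # Not enough frames to sample from, so return all of them.
        return list(range(total_frames))
    if frame_num == 1:
        return [total_frames // 2]
    step = (total_frames - 1) / (frame_num - 1)
    return [round(i * step) for i in range(frame_num)]
```

Because the step between neighbouring positions is larger than one frame whenever frame_num is smaller than total_frames, the rounded indices never repeat.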

Utility Functions:

  • size_to_bytes -- Parse human-readable sizes (e.g., "10GB") to bytes.
  • timecode_string_to_seconds -- Parse "HH:MM:SS.fff" timecodes to float seconds.
  • parse_string_to_roi -- Parse region-of-interest strings to coordinate tuples.
  • insert_texts_after_placeholders -- Insert text after placeholder tokens in strings.
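
The two parsing helpers above can be illustrated with a standalone sketch. The names parse_size and parse_timecode are hypothetical, and the accepted unit spellings are an assumption; the 1024-based multipliers are chosen to agree with the "10GB" value shown in the usage example on this page.

```python
import re

# Binary multipliers (assumed unit spellings for this sketch).
_UNITS = {
    "B": 1,
    "KB": 1024,
    "MB": 1024 ** 2,
    "GB": 1024 ** 3,
    "TB": 1024 ** 4,
}


def parse_size(size):
    """Convert a string such as "10GB" or "1.5 MB" to a byte count."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", size)
    if match is None:
        raise ValueError(f"Unrecognized size string: {size!r}")
    number, unit = float(match.group(1)), match.group(2).upper()
    if unit not in _UNITS:
        raise ValueError(f"Unknown unit in size string: {size!r}")
    return int(number * _UNITS[unit])


def parse_timecode(timecode):
    """Convert an "HH:MM:SS.fff" timecode to float seconds."""
    hours, minutes, seconds = timecode.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)
```

For example, parse_size("10GB") yields 10737418240 and parse_timecode("00:01:30.500") yields 90.5.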

Usage

Use this module whenever working with multimodal data in operators. It provides the standard loading, transformation, and manipulation functions for all image, audio, and video operations throughout the framework.

Code Reference

Source Location

Signature

class SpecialTokens(metaclass=_MetaSpecialTokens):
    image: ClassVar[str]
    audio: ClassVar[str]
    video: ClassVar[str]
    eoc: ClassVar[str]

def load_data_with_context(sample, context, loaded_data_keys,
                           load_func, mm_bytes_key=None,
                           sample_idx=None) -> tuple: ...
def load_image(path_or_bytes) -> PIL.Image: ...
def load_audio(path, sampling_rate=None) -> tuple: ...
def load_video(path, mode="r") -> av.container: ...
def cut_video_by_seconds(input_video, output_video, start_seconds,
                         end_seconds=None, video_stream_index=0): ...
def extract_key_frames(input_video, video_stream_index=0) -> list: ...
def extract_video_frames_uniformly(input_video, frame_num) -> list: ...
def extract_audio_from_video(input_video, output_audio=None,
                             start_seconds=0, end_seconds=None,
                             stream_indexes=None) -> tuple: ...
def size_to_bytes(size: str) -> int: ...
def close_video(container): ...

Import

from data_juicer.utils.mm_utils import (
    SpecialTokens, load_image, load_video, load_audio,
    cut_video_by_seconds, extract_key_frames, size_to_bytes,
    load_data_with_context, close_video
)

I/O Contract

Inputs

Name | Type | Required | Description
path_or_bytes | Union[str, bytes] | Yes | File path or raw bytes to load image/audio/video from.
sample | dict | Yes | Dataset sample containing multimodal data keys and optional context.
context | bool | Yes | Whether context-based caching is enabled.
loaded_data_keys | list | Yes | List of data paths/keys to load.
load_func | callable | Yes | Function to use for loading data items.
input_video | Union[str, av.container] | Yes | Video path or container for video operations.
frame_num | PositiveInt | Yes | Number of frames to extract uniformly.

Outputs

Name | Type | Description
image | PIL.Image | Loaded PIL image in RGB mode.
audio_data | tuple | Tuple of (audio_array, sampling_rate).
container | av.container | PyAV video container.
frames | list | List of extracted video frames.
audio_data_list | tuple | Tuple of (audio_arrays, sampling_rates, stream_indexes).

Usage Examples

from data_juicer.utils.mm_utils import (
    SpecialTokens, load_image, load_video, extract_key_frames,
    cut_video_by_seconds, size_to_bytes, close_video
)

# Special tokens
print(SpecialTokens.image)  # "<__dj__image>"

# Load and process an image
img = load_image("/data/photos/example.jpg")
print(img.size)  # (width, height)

# Load video and extract key frames; release the container even on error
container = load_video("/data/videos/sample.mp4")
try:
    key_frames = extract_key_frames(container)
    print(f"Found {len(key_frames)} key frames")
finally:
    close_video(container)

# Cut a video segment
cut_video_by_seconds("/data/input.mp4", "/data/output.mp4",
                     start_seconds=5.0, end_seconds=15.0)

# Parse a file size string
bytes_val = size_to_bytes("10GB")  # 10737418240
