Implementation:Datajuicer Data juicer Multimodal Utils
| Knowledge Sources | |
|---|---|
| Domains | Multimodal Processing, Image Processing, Video Processing, Audio Processing |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
mm_utils is a multimodal utility module that provides functions for loading, processing, and manipulating image, audio, and video data within dataset samples, along with special token management for multimodal content.
Description
The mm_utils module is the largest and most heavily used utility module in the framework, serving as the backbone for multimodal data handling. Its key sections are:
Special Tokens:
- SpecialTokens -- Class (via the _MetaSpecialTokens metaclass) that manages configurable placeholder tokens for image, audio, video, and end-of-content markers. Tokens are environment-variable-configurable via DJ_SPECIAL_TOKEN_* prefixes.
- get_special_tokens, remove_special_tokens, remove_non_special_tokens -- Token query and text manipulation functions.
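The token handling described above can be illustrated with a minimal standalone sketch. It does not import Data-Juicer. Only the image token value appears on this page; the audio, video, and end-of-content strings are assumptions, and the helpers strip_tokens and keep_only_tokens are hypothetical names, not the module's own functions.

```python
import re

# Placeholder tokens. Only the image token is confirmed by this page;
# the other three values are assumptions made for illustration.
TOKENS = {
    "image": "<__dj__image>",
    "audio": "<__dj__audio>",
    "video": "<__dj__video>",
    "eoc": "<|__dj__eoc|>",
}

# One regex that matches any of the tokens literally.
_TOKEN_PATTERN = re.compile("|".join(re.escape(tok) for tok in TOKENS.values()))


def strip_tokens(text: str) -> str:
    """Remove every placeholder token, then collapse leftover whitespace."""
    cleaned = _TOKEN_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", cleaned).strip()


def keep_only_tokens(text: str) -> str:
    """Keep the placeholder tokens in order and drop all other text."""
    return "".join(_TOKEN_PATTERN.findall(text))


caption = "<__dj__image> a cat on a sofa <|__dj__eoc|>"
print(strip_tokens(caption))      # a cat on a sofa
print(keep_only_tokens(caption))  # <__dj__image><|__dj__eoc|>
```

Compiling the alternation once keeps both helpers consistent if a token value is changed.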
Context-Aware Data Loading:
- load_data_with_context -- Unified loading function that supports caching loaded data in sample context fields, avoiding redundant file reads across multiple operators.
- load_mm_bytes_from_sample -- Loads multimodal data from in-memory bytes stored in dataset samples.
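The caching idea behind context-aware loading can be sketched as follows. This is a simplified illustration, not the module's implementation: the function name load_with_cache, the "context" field name, and the returned (sample, loaded) pair are assumptions chosen for the example.

```python
from typing import Any, Callable, Dict, List, Tuple


def load_with_cache(
    sample: Dict[str, Any],
    use_context: bool,
    keys: List[str],
    load_func: Callable[[str], Any],
    context_field: str = "context",  # hypothetical field name
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Load each key once, reusing objects cached in the sample's context."""
    cache = sample.setdefault(context_field, {}) if use_context else {}
    loaded: Dict[str, Any] = {}
    for key in keys:
        if key in loaded:
            continue  # duplicate key within the same sample
        if key in cache:
            loaded[key] = cache[key]  # cache hit: no file read
        else:
            loaded[key] = load_func(key)
            if use_context:
                cache[key] = loaded[key]  # share with later operators
    return sample, loaded


# Demo with a fake loader that records how often it is called.
calls: List[str] = []


def fake_loader(path: str) -> str:
    calls.append(path)
    return path.upper()


sample = {"images": ["a.jpg", "b.jpg"]}
sample, first = load_with_cache(sample, True, sample["images"], fake_loader)
sample, second = load_with_cache(sample, True, sample["images"], fake_loader)
print(calls)  # ['a.jpg', 'b.jpg']
```

The second call is served entirely from the cache, which is the benefit when several operators touch the same files in one sample.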
Image Functions:
- load_image / load_images -- Load PIL images from paths or bytes, converting to RGB.
- load_image_byte / load_file_byte -- Load raw bytes from files.
- image_path_to_base64 / image_byte_to_base64 -- Base64 encoding utilities.
- pil_to_opencv -- PIL to OpenCV conversion (RGB to BGR).
- detect_faces -- Face detection using OpenCV CascadeClassifier.
- calculate_resized_dimensions -- Smart dimension calculation with target size, max length, and divisibility constraints.
- iou -- Intersection-over-Union calculation for bounding boxes.
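As a worked illustration of the Intersection-over-Union measure mentioned above, here is a minimal standalone sketch. It assumes boxes are given as (x1, y1, x2, y2) corner coordinates; the module's own iou function may use a different box layout, and box_iou is a hypothetical name.

```python
from typing import Sequence


def box_iou(box_a: Sequence[float], box_b: Sequence[float]) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    # Guard against degenerate (zero-area) boxes.
    return intersection / union if union > 0 else 0.0


print(box_iou((0, 0, 2, 2), (0, 0, 2, 2)))  # 1.0
print(box_iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0
```

Two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, giving 1 / (4 + 4 - 1) = 1/7.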
Audio Functions:
- load_audio / load_audios -- Load audio using HuggingFace Audio feature decoder with optional sampling rate.
Video Functions:
- load_video / load_videos -- Load video containers using PyAV.
- get_video_duration -- Get video duration in seconds.
- get_decoded_frames_from_video -- Decode all frames from a video stream.
- cut_video_by_seconds -- Cut video segments by start/end timestamps with audio support.
- process_each_frame -- Apply a transformation function to each video frame.
- extract_key_frames / extract_key_frames_by_seconds -- Extract keyframes from video.
- extract_video_frames_uniformly / extract_video_frames_uniformly_by_seconds -- Uniform frame sampling.
- extract_audio_from_video -- Extract audio streams with optional time slicing.
- close_video -- Safe video container cleanup to avoid memory leaks.
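Uniform frame sampling comes down to choosing evenly spaced frame positions. The sketch below shows one common way to pick those indices with integer arithmetic. It is an illustration only and is not taken from extract_video_frames_uniformly, whose exact spacing rule may differ; uniform_frame_indices is a hypothetical name.

```python
from typing import List


def uniform_frame_indices(total_frames: int, frame_num: int) -> List[int]:
    """Pick frame_num evenly spaced indices from range(total_frames)."""
    if frame_num <= 0:
        raise ValueError("frame_num must be positive")
    if total_frames <= 0:
        return []
    if frame_num >= total_frames:
        return list(range(total_frames))  # not enough frames: take them all
    if frame_num == 1:
        return [total_frames // 2]  # a single sample: use the middle frame
    # Spread indices from the first frame to the last frame inclusive.
    last = total_frames - 1
    return [(i * last) // (frame_num - 1) for i in range(frame_num)]


print(uniform_frame_indices(10, 4))  # [0, 3, 6, 9]
print(uniform_frame_indices(5, 3))   # [0, 2, 4]
```

Integer division avoids floating-point rounding surprises and always includes the first and last frame when at least two samples are requested.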
Utility Functions:
- size_to_bytes -- Parse human-readable sizes (e.g., "10GB") to bytes.
- timecode_string_to_seconds -- Parse "HH:MM:SS.fff" timecodes to float seconds.
- parse_string_to_roi -- Parse region-of-interest strings to coordinate tuples.
- insert_texts_after_placeholders -- Insert text after placeholder tokens in strings.
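The two parsing helpers above can be approximated with a short standalone sketch. It assumes binary (1024-based) units, which is consistent with the "10GB" to 10737418240 value in the usage example on this page, and a strict "HH:MM:SS.fff" timecode layout. The names parse_size and parse_timecode are hypothetical, and the accepted unit spellings in the real functions may be broader.

```python
import re

# Assumed binary multipliers (1 KB = 1024 bytes).
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}
_SIZE_RE = re.compile(r"\s*(\d+(?:\.\d+)?)\s*([KMGT]?B)\s*")


def parse_size(size: str) -> int:
    """Convert a string such as '10GB' or '1.5 kb' to a byte count."""
    match = _SIZE_RE.fullmatch(size.upper())
    if match is None:
        raise ValueError(f"Unrecognized size string: {size!r}")
    number, unit = match.groups()
    return int(float(number) * _UNITS[unit])


def parse_timecode(timecode: str) -> float:
    """Convert an 'HH:MM:SS.fff' timecode to seconds."""
    hours, minutes, seconds = timecode.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


print(parse_size("10GB"))              # 10737418240
print(parse_timecode("01:02:03.500"))  # 3723.5
```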
Usage
Use this module whenever working with multimodal data in operators. It provides the standard loading, transformation, and manipulation functions for all image, audio, and video operations throughout the framework.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/utils/mm_utils.py
Signature
class SpecialTokens(metaclass=_MetaSpecialTokens):
    image: ClassVar[str]
    audio: ClassVar[str]
    video: ClassVar[str]
    eoc: ClassVar[str]

def load_data_with_context(sample, context, loaded_data_keys,
                           load_func, mm_bytes_key=None,
                           sample_idx=None) -> tuple: ...
def load_image(path_or_bytes) -> PIL.Image: ...
def load_audio(path, sampling_rate=None) -> tuple: ...
def load_video(path, mode="r") -> av.container: ...
def cut_video_by_seconds(input_video, output_video, start_seconds,
                         end_seconds=None, video_stream_index=0): ...
def extract_key_frames(input_video, video_stream_index=0) -> list: ...
def extract_video_frames_uniformly(input_video, frame_num) -> list: ...
def extract_audio_from_video(input_video, output_audio=None,
                             start_seconds=0, end_seconds=None,
                             stream_indexes=None) -> tuple: ...
def size_to_bytes(size: str) -> int: ...
def close_video(container): ...
Import
from data_juicer.utils.mm_utils import (
    SpecialTokens, load_image, load_video, load_audio,
    cut_video_by_seconds, extract_key_frames, size_to_bytes,
    load_data_with_context, close_video
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path_or_bytes | Union[str, bytes] | Yes | File path or raw bytes to load the image from (used by load_image). |
| sample | dict | Yes | Dataset sample containing multimodal data keys and optional context. |
| context | bool | Yes | Whether context-based caching is enabled. |
| loaded_data_keys | list | Yes | List of data paths/keys to load. |
| load_func | callable | Yes | Function to use for loading data items. |
| input_video | Union[str, av.container] | Yes | Video path or container for video operations. |
| frame_num | PositiveInt | Yes | Number of frames to extract uniformly. |
Outputs
| Name | Type | Description |
|---|---|---|
| image | PIL.Image | Loaded PIL image in RGB mode. |
| audio_data | tuple | Tuple of (audio_array, sampling_rate). |
| container | av.container | PyAV video container. |
| frames | list | List of extracted video frames. |
| audio_data_list | tuple | Tuple of (audio_arrays, sampling_rates, stream_indexes). |
Usage Examples
from data_juicer.utils.mm_utils import (
    SpecialTokens, load_image, load_video, extract_key_frames,
    cut_video_by_seconds, size_to_bytes, close_video
)
# Special tokens
print(SpecialTokens.image) # "<__dj__image>"
# Load and process an image
img = load_image("/data/photos/example.jpg")
print(img.size) # (width, height)
# Load video and extract key frames; always release the container afterwards
container = load_video("/data/videos/sample.mp4")
try:
    key_frames = extract_key_frames(container)
    print(f"Found {len(key_frames)} key frames")
finally:
    close_video(container)
# Cut a video segment
cut_video_by_seconds("/data/input.mp4", "/data/output.mp4",
                     start_seconds=5.0, end_seconds=15.0)
# Parse a file size string
bytes_val = size_to_bytes("10GB") # 10737418240
Related Pages
- Datajuicer_Data_juicer_Video_Utils -- Extended video reader implementations with multiple backends
- Datajuicer_Data_juicer_File_Utils -- File utility functions used for path manipulation
- Datajuicer_Data_juicer_Model_Utils -- Model loading for multimodal model-based operators