Implementation:Datajuicer Data juicer VideoExtractFramesMapper

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Processing, Mapping
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for extracting frames from video files.

Description

VideoExtractFramesMapper is a mapper operator that extracts frames from video files using a configurable sampling method and outputs them either as file paths or as in-memory bytes. It supports two frame sampling methods, "all_keyframes" for extracting keyframes and "uniform" for evenly spaced extraction, with optional duration-based video segmentation. Frames are output in "path" format (saved to a configurable directory) or "bytes" format (loaded into memory), and the frame information is stored in the sample's metadata.
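To make the "uniform" method and the duration-based segmentation more concrete, the following is a minimal sketch of how evenly spaced timestamps could be chosen. It is not Data-Juicer's implementation: the helper names (segment_bounds, uniform_timestamps, sample_video), the choice of slice midpoints, and the assumption that frame_num frames are taken from each segment are all illustrative.

```python
from typing import List, Tuple


def segment_bounds(video_length: float, duration: float) -> List[Tuple[float, float]]:
    """Split [0, video_length] into segments of `duration` seconds.

    A duration of 0 (the operator's default) is treated as "one segment
    covering the entire video".
    """
    if duration <= 0:
        return [(0.0, video_length)]
    bounds = []
    start = 0.0
    while start < video_length:
        bounds.append((start, min(start + duration, video_length)))
        start += duration
    return bounds


def uniform_timestamps(start: float, end: float, frame_num: int) -> List[float]:
    """Return frame_num evenly spaced timestamps (seconds) inside [start, end].

    Assumption: each timestamp is the midpoint of one of frame_num equal
    slices. The real operator may place its samples differently.
    """
    if frame_num < 1:
        raise ValueError("frame_num must be a positive integer")
    if end <= start:
        raise ValueError("end must be greater than start")
    step = (end - start) / frame_num
    return [start + step * (i + 0.5) for i in range(frame_num)]


def sample_video(video_length: float, frame_num: int, duration: float = 0.0) -> List[float]:
    """Collect uniform timestamps for every segment of the video."""
    stamps: List[float] = []
    for seg_start, seg_end in segment_bounds(video_length, duration):
        stamps.extend(uniform_timestamps(seg_start, seg_end, frame_num))
    return stamps


if __name__ == "__main__":
    # 12-second video, 3 frames, no segmentation.
    print(sample_video(12.0, 3))
    # 10-second video, 2 frames per 4-second segment.
    print(sample_video(10.0, 2, duration=4.0))
```

Under these assumptions, a 12-second video sampled with frame_num=3 and duration=0 yields timestamps at 2, 6, and 10 seconds. The "all_keyframes" method is not sketched here because keyframe positions depend on how the video was encoded.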

Usage

Use when you need to extract frames from videos as a prerequisite step for downstream video processing operators such as captioning, tagging, pose estimation, or any frame-level analysis.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/mapper/video_extract_frames_mapper.py

Signature

@OPERATORS.register_module("video_extract_frames_mapper")
class VideoExtractFramesMapper(Mapper):
    def __init__(
        self,
        frame_sampling_method: str = "all_keyframes",
        output_format: str = "path",
        frame_num: PositiveInt = 3,
        duration: float = 0,
        frame_dir: str = None,
        frame_key: str = None,
        frame_field: str = MetaKeys.video_frames,
        legacy_split_by_text_token: bool = True,
        video_backend: str = "av",
        *args,
        **kwargs,
    ):

Import

from data_juicer.ops.mapper.video_extract_frames_mapper import VideoExtractFramesMapper

I/O Contract

Inputs

Name	Type	Required	Description
frame_sampling_method	str	No	Sampling method: "all_keyframes" or "uniform" (default: "all_keyframes")
output_format	str	No	Output format: "path" or "bytes" (default: "path")
frame_num	PositiveInt	No	Number of frames for uniform sampling (default: 3)
duration	float	No	Duration of each segment in seconds; 0 for entire video (default: 0)
frame_dir	str	No	Output directory for extracted frames (required when output_format is "path")
frame_key	str	No	Deprecated field name for frame info; use frame_field instead
frame_field	str	No	Field name for generated frames info (default: "video_frames")
legacy_split_by_text_token	bool	No	Whether to split by special tokens in text field (default: True)
video_backend	str	No	Video backend: "ffmpeg" or "av" (default: "av")
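The constraints in the table above can be checked before a pipeline is launched. The sketch below is a hypothetical pre-flight validator, not part of Data-Juicer: the function name check_config and the DEFAULTS dictionary are illustrative, the allowed values and defaults are copied from the table, and the rule that duration must not be negative is an assumption.

```python
# Allowed values and defaults, as listed in the Inputs table.
ALLOWED_SAMPLING = {"all_keyframes", "uniform"}
ALLOWED_FORMATS = {"path", "bytes"}
ALLOWED_BACKENDS = {"ffmpeg", "av"}

DEFAULTS = {
    "frame_sampling_method": "all_keyframes",
    "output_format": "path",
    "frame_num": 3,
    "duration": 0,
    "frame_dir": None,
    "video_backend": "av",
}


def check_config(user_cfg: dict) -> dict:
    """Merge user settings over the defaults and raise ValueError on conflicts."""
    cfg = {**DEFAULTS, **user_cfg}
    errors = []

    if cfg["frame_sampling_method"] not in ALLOWED_SAMPLING:
        errors.append(f"unknown frame_sampling_method: {cfg['frame_sampling_method']!r}")
    if cfg["output_format"] not in ALLOWED_FORMATS:
        errors.append(f"unknown output_format: {cfg['output_format']!r}")
    if cfg["video_backend"] not in ALLOWED_BACKENDS:
        errors.append(f"unknown video_backend: {cfg['video_backend']!r}")

    frame_num = cfg["frame_num"]
    if isinstance(frame_num, bool) or not isinstance(frame_num, int) or frame_num < 1:
        errors.append("frame_num must be a positive integer")

    # Assumption: a negative segment duration is never meaningful.
    if cfg["duration"] < 0:
        errors.append("duration must be 0 (entire video) or a positive number")

    # Per the table, frame_dir is needed when frames are written to disk.
    if cfg["output_format"] == "path" and not cfg["frame_dir"]:
        errors.append('frame_dir is required when output_format is "path"')

    if errors:
        raise ValueError("; ".join(errors))
    return cfg


if __name__ == "__main__":
    print(check_config({"frame_sampling_method": "uniform",
                        "frame_num": 8,
                        "frame_dir": "/tmp/frames"}))
```

Collecting every problem before raising lets a single run report all conflicting settings at once, which can be convenient when a configuration is assembled from several sources.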

Outputs

Name	Type	Description
samples	Dict	Transformed samples; the extracted frame paths or bytes are stored in each sample's metadata under frame_field

Usage Examples

process:
  - video_extract_frames_mapper:
      frame_sampling_method: "uniform"
      frame_num: 8
      frame_dir: "/tmp/frames"
      output_format: "path"
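The same settings can also be passed to the constructor directly from Python. This is a sketch that relies only on the import path and signature listed above; OP_KWARGS and build_op are illustrative names, the import is deferred so the snippet loads even where Data-Juicer is not installed, and how the resulting operator is then applied to a dataset depends on the Data-Juicer version in use.

```python
# Keyword arguments mirroring the YAML configuration above.
OP_KWARGS = {
    "frame_sampling_method": "uniform",
    "frame_num": 8,
    "frame_dir": "/tmp/frames",
    "output_format": "path",
}


def build_op():
    """Instantiate the operator; requires Data-Juicer to be installed."""
    # Deferred import: only needed when the operator is actually built.
    from data_juicer.ops.mapper.video_extract_frames_mapper import (
        VideoExtractFramesMapper,
    )

    return VideoExtractFramesMapper(**OP_KWARGS)


if __name__ == "__main__":
    op = build_op()
    print(type(op).__name__)
```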

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
