Implementation:Datajuicer Data juicer VideoSplitByKeyFrameMapper
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, Mapping |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool provided by Data-Juicer for splitting videos at key frame boundaries.
Description
VideoSplitByKeyFrameMapper is a mapper operator that splits each video into segments at key frame boundaries, so the resulting clips tend to align with visual transition points in the video. It detects key frames with the video reader backend, cuts the video at those boundaries using FFmpeg or PyAV, and saves the segments either as separate files or as byte arrays, depending on whether the output format is "path" or "bytes". It then updates the sample's video references and text placeholders to match the new segments. The original sample can optionally be kept alongside the split results.
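The boundary logic described above can be sketched in plain Python. This is a minimal illustration, not the operator's actual code: the function name `segments_from_key_frames` is hypothetical, the key frame timestamps are assumed to have been extracted already, and the sketch assumes the first key frame sits at the start of the video (any content before the first key frame would be dropped).

```python
from typing import List, Tuple


def segments_from_key_frames(
    key_frame_times: List[float], duration: float
) -> List[Tuple[float, float]]:
    """Turn key frame timestamps (seconds) into (start, end) clip ranges."""
    # Sort, deduplicate, and keep only timestamps that fall inside the video.
    starts = sorted({t for t in key_frame_times if 0.0 <= t < duration})
    # Assumption for this sketch: with no usable key frame, keep one full clip.
    if not starts:
        return [(0.0, duration)]
    # Each clip ends where the next key frame begins; the last ends at duration.
    ends = starts[1:] + [duration]
    return list(zip(starts, ends))


if __name__ == "__main__":
    # Three key frames in a 10 second video give three clips.
    print(segments_from_key_frames([0.0, 2.5, 7.0], duration=10.0))
    # [(0.0, 2.5), (2.5, 7.0), (7.0, 10.0)]
```

Each range would then be handed to the chosen backend (FFmpeg or PyAV) to cut the actual clip; that step is omitted here because it depends on the backend's API.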
Usage
Use this operator when you want splits that follow the video's content rather than fixed durations. Key frames often coincide with shot boundaries and scene transitions, although encoders may also insert them at regular intervals.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/mapper/video_split_by_key_frame_mapper.py
Signature
@OPERATORS.register_module("video_split_by_key_frame_mapper")
class VideoSplitByKeyFrameMapper(Mapper):
    def __init__(
        self,
        keep_original_sample: bool = True,
        save_dir: str = None,
        video_backend: str = "av",
        ffmpeg_extra_args: str = "",
        output_format: str = "path",
        save_field: str = None,
        legacy_split_by_text_token: bool = True,
        *args,
        **kwargs,
    ):
Import
from data_juicer.ops.mapper.video_split_by_key_frame_mapper import VideoSplitByKeyFrameMapper
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| keep_original_sample | bool | No | Whether to keep the original sample (default: True) |
| save_dir | str | Yes | Directory for generated split video files; must be specified, even though the signature defaults to None |
| video_backend | str | No | Video backend: "ffmpeg" or "av" (default: "av") |
| ffmpeg_extra_args | str | No | Extra FFmpeg args for splitting video (default: "") |
| output_format | str | No | Output format: "path" or "bytes" (default: "path") |
| save_field | str | No | Name of a new field in which to save the generated video paths; if not specified, the original video field is overwritten (default: None) |
| legacy_split_by_text_token | bool | No | Whether to split by special tokens in the text field (default: True) |
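As a rough illustration of the contract in the table above, the following sketch merges user settings with the documented defaults and rejects values outside the documented choices. The helper `check_mapper_config` and the `DEFAULTS` and `ALLOWED` dictionaries are hypothetical and not part of Data-Juicer; the operator performs its own validation, which may differ.

```python
from typing import Any, Dict

# Defaults and allowed values copied from the parameter table and signature.
DEFAULTS: Dict[str, Any] = {
    "keep_original_sample": True,
    "save_dir": None,
    "video_backend": "av",
    "ffmpeg_extra_args": "",
    "output_format": "path",
    "save_field": None,
    "legacy_split_by_text_token": True,
}
ALLOWED = {
    "video_backend": {"ffmpeg", "av"},
    "output_format": {"path", "bytes"},
}


def check_mapper_config(user_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: merge user settings with defaults and validate."""
    unknown = set(user_cfg) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    cfg = {**DEFAULTS, **user_cfg}
    for name, choices in ALLOWED.items():
        if cfg[name] not in choices:
            raise ValueError(
                f"{name} must be one of {sorted(choices)}, got {cfg[name]!r}"
            )
    # The table marks save_dir as required, so this sketch rejects a missing one.
    if not cfg["save_dir"]:
        raise ValueError("save_dir must be specified")
    return cfg


if __name__ == "__main__":
    cfg = check_mapper_config({"save_dir": "/tmp/split_videos"})
    print(cfg["video_backend"], cfg["output_format"])  # av path
```

Checking settings up front like this surfaces a typo such as an unsupported backend name before any video is read.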
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Transformed samples with key-frame-split video segment file paths or bytes |
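To make the shape of a transformed sample more concrete, here is a small sketch of how one video reference and its text placeholder might expand into several. Everything in it is an assumption for illustration: the placeholder string `<__dj__video>`, the field names `text` and `videos`, the clip file names, and the helper `expand_sample` are not taken from this page, and the operator's real update logic may differ.

```python
from typing import Dict, List

# Assumed placeholder string; Data-Juicer defines its own video token constant.
VIDEO_TOKEN = "<__dj__video>"


def expand_sample(sample: Dict, clips_per_video: List[List[str]]) -> Dict:
    """Replace each video by its clips and repeat the placeholder to match."""
    pieces = sample["text"].split(VIDEO_TOKEN)
    # n placeholders split the text into n + 1 pieces.
    if len(pieces) - 1 != len(clips_per_video):
        raise ValueError("placeholder count does not match video count")
    text = pieces[0]
    for clips, tail in zip(clips_per_video, pieces[1:]):
        # One placeholder per generated clip, followed by the original text.
        text += VIDEO_TOKEN * len(clips) + tail
    videos = [clip for clips in clips_per_video for clip in clips]
    return {**sample, "text": text, "videos": videos}


if __name__ == "__main__":
    sample = {"text": "<__dj__video> a cat jumps", "videos": ["cat.mp4"]}
    out = expand_sample(sample, [["cat_0.mp4", "cat_1.mp4"]])
    print(out["text"])    # <__dj__video><__dj__video> a cat jumps
    print(out["videos"])  # ['cat_0.mp4', 'cat_1.mp4']
```

Keeping the number of placeholders equal to the number of video references is what lets downstream operators pair each clip with its position in the text.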
Usage Examples
process:
  - video_split_by_key_frame_mapper:
      save_dir: "/tmp/split_videos"
      video_backend: "av"
      keep_original_sample: false