Implementation:Datajuicer Data juicer VideoDeduplicator

Knowledge Sources	Datajuicer_Data_juicer
Domains	Data_Quality, Deduplication
Last Updated	2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for document-level deduplication based on exact matching of video content.

Description

VideoDeduplicator extends Deduplicator and computes hashes by loading video files, demuxing their streams via PyAV, and feeding all video packets into an MD5 hash. The resulting hex digest is stored as 'videohash'. When consider_text is set to True, it combines video hashes with text hashes (via DocumentDeduplicator) as a tuple key for deduplication. The process method uses a hash-set approach to filter duplicates, keeping only first occurrences. The operator is registered with LOADED_VIDEOS for operator fusion support. Samples without video content are kept unconditionally.
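The hashing and filtering steps described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the Data-Juicer implementation: packets are modelled as plain (stream_type, payload) tuples rather than PyAV packet objects, the function names are hypothetical, and the 'texthash' field name is a stand-in for whatever key the text deduplicator actually writes.

```python
import hashlib


def video_hash(packets):
    """Return the MD5 hex digest of all video packet payloads.

    `packets` is an iterable of (stream_type, payload_bytes) tuples, a
    simplified stand-in for demuxed packets. Returns '' when no video
    packet is present (an assumption made for this sketch).
    """
    md5 = hashlib.md5()
    seen_video = False
    for stream_type, payload in packets:
        if stream_type == "video":  # skip audio, subtitle, and other streams
            md5.update(payload)
            seen_video = True
    return md5.hexdigest() if seen_video else ""


def dedup_key(sample, consider_text):
    """Build the deduplication key: the video hash, or a (video, text) tuple."""
    if consider_text:
        return (sample["videohash"], sample["texthash"])  # hypothetical field name
    return sample["videohash"]


def keep_first_occurrences(samples, consider_text=False):
    """Hash-set filter: keep the first sample seen for each key."""
    seen = set()
    kept = []
    for sample in samples:
        if not sample["videohash"]:
            kept.append(sample)  # samples without video are kept unconditionally
            continue
        key = dedup_key(sample, consider_text)
        if key not in seen:
            seen.add(key)
            kept.append(sample)
    return kept
```

With consider_text enabled, two samples that share a video but carry different text produce different tuple keys, so both survive the filter.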

Usage

Use when you need to remove exact duplicate video content from video datasets before training, avoiding data leakage and reducing dataset size.

Code Reference

Source Location

Repository: Datajuicer_Data_juicer
File: data_juicer/ops/deduplicator/video_deduplicator.py

Signature

@OPERATORS.register_module("video_deduplicator")
class VideoDeduplicator(Deduplicator):
    def __init__(self, consider_text: bool = False,
                 *args, **kwargs):

Import

from data_juicer.ops.deduplicator.video_deduplicator import VideoDeduplicator

I/O Contract

Inputs

Name	Type	Required	Description
consider_text	bool	No	Whether to consider text hash together with video hash for deduplication. Default: False

Outputs

Name	Type	Description
dataset	Dataset	Deduplicated dataset with duplicate video samples removed
dup_pairs	dict	Dictionary of sampled duplicate pairs (when show_num > 0)
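To make the dup_pairs output more concrete, the sketch below shows one way such a dictionary could be assembled: pick up to show_num hashes that occur more than once and record the first two samples for each. The function name, the seed parameter, and the choice of two samples per hash are assumptions made for illustration; the operator's actual sampling logic may differ.

```python
import random
from collections import Counter


def sample_duplicate_pairs(samples, show_num, seed=0):
    """Return {videohash: [first_sample, second_sample]} for sampled duplicates.

    Hypothetical helper: `samples` is a list of dicts with a 'videohash'
    field; empty hashes (no video) are never treated as duplicates.
    """
    if show_num <= 0:
        return {}
    counts = Counter(s["videohash"] for s in samples if s["videohash"])
    dup_hashes = sorted(h for h, n in counts.items() if n > 1)
    if not dup_hashes:
        return {}
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    chosen = rng.sample(dup_hashes, min(show_num, len(dup_hashes)))
    pairs = {h: [] for h in chosen}
    for s in samples:
        h = s["videohash"]
        if h in pairs and len(pairs[h]) < 2:
            pairs[h].append(s)  # keep only the first two samples per hash
    return pairs
```

Returning an empty dictionary when show_num is 0 mirrors the table above, where dup_pairs is only populated when show_num > 0.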

Usage Examples

process:
  - video_deduplicator:
      consider_text: true

Related Pages

Environment:Datajuicer_Data_juicer_Python_Runtime_Environment
