Implementation:Datajuicer Data juicer VideoDeduplicator

From Leeroopedia
Knowledge Sources
Domains: Data_Quality, Deduplication
Last Updated: 2026-02-14 16:00 GMT

Overview

Concrete tool, provided by Data-Juicer, for document-level deduplication based on exact video content matching.

Description

VideoDeduplicator extends Deduplicator and computes hashes by loading video files, demuxing their streams via PyAV, and feeding all video packets into an MD5 hash. The resulting hex digest is stored as 'videohash'. When consider_text is set to True, it combines video hashes with text hashes (via DocumentDeduplicator) as a tuple key for deduplication. The process method uses a hash-set approach to filter duplicates, keeping only first occurrences. The operator is registered with LOADED_VIDEOS for operator fusion support. Samples without video content are kept unconditionally.
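
The hashing and filtering steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Data-Juicer's actual code: `hash_packets`, `hash_video_file`, and `dedup_first_occurrence` are hypothetical helper names, samples are assumed to be plain dicts with hypothetical `packets` and `text` fields, and the text hash is assumed to be an MD5 of the raw text (the real DocumentDeduplicator may normalize the text first).

```python
import hashlib
from typing import Iterable, List


def hash_packets(packets: Iterable[bytes]) -> str:
    """Feed every video packet into a single MD5 and return the hex digest."""
    md5 = hashlib.md5()
    for packet in packets:
        md5.update(packet)
    return md5.hexdigest()


def hash_video_file(path: str) -> str:
    """Hash the video packets of a file demuxed with PyAV (requires the `av` package)."""
    import av  # imported lazily so the rest of the sketch runs without PyAV

    with av.open(path) as container:
        # Only packets belonging to video streams contribute to the hash.
        return hash_packets(
            bytes(packet)
            for packet in container.demux()
            if packet.stream.type == "video"
        )


def dedup_first_occurrence(samples: List[dict], consider_text: bool = False) -> List[dict]:
    """Keep the first sample seen for each hash key (hypothetical sample layout)."""
    seen = set()
    kept = []
    for sample in samples:
        packets = sample.get("packets")
        if not packets:
            # Samples without video content are kept unconditionally.
            kept.append(sample)
            continue
        key = hash_packets(packets)
        if consider_text:
            # Assumption: the text hash is a plain MD5 of the raw text.
            text_hash = hashlib.md5(sample.get("text", "").encode("utf-8")).hexdigest()
            key = (key, text_hash)
        if key not in seen:
            seen.add(key)
            kept.append(sample)
    return kept
```

In this sketch, two samples with identical video bytes but different text collapse into one when `consider_text` is False, and both survive when it is True, because the tuple key then differs.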

Usage

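Use when you need to remove exact duplicate video content from video datasets before training, avoiding data leakage and reducing dataset size. Because matching relies on an MD5 of the video packets, re-encoded or trimmed copies of the same footage are not detected as duplicates.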

Code Reference

Source Location

Signature

@OPERATORS.register_module("video_deduplicator")
class VideoDeduplicator(Deduplicator):
    def __init__(self, consider_text: bool = False,
                 *args, **kwargs):

Import

from data_juicer.ops.deduplicator.video_deduplicator import VideoDeduplicator

I/O Contract

Inputs

Name | Type | Required | Description
consider_text | bool | No | Whether to consider the text hash together with the video hash for deduplication. Default: False

Outputs

Name | Type | Description
dataset | Dataset | Deduplicated dataset with duplicate video samples removed
dup_pairs | dict | Dictionary of sampled duplicate pairs (when show_num > 0)
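
The page does not spell out how `dup_pairs` is assembled. One plausible reading, sketched below, is to pick up to `show_num` hash values that occur more than once and record the first two samples for each. The function name `sample_dup_pairs` and the per-sample `hash` field are hypothetical, and the real operator may select or sample pairs differently.

```python
from collections import Counter
from typing import Dict, List


def sample_dup_pairs(samples: List[dict], show_num: int = 0) -> Dict[str, List[dict]]:
    """Collect up to `show_num` example duplicate pairs, keyed by hash (illustrative only)."""
    if show_num <= 0:
        return {}
    # Count how often each hash value appears across the dataset.
    counts = Counter(sample["hash"] for sample in samples)
    # Keep at most `show_num` hash values that appear more than once.
    dup_hashes = [h for h, n in counts.items() if n > 1][:show_num]
    pairs: Dict[str, List[dict]] = {h: [] for h in dup_hashes}
    for sample in samples:
        h = sample["hash"]
        # Record only the first two samples per duplicated hash.
        if h in pairs and len(pairs[h]) < 2:
            pairs[h].append(sample)
    return pairs
```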

Usage Examples

process:
  - video_deduplicator:
      consider_text: true
