Implementation:Datajuicer Data juicer VideoDeduplicator
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Deduplication |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for document-level deduplication based on exact matching of video content.
Description
VideoDeduplicator extends Deduplicator. To compute a hash, it loads each sample's video files, demuxes their streams via PyAV, and feeds every video packet into an MD5 hash; the resulting hex digest is stored as 'videohash'. When consider_text is set to True, the video hash is combined with the text hash (computed via DocumentDeduplicator) into a tuple that serves as the deduplication key. The process method tracks the keys it has already seen in a hash set and keeps only the first occurrence of each. Samples without video content are kept unconditionally. The operator is also registered with LOADED_VIDEOS so that it can take part in operator fusion.
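The hashing and filtering steps described above can be sketched with the standard library alone. This is a minimal illustration, not the operator's actual code: plain `bytes` objects stand in for the video packets that PyAV would yield, and the names `md5_of_packets`, `dedup_key`, and `keep_first_occurrences`, as well as the `packets` and `text` sample fields, are hypothetical.

```python
import hashlib
from typing import Iterable, Iterator, List


def md5_of_packets(packets: Iterable[bytes]) -> str:
    """Feed every packet into one MD5 object and return its hex digest."""
    md5 = hashlib.md5()
    for packet in packets:
        md5.update(packet)
    return md5.hexdigest()


def dedup_key(sample: dict, consider_text: bool):
    """Build the deduplication key; an empty string means 'no video'."""
    packets: List[bytes] = sample.get("packets", [])
    if not packets:
        return ""
    video_hash = md5_of_packets(packets)
    if consider_text:
        # Assumption: the text hash is a plain MD5 of the text field.
        text_hash = hashlib.md5(sample.get("text", "").encode("utf-8")).hexdigest()
        return (video_hash, text_hash)
    return video_hash


def keep_first_occurrences(samples: Iterable[dict],
                           consider_text: bool = False) -> Iterator[dict]:
    """Yield samples whose key has not been seen before."""
    seen = set()
    for sample in samples:
        key = dedup_key(sample, consider_text)
        if not key:
            # Samples without video content are kept unconditionally.
            yield sample
            continue
        if key in seen:
            continue
        seen.add(key)
        yield sample


samples = [
    {"text": "a cat", "packets": [b"\x00\x01", b"\x02"]},
    {"text": "a dog", "packets": [b"\x00\x01", b"\x02"]},  # same video, other text
    {"text": "no video", "packets": []},
    {"text": "no video", "packets": []},
]

video_only = list(keep_first_occurrences(samples))
video_and_text = list(keep_first_occurrences(samples, consider_text=True))
```

With video-only keys the second sample is dropped as a duplicate of the first, while both video-less samples survive. With `consider_text=True` the differing captions produce different tuple keys, so all four samples are kept.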
Usage
Use when you need to remove exact duplicate video content from video datasets before training, avoiding data leakage and reducing dataset size.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/deduplicator/video_deduplicator.py
Signature
@OPERATORS.register_module("video_deduplicator")
class VideoDeduplicator(Deduplicator):
    def __init__(self, consider_text: bool = False,
                 *args, **kwargs):
Import
from data_juicer.ops.deduplicator.video_deduplicator import VideoDeduplicator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| consider_text | bool | No | Whether to consider text hash together with video hash for deduplication. Default: False |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | Deduplicated dataset with duplicate video samples removed |
| dup_pairs | dict | Dictionary of sampled duplicate pairs (when show_num > 0) |
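To make the dup_pairs output more concrete, the sketch below shows one plausible way such a dictionary could be assembled. It assumes that dup_pairs maps each sampled duplicate hash to the first two samples sharing it and that at most show_num hashes are sampled; the operator's real structure and sampling order may differ. The function name `sample_dup_pairs` is hypothetical.

```python
from collections import Counter
from typing import Dict, List


def sample_dup_pairs(samples: List[dict], show_num: int) -> Dict[str, List[dict]]:
    """Collect up to `show_num` duplicated hashes, two samples per hash."""
    # Count non-empty hashes; empty hashes mark samples without video.
    counts = Counter(s["videohash"] for s in samples if s["videohash"])
    dup_hashes = [h for h, c in counts.items() if c > 1][:show_num]

    dup_pairs: Dict[str, List[dict]] = {h: [] for h in dup_hashes}
    for s in samples:
        h = s["videohash"]
        if h in dup_pairs and len(dup_pairs[h]) < 2:
            dup_pairs[h].append(s)
    return dup_pairs


hashed = [
    {"id": 0, "videohash": "a"},
    {"id": 1, "videohash": "a"},
    {"id": 2, "videohash": "b"},
    {"id": 3, "videohash": "a"},
    {"id": 4, "videohash": "c"},
    {"id": 5, "videohash": "c"},
]

pairs = sample_dup_pairs(hashed, show_num=1)
no_pairs = sample_dup_pairs(hashed, show_num=0)
```

In this sketch a show_num of 0 yields an empty dictionary, which matches the table's note that pairs are only reported when show_num > 0.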
Usage Examples
process:
  - video_deduplicator:
      consider_text: true