Implementation: Huggingface Datatrove SamplerFilter
| Knowledge Sources | |
|---|---|
| Domains | Data_Sampling, Statistics, NLP |
| Type | Filter Module |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Concrete filter class that randomly retains a configurable fraction of documents in a datatrove pipeline using Bernoulli sampling with numpy's random number generator.
Description
The SamplerFilter class extends BaseFilter and implements probabilistic document retention. On initialization, it creates a uniform random number generator from numpy.random.default_rng(seed). For each document, the filter() method draws a uniform random value in [0, 1) and returns True (keep) if the value is less than the configured rate.
Key characteristics:
- Minimal implementation -- The entire filter logic is a single comparison: self.uniform() < self.rate.
- Stateful RNG -- The random number generator maintains internal state across calls, so each document gets a different random draw. The state advances deterministically from the seed.
- No content inspection -- The filter does not examine document text or metadata; the decision is purely random.
Usage
Use SamplerFilter to randomly subsample documents in a datatrove pipeline. Place it at any point in the pipeline where data volume reduction is desired. It is most commonly used before expensive computation steps or for creating evaluation subsets.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/filters/sampler_filter.py
- Lines: 8-28
Signature
class SamplerFilter(BaseFilter):
name = "Sampler"
def __init__(
self,
rate: float | None = 0.5,
seed: int = None,
exclusion_writer: DiskWriter = None,
):
...
def filter(self, doc: Document) -> bool | tuple[bool, str]:
...
Import
from datatrove.pipeline.filters import SamplerFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| rate | float | No (default: 0.5) | Probability of keeping each document. A rate of 0.5 keeps approximately 50% of documents. |
| seed | int | No (default: None) | Random seed for reproducibility. If None, the RNG is seeded from system entropy. |
| exclusion_writer | DiskWriter | No (default: None) | Optional writer to save rejected (sampled-out) documents |
Pipeline Input: A DocumentsPipeline -- any stream of Document objects. The filter does not inspect document content or metadata.
Outputs
| Name | Type | Description |
|---|---|---|
| bool | bool | True if the document is randomly selected to be kept (uniform draw < rate) |
Pipeline Output: A DocumentsPipeline containing approximately rate * N documents from the original N-document input stream. The exact count varies due to random sampling.
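The variation in the exact count can be quantified: each document is an independent Bernoulli trial, so the number of kept documents follows Binomial(N, rate). A small sketch (illustrative, not datatrove code) shows this for rate = 0.1 over 10,000 documents:

```python
import numpy as np

# Simulate the sampler's keep decisions for n documents at rate=0.1.
# The kept count follows Binomial(n, rate): mean n * rate,
# standard deviation sqrt(n * rate * (1 - rate)) ~ 30 here.
rng = np.random.default_rng(42)
rate, n = 0.1, 10_000
kept = int((rng.uniform(size=n) < rate).sum())
# Expected roughly 1000 kept documents, typically within a few dozen.
```

For large N the relative deviation shrinks as 1/sqrt(N), so the kept fraction converges to rate.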
Usage Examples
Keep 10% of Documents
from datatrove.pipeline.filters import SamplerFilter
# Randomly keep ~10% of documents
sampler = SamplerFilter(rate=0.1)
Reproducible Sampling
from datatrove.pipeline.filters import SamplerFilter
# Same seed produces the same sample every time
sampler = SamplerFilter(rate=0.01, seed=42)
Pipeline with Sampling for Development
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import LanguageFilter, SamplerFilter, URLFilter
from datatrove.pipeline.readers import WarcReader
# Sample 1% of data for fast iteration during development
pipeline = LocalPipelineExecutor(
pipeline=[
WarcReader("s3://commoncrawl/crawl-data/CC-MAIN-2024-10/"),
SamplerFilter(rate=0.01, seed=42),
Trafilatura(favour_precision=True, timeout=1),
URLFilter(),
LanguageFilter(languages=["en"], language_threshold=0.65),
],
tasks=10,
)
pipeline.run()
Saving Excluded Documents
from datatrove.pipeline.filters import SamplerFilter
from datatrove.pipeline.writers import JsonlWriter
# Keep 50%, write the other 50% to disk for analysis
sampler = SamplerFilter(
rate=0.5,
seed=123,
exclusion_writer=JsonlWriter("s3://my-bucket/sampled-out/"),
)
Related Pages
- Huggingface_Datatrove_Random_Sampling (principle) -- The principle this implementation realizes
- Huggingface_Datatrove_LanguageFilter (related filter) -- Language filtering that may precede or follow sampling
- Huggingface_Datatrove_URLFilter (related filter) -- URL filtering typically applied before sampling