
Implementation:Huggingface Datatrove SamplerFilter

From Leeroopedia
Knowledge Sources
Domains Data_Sampling, Statistics, NLP
Type Filter Module
Last Updated 2026-02-14 00:00 GMT

Overview

A concrete filter class that randomly retains a configurable fraction of documents in a datatrove pipeline, using Bernoulli sampling driven by numpy's random number generator.

Description

The SamplerFilter class extends BaseFilter and implements probabilistic document retention. On initialization, it creates a uniform random number generator from numpy.random.default_rng(seed). For each document, the filter() method draws a uniform random value in [0, 1) and returns True (keep) if the value is less than the configured rate.

Key characteristics:

  • Minimal implementation -- The entire filter logic is a single comparison: self.uniform() < self.rate.
  • Stateful RNG -- The random number generator maintains internal state across calls, so each document gets a different random draw. The state advances deterministically from the seed.
  • No content inspection -- The filter does not examine document text or metadata; the decision is purely random.
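The stateful-RNG point can be illustrated directly with numpy, independent of datatrove:

```python
import numpy as np

# Two generators built from the same seed produce identical draw sequences,
# so a seeded filter keeps or drops the same documents on every run.
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
decisions_a = [rng_a.uniform() < 0.5 for _ in range(5)]
decisions_b = [rng_b.uniform() < 0.5 for _ in range(5)]
assert decisions_a == decisions_b

# The generator's state advances across calls, so each document gets its
# own independent draw rather than a repeated value.
rng = np.random.default_rng(42)
assert rng.uniform() != rng.uniform()
```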

Usage

Use SamplerFilter to randomly subsample documents in a datatrove pipeline. Place it at any point in the pipeline where data volume reduction is desired. It is most commonly used before expensive computation steps or for creating evaluation subsets.

Code Reference

Source Location

Signature

class SamplerFilter(BaseFilter):
    name = "Sampler"

    def __init__(
        self,
        rate: float | None = 0.5,
        seed: int = None,
        exclusion_writer: DiskWriter = None,
    ):
        ...

    def filter(self, doc: Document) -> bool | tuple[bool, str]:
        ...

Import

from datatrove.pipeline.filters import SamplerFilter

I/O Contract

Inputs

Name Type Required Description
rate float No (default: 0.5) Probability of keeping each document. A rate of 0.5 keeps approximately 50% of documents.
seed int No (default: None) Random seed for reproducibility. If None, the RNG is seeded from system entropy.
exclusion_writer DiskWriter No (default: None) Optional writer to save rejected (sampled-out) documents.

Pipeline Input: A DocumentsPipeline -- any stream of Document objects. The filter does not inspect document content or metadata.

Outputs

Name Type Description
bool bool True if the document is randomly selected to be kept (uniform draw < rate)

Pipeline Output: A DocumentsPipeline containing approximately rate * N documents from the original N-document input stream. The exact count varies due to random sampling.
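The "approximately rate * N" claim can be checked with a quick standalone simulation (no datatrove involved); the kept count follows a Binomial(N, rate) distribution:

```python
import numpy as np

n_docs, rate = 100_000, 0.1
rng = np.random.default_rng(0)

# One uniform draw per document; keep those that fall below the rate
kept = int(np.sum(rng.uniform(size=n_docs) < rate))

# The expected count is n_docs * rate = 10_000, with standard deviation
# sqrt(n_docs * rate * (1 - rate)) ~= 95, so the result is close but not exact.
print(kept)
```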

Usage Examples

Keep 10% of Documents

from datatrove.pipeline.filters import SamplerFilter

# Randomly keep ~10% of documents
sampler = SamplerFilter(rate=0.1)

Reproducible Sampling

from datatrove.pipeline.filters import SamplerFilter

# Same seed produces the same sample every time
sampler = SamplerFilter(rate=0.01, seed=42)

Pipeline with Sampling for Development

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import LanguageFilter, SamplerFilter, URLFilter
from datatrove.pipeline.readers import WarcReader

# Sample 1% of data for fast iteration during development
pipeline = LocalPipelineExecutor(
    pipeline=[
        WarcReader("s3://commoncrawl/crawl-data/CC-MAIN-2024-10/"),
        SamplerFilter(rate=0.01, seed=42),
        Trafilatura(favour_precision=True, timeout=1),
        URLFilter(),
        LanguageFilter(languages=["en"], language_threshold=0.65),
    ],
    tasks=10,
)
pipeline.run()

Saving Excluded Documents

from datatrove.pipeline.filters import SamplerFilter
from datatrove.pipeline.writers import JsonlWriter

# Keep 50%, write the other 50% to disk for analysis
sampler = SamplerFilter(
    rate=0.5,
    seed=123,
    exclusion_writer=JsonlWriter("s3://my-bucket/sampled-out/"),
)

Related Pages

Principle:Huggingface_Datatrove_Random_Sampling
