Implementation: Huggingface Datatrove SamplerFilter
| Knowledge Sources | |
|---|---|
| Domains | Data_Sampling, Statistics, NLP |
| Type | Filter Module |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Concrete filter class that randomly retains a configurable fraction of documents in a datatrove pipeline using Bernoulli sampling with numpy's random number generator.
Description
The SamplerFilter class extends BaseFilter and implements probabilistic document retention. On initialization, it creates a uniform random number generator from numpy.random.default_rng(seed). For each document, the filter() method draws a uniform random value in [0, 1) and returns True (keep) if the value is less than the configured rate.
Key characteristics:
- Minimal implementation -- The entire filter logic is a single comparison: self.uniform() < self.rate.
- Stateful RNG -- The random number generator maintains internal state across calls, so each document gets a different random draw. The state advances deterministically from the seed.
- No content inspection -- The filter does not examine document text or metadata; the decision is purely random.
Usage
Use SamplerFilter to randomly subsample documents in a datatrove pipeline. Place it at any point in the pipeline where data volume reduction is desired. It is most commonly used before expensive computation steps or for creating evaluation subsets.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/filters/sampler_filter.py
- Lines: 8-28
Signature
class SamplerFilter(BaseFilter):
name = "Sampler"
def __init__(
self,
rate: float | None = 0.5,
seed: int = None,
exclusion_writer: DiskWriter = None,
):
...
def filter(self, doc: Document) -> bool | tuple[bool, str]:
...
Import
from datatrove.pipeline.filters import SamplerFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| rate | float | No (default: 0.5) | Probability of keeping each document. A rate of 0.5 keeps approximately 50% of documents. |
| seed | int | No (default: None) | Random seed for reproducibility. If None, the RNG is seeded from system entropy. |
| exclusion_writer | DiskWriter | No (default: None) | Optional writer to save rejected (sampled-out) documents |
Pipeline Input: A DocumentsPipeline -- any stream of Document objects. The filter does not inspect document content or metadata.
Outputs
| Name | Type | Description |
|---|---|---|
| bool | bool | True if the document is randomly selected to be kept (uniform draw < rate) |
Pipeline Output: A DocumentsPipeline containing approximately rate * N documents from the original N-document input stream. The exact count varies due to random sampling.
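The variation in the exact count can be quantified: each document is an independent Bernoulli trial, so the number of kept documents follows Binomial(N, rate). A small sketch (illustrative, not datatrove code) shows this for rate = 0.1 over 10,000 documents:

```python
import numpy as np

# Simulate the sampler's keep decisions for n documents at rate=0.1.
# The kept count follows Binomial(n, rate): mean n * rate,
# standard deviation sqrt(n * rate * (1 - rate)) ~ 30 here.
rng = np.random.default_rng(42)
rate, n = 0.1, 10_000
kept = int((rng.uniform(size=n) < rate).sum())
# Expected roughly 1000 kept documents, typically within a few dozen.
```

For large N the relative deviation shrinks as 1/sqrt(N), so the kept fraction converges to rate.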
Usage Examples
Keep 10% of Documents
from datatrove.pipeline.filters import SamplerFilter
# Randomly keep ~10% of documents
sampler = SamplerFilter(rate=0.1)
Reproducible Sampling
from datatrove.pipeline.filters import SamplerFilter
# Same seed produces the same sample every time
sampler = SamplerFilter(rate=0.01, seed=42)
Pipeline with Sampling for Development
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import LanguageFilter, SamplerFilter, URLFilter
from datatrove.pipeline.readers import WarcReader
# Sample 1% of data for fast iteration during development
pipeline = LocalPipelineExecutor(
pipeline=[
WarcReader("s3://commoncrawl/crawl-data/CC-MAIN-2024-10/"),
SamplerFilter(rate=0.01, seed=42),
Trafilatura(favour_precision=True, timeout=1),
URLFilter(),
LanguageFilter(languages=["en"], language_threshold=0.65),
],
tasks=10,
)
pipeline.run()
Saving Excluded Documents
from datatrove.pipeline.filters import SamplerFilter
from datatrove.pipeline.writers import JsonlWriter
# Keep 50%, write the other 50% to disk for analysis
sampler = SamplerFilter(
rate=0.5,
seed=123,
exclusion_writer=JsonlWriter("s3://my-bucket/sampled-out/"),
)
Related Pages
- Huggingface_Datatrove_Random_Sampling (principle) -- The principle this implementation realizes
- Huggingface_Datatrove_LanguageFilter (related filter) -- Language filtering that may precede or follow sampling
- Huggingface_Datatrove_URLFilter (related filter) -- URL filtering typically applied before sampling