Principle: Huggingface Datatrove Random Sampling
| Property | Value |
|---|---|
| Principle Name | Random_Sampling |
| Overview | Probabilistically sampling a fraction of documents from a data stream for analysis or reduced processing |
| Domains | Data_Sampling, Statistics, NLP |
| Related Implementation | Huggingface_Datatrove_SamplerFilter |
| Knowledge Sources | Huggingface_Datatrove |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Random sampling keeps each document in a data pipeline with independent probability p, producing an approximate p-fraction of the original dataset. This is a fundamental data reduction technique used in large-scale data processing to create manageable subsets for analysis, evaluation, or reduced-cost processing.
Description
Random sampling in datatrove operates as a streaming filter: each document is independently retained with probability equal to the configured rate parameter. The decision for each document is made by drawing a uniform random number in [0, 1) and comparing it to the rate threshold.
Key properties:
- Independence -- Each document's retention decision is independent of all other documents. There is no inter-document coordination or fixed output size.
- Approximate fraction -- For a dataset of N documents with rate p, the expected number of retained documents is Np, with standard deviation sqrt(Np(1-p)). For large N, the actual fraction closely approximates p.
- Reproducibility -- An optional integer seed can be provided to make the sampling deterministic. Given the same seed and the same input stream order, the same subset of documents will be selected.
- Streaming compatibility -- Because each decision is independent, the filter operates in a single pass over the data stream with O(1) memory overhead (beyond the random number generator state).
Usage
Random sampling is used in several contexts within a data processing pipeline:
- Before statistics computation -- To reduce the cost of computing corpus-level statistics (e.g., token frequency distributions, quality score histograms) by operating on a representative subset.
- Creating evaluation subsets -- To produce smaller, randomly selected test sets for evaluating pipeline changes or model quality.
- Development and debugging -- To quickly iterate on pipeline logic with a small fraction of the full dataset.
- Cost reduction -- To reduce the volume of data flowing through expensive downstream processing steps (e.g., model-based quality scoring).
Theoretical Basis
- Bernoulli sampling -- Each document undergoes an independent Bernoulli trial with success probability p (the rate). This is the simplest form of random sampling and is ideally suited to streaming data where the total population size is unknown in advance.
- Uniform random number generation -- The implementation uses numpy.random.default_rng(seed).uniform(), which is based on the PCG64 pseudorandom number generator. This provides high-quality randomness with a period of 2^128, sufficient for any practical dataset size.
- Expected sample size -- For a population of N documents, the expected sample size is E[n] = Np with variance Var[n] = Np(1-p). The relative standard deviation decreases as 1/sqrt(Np), so for large datasets the sample fraction converges tightly to the target rate.
- Unbiasedness -- Bernoulli sampling produces an unbiased sample: the inclusion probability is identical for every document regardless of its position in the stream or its content. This ensures that downstream statistics computed on the sample are unbiased estimators of the corresponding population statistics.
- Seed-based reproducibility -- Providing a fixed seed ensures that the pseudorandom sequence is identical across runs. This is essential for reproducible experiments, where the same sample must be selected when re-running a pipeline with identical input data.
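The expected-sample-size formulas above can be checked numerically; the values N = 100,000, p = 0.1, and seed 42 below are arbitrary choices for illustration.

```python
import math

import numpy as np

N, p = 100_000, 0.1
rng = np.random.default_rng(42)

# Empirical sample size from one Bernoulli-sampling pass over N documents.
n = int(np.sum(rng.uniform(size=N) < p))

expected = N * p                   # E[n] = Np = 10_000
std = math.sqrt(N * p * (1 - p))   # sqrt(Np(1-p)) ~ 94.9

# With overwhelming probability, n falls within a few standard deviations
# of Np, so the realized fraction n/N is very close to the target rate p.
assert abs(n - expected) < 5 * std
```

Note that the relative deviation std/expected = sqrt((1-p)/(Np)) is under 1% here, and shrinks further as N grows.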
Related Pages
- Huggingface_Datatrove_SamplerFilter (implements this principle) -- Concrete filter class for random document sampling
- Huggingface_Datatrove_Language_Filtering (upstream step) -- Language filtering that may precede sampling
- Huggingface_Datatrove_URL_Filtering (upstream step) -- URL filtering applied before sampling