Implementation:Huggingface Datatrove UnigramLogProbFilter
| Knowledge Sources | |
|---|---|
| Domains | Data Processing, Text Filtering, Statistical NLP |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
UnigramLogProbFilter is a document filter that computes the average unigram log probability of a document's words from English word frequency data and drops documents whose average falls below a configurable threshold.
Description
UnigramLogProbFilter extends BaseFilter to provide a statistical quality filter based on word frequency distributions. The idea, adapted from the Allen AI peS2o dataset, is that documents composed of common, real English words will have higher average unigram log probabilities, while documents containing gibberish, encoding artifacts, or non-English text will have lower scores.
At initialization, the filter downloads a unigram frequency file derived from the Google 1T corpus (hosted on S3) and computes relative word frequencies: each word's frequency is its count divided by the total count across all words in the file. The file is cached locally through Hugging Face Hub's cached_assets_path utility, so it is downloaded only once.
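The frequency computation described above can be sketched as follows. This is a minimal illustration rather than the filter's actual loading code: the in-memory sample, the `word` and `count` column names, and the helper name `load_relative_frequencies` are all assumptions made for the example.

```python
import csv
import io

# Hypothetical sample standing in for the downloaded frequency file.
# The "word" and "count" column names are assumptions for this sketch.
SAMPLE_CSV = "word,count\nthe,600\nof,300\ncat,100\n"


def load_relative_frequencies(csv_text: str) -> dict[str, float]:
    """Map each word to its count divided by the total count of all words."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {row["word"]: int(row["count"]) for row in reader}
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


frequencies = load_relative_frequencies(SAMPLE_CSV)
print(frequencies)
```

Because every count is divided by the same total, the resulting values sum to 1 (up to floating-point rounding) and can be treated as unigram probabilities.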
When filtering a document, the text is split into words using Datatrove's language-aware split_into_words utility, and each word's log probability is computed from its frequency in the table. Words not found in the table receive a default frequency of 1e-9. The average of all word log probabilities is then compared against logprobs_threshold (default: -10). Documents with an average log probability above the threshold are kept; those below are dropped.
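The scoring step can be illustrated with a small standalone sketch. Several details here are assumptions rather than confirmed behavior of the filter: whitespace splitting stands in for split_into_words, natural logarithms are used, words are lowercased before lookup, and an empty document scores 0. The toy frequency table is invented for the example.

```python
import math

DEFAULT_FREQUENCY = 1e-9  # fallback for words missing from the table


def average_logprob(text: str, frequencies: dict[str, float]) -> float:
    """Average log probability of the words in `text`."""
    # Assumption: whitespace splitting replaces the language-aware tokenizer.
    words = text.split()
    if not words:
        return 0.0  # assumption: an empty document scores 0
    # Assumption: natural log and lowercase lookup.
    logprobs = [
        math.log(frequencies.get(word.lower(), DEFAULT_FREQUENCY))
        for word in words
    ]
    return sum(logprobs) / len(logprobs)


def keep(text: str, frequencies: dict[str, float], threshold: float = -10.0) -> bool:
    """Keep the document only if its average log probability exceeds the threshold."""
    return average_logprob(text, frequencies) > threshold


# Toy frequency table, invented for illustration.
toy_frequencies = {"the": 0.05, "cat": 1e-5, "sat": 1e-5}

print(keep("The cat sat", toy_frequencies))  # common words score above -10
print(keep("xqzv blorp", toy_frequencies))   # unknown words each score log(1e-9)
```

Under the natural-log assumption, an unknown word contributes about -20.7, so a document made up mostly of out-of-vocabulary tokens ends up well below the default threshold of -10.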
Usage
Use UnigramLogProbFilter to remove low-quality or garbled documents from English text corpora. It is intended to catch documents with corrupted encoding, random character sequences, or heavily non-English content, all of which would degrade language model training.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/filters/unigram_log_probs.py
- Lines: 1-79
Signature
class UnigramLogProbFilter(BaseFilter):
    name = "🧑🍳 Unigram log-prob filter"

    def __init__(
        self,
        logprobs_threshold: float = -10,
        exclusion_writer: DiskWriter = None,
        language: str = Languages.english,
    ):
        ...

    def get_frequencies(self):
        ...

    def get_logprob(self, doc):
        ...

    def filter(self, doc: Document) -> bool:
        ...
Import
from datatrove.pipeline.filters.unigram_log_probs import UnigramLogProbFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| logprobs_threshold | float | No | Minimum average unigram log probability to keep a document (default: -10) |
| exclusion_writer | DiskWriter | No | Optional writer for saving dropped documents |
| language | str | No | Language for word splitting (default: Languages.english) |
Outputs
| Name | Type | Description |
|---|---|---|
| data | DocumentsPipeline (generator) | Yields documents whose average unigram log probability exceeds the threshold |
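The contract in the tables above can be pictured with a generic sketch: documents for which the predicate holds are yielded downstream, and the rest are optionally handed to an exclusion sink. This is not Datatrove's BaseFilter implementation; `Doc`, `run_filter`, and the list used in place of a DiskWriter are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for datatrove's Document."""
    text: str


def run_filter(
    docs: Iterable[Doc],
    predicate: Callable[[Doc], bool],
    excluded: Optional[list] = None,
) -> Iterator[Doc]:
    """Yield documents that pass `predicate`; collect the rest if a sink is given."""
    for doc in docs:
        if predicate(doc):
            yield doc
        elif excluded is not None:
            # A plain list plays the role of the optional exclusion writer.
            excluded.append(doc)


dropped: list = []
docs = [Doc("a readable sentence"), Doc("x")]
# Toy predicate, unrelated to log probabilities: keep texts longer than 3 characters.
kept = list(run_filter(docs, lambda d: len(d.text) > 3, excluded=dropped))
print([d.text for d in kept], [d.text for d in dropped])
```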
Usage Examples
Basic Usage
from datatrove.pipeline.filters.unigram_log_probs import UnigramLogProbFilter

# Use the default threshold of -10
quality_filter = UnigramLogProbFilter()

# Use a stricter threshold (documents must score above -8 to be kept)
strict_filter = UnigramLogProbFilter(logprobs_threshold=-8)