Implementation:Huggingface Datatrove UnigramLogProbFilter

Knowledge Sources	Huggingface_Datatrove
Domains	Data Processing, Text Filtering, Statistical NLP
Last Updated	2026-02-14 17:00 GMT

Overview

UnigramLogProbFilter is a document filter that computes the average unigram log probability of a document's words from English word frequency data and drops documents whose average falls below a configurable threshold.

Description

UnigramLogProbFilter extends BaseFilter to provide a statistical quality filter based on word frequency distributions. The idea, adapted from the Allen AI peS2o dataset, is that documents composed of common, real English words will have higher average unigram log probabilities, while documents containing gibberish, encoding artifacts, or non-English text will have lower scores.

At initialization, the filter downloads a unigram frequency file from the Google 1T corpus (hosted on S3) and computes relative word frequencies: each word's frequency is its count divided by the total count across all words in the file. The file is cached locally using Hugging Face Hub's cached_assets_path utility, so it is downloaded only once.
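The normalization step can be sketched as follows. This is a minimal illustration rather than the library's code: the `word` and `count` column names, the helper name `relative_frequencies`, and the in-memory sample rows are all assumptions made for the example.

```python
import csv
import io
from typing import Dict


def relative_frequencies(csv_text: str) -> Dict[str, float]:
    """Turn raw word counts into relative frequencies that sum to 1."""
    # Assumed layout: a header row with "word" and "count" columns.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    total = sum(int(row["count"]) for row in rows)
    return {row["word"]: int(row["count"]) / total for row in rows}


# Made-up counts, for illustration only.
sample = "word,count\nthe,6\nof,3\ncat,1\n"
freqs = relative_frequencies(sample)
```

Because every count is divided by the same total, the resulting values can be read as probabilities of drawing each word at random from the corpus.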

When filtering a document, the text is split into words using Datatrove's language-aware split_into_words utility, and each word's frequency is looked up in the frequency table; words not found in the table receive a default frequency of 1e-9. The filter averages the log of these frequencies over all words and compares the result against logprobs_threshold (default: -10). Documents whose average log probability exceeds the threshold are kept; all others are dropped.
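The scoring rule described above can be sketched in a few lines. This is a hedged approximation, not the filter's actual implementation: whitespace splitting stands in for the language-aware splitter, and the natural logarithm, the lowercasing of words, the score of 0 for an empty document, and the toy frequency table are assumptions of this example.

```python
import math
from typing import Dict

DEFAULT_FREQUENCY = 1e-9  # fallback for words missing from the table


def average_logprob(text: str, freqs: Dict[str, float]) -> float:
    """Mean log frequency of the words in `text`."""
    # Whitespace splitting stands in for the language-aware word splitter.
    words = text.lower().split()
    if not words:
        return 0.0  # assumption: an empty document scores 0
    return sum(math.log(freqs.get(w, DEFAULT_FREQUENCY)) for w in words) / len(words)


# Toy frequency table, for illustration only.
toy_freqs = {"the": 0.05, "cat": 0.0001, "sat": 0.0001}

clean_score = average_logprob("the cat sat", toy_freqs)
garbled_score = average_logprob("xq zzv kkp", toy_freqs)

keep_clean = clean_score > -10      # every word is in the table
keep_garbled = garbled_score > -10  # every word falls back to 1e-9
```

Under the natural-log assumption, a document made up entirely of unknown words scores log(1e-9), roughly -20.7, which is well below the default threshold of -10.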

Usage

Use UnigramLogProbFilter to remove low-quality or garbled documents from English text corpora. It is designed to catch documents with corrupted encoding, random character sequences, or predominantly non-English content, all of which would degrade language model training.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/pipeline/filters/unigram_log_probs.py
Lines: 1-79

Signature

class UnigramLogProbFilter(BaseFilter):
    name = "🧑‍🍳 Unigram log-prob filter"

    def __init__(
        self,
        logprobs_threshold: float = -10,
        exclusion_writer: DiskWriter = None,
        language: str = Languages.english,
    ):
        ...

    def get_frequencies(self):
        ...

    def get_logprob(self, doc):
        ...

    def filter(self, doc: Document) -> bool:
        ...

Import

from datatrove.pipeline.filters.unigram_log_probs import UnigramLogProbFilter

I/O Contract

Inputs

Name	Type	Required	Description
logprobs_threshold	float	No	Minimum average unigram log probability to keep a document (default: -10)
exclusion_writer	DiskWriter	No	Optional writer for saving dropped documents
language	str	No	Language for word splitting (default: Languages.english)

Outputs

Name	Type	Description
data	DocumentsPipeline (generator)	Yields documents whose average unigram log probability exceeds the threshold
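The input/output contract above can be illustrated with a plain-Python stand-in. This is a sketch under stated assumptions and does not use Datatrove itself: documents are plain strings, `keep_above_threshold` is a hypothetical helper, a list plays the role of the exclusion writer, and the scores are made up.

```python
from typing import Callable, Iterable, Iterator, List, Optional


def keep_above_threshold(
    docs: Iterable[str],
    score: Callable[[str], float],
    threshold: float = -10,
    dropped: Optional[List[str]] = None,
) -> Iterator[str]:
    """Yield documents scoring strictly above `threshold`; collect the rest."""
    for doc in docs:
        if score(doc) > threshold:
            yield doc
        elif dropped is not None:
            dropped.append(doc)  # plays the role of an exclusion writer


# Precomputed scores, for illustration only.
scores = {"plain english text": -7.5, "xq zzv kkp": -20.7, "borderline": -10.0}
rejected: List[str] = []
kept = list(keep_above_threshold(scores, scores.get, dropped=rejected))
```

Only the first document is yielded here. The document scoring exactly -10.0 is dropped because the comparison in this sketch is strict, matching the "exceeds the threshold" wording in the Outputs table.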

Usage Examples

Basic Usage

from datatrove.pipeline.filters.unigram_log_probs import UnigramLogProbFilter

# Use default threshold of -10
quality_filter = UnigramLogProbFilter()

# Use a stricter threshold (a higher value drops more documents)
strict_filter = UnigramLogProbFilter(logprobs_threshold=-8)
