Implementation:Huggingface Datatrove UnigramLogProbFilter

From Leeroopedia
Knowledge Sources
Domains: Data Processing, Text Filtering, Statistical NLP
Last Updated: 2026-02-14 17:00 GMT

Overview

UnigramLogProbFilter is a document filter that computes the average unigram log probability of a document's words from English word frequency data and drops documents whose average falls below a configurable threshold.

Description

UnigramLogProbFilter extends BaseFilter to provide a statistical quality filter based on word frequency distributions. The idea, adapted from the Allen AI peS2o dataset, is that documents composed of common, real English words will have higher average unigram log probabilities, while documents containing gibberish, encoding artifacts, or non-English text will have lower scores.

At initialization, the filter downloads a unigram frequency file from the Google 1T corpus (hosted on S3) and computes relative word frequencies. The frequency file is cached locally using Hugging Face Hub's cached_assets_path utility, so it is only downloaded once. Each word's frequency is its count divided by the total count across all words in the corpus.
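The normalization step described above can be sketched as follows. This is a minimal illustration rather than the library's code: it assumes the frequency file is a CSV with `word` and `count` columns, and both the sample rows and the `load_relative_frequencies` helper are hypothetical.

```python
import csv
import io

# Hypothetical sample standing in for the downloaded frequency file.
# The column names "word" and "count" are an assumption.
SAMPLE_CSV = """word,count
the,500
of,300
and,200
"""


def load_relative_frequencies(csv_file):
    """Map each word to its count divided by the total count."""
    counts = {}
    for row in csv.DictReader(csv_file):
        counts[row["word"]] = int(row["count"])
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


frequencies = load_relative_frequencies(io.StringIO(SAMPLE_CSV))
print(frequencies)
```

Because every count is divided by the same total, the resulting values sum to 1 and can be treated as unigram probabilities.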

When filtering a document, the text is split into words using Datatrove's language-aware split_into_words utility, and each word's frequency is looked up in the frequency table. Words not found in the table receive a default frequency of 1e-9. The filter takes the logarithm of each frequency, averages the results, and compares that average against logprobs_threshold (default: -10). Documents whose average exceeds the threshold are kept; all others are dropped.
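The scoring step can be illustrated with a small self-contained sketch. Several things here are assumptions rather than confirmed details of the implementation: the natural logarithm is used, whitespace splitting with lowercasing stands in for split_into_words, an empty document scores 0, and the toy frequency table is invented for the example.

```python
import math

UNKNOWN_WORD_FREQUENCY = 1e-9  # fallback frequency stated in the text above

# Hypothetical toy frequency table; real values come from the corpus file.
TOY_FREQUENCIES = {
    "the": 0.05,
    "cat": 0.0001,
    "sat": 0.0001,
    "on": 0.01,
    "mat": 0.00005,
}


def average_log_prob(text, frequencies):
    """Average log frequency of the words in `text`."""
    # Assumption: whitespace splitting and lowercasing stand in for
    # Datatrove's language-aware split_into_words.
    words = text.lower().split()
    if not words:
        return 0.0  # assumption: how an empty document is scored
    log_probs = [math.log(frequencies.get(w, UNKNOWN_WORD_FREQUENCY)) for w in words]
    return sum(log_probs) / len(log_probs)


def keep(text, frequencies, threshold=-10.0):
    """Keep a document only if its average exceeds the threshold."""
    return average_log_prob(text, frequencies) > threshold


clean = "the cat sat on the mat"
garbled = "xq7 zzkp the qwrt"
print(keep(clean, TOY_FREQUENCIES))    # True
print(keep(garbled, TOY_FREQUENCIES))  # False
```

Under the natural-log assumption, an unknown word contributes about -20.7, so a few out-of-vocabulary tokens are enough to pull a short document below the default threshold of -10.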

Usage

Use UnigramLogProbFilter to remove low-quality or garbled documents from English text corpora. It is effective at catching documents with corrupted encoding, random character sequences, or heavily non-English content that would degrade language model training.

Code Reference

Source Location

Signature

class UnigramLogProbFilter(BaseFilter):
    name = "🧑‍🍳 Unigram log-prob filter"

    def __init__(
        self,
        logprobs_threshold: float = -10,
        exclusion_writer: DiskWriter = None,
        language: str = Languages.english,
    ):
        ...

    def get_frequencies(self):
        ...

    def get_logprob(self, doc):
        ...

    def filter(self, doc: Document) -> bool:
        ...

Import

from datatrove.pipeline.filters.unigram_log_probs import UnigramLogProbFilter

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| logprobs_threshold | float | No | Minimum average unigram log probability to keep a document (default: -10) |
| exclusion_writer | DiskWriter | No | Optional writer for saving dropped documents |
| language | str | No | Language for word splitting (default: Languages.english) |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| data | DocumentsPipeline (generator) | Yields documents whose average unigram log probability exceeds the threshold |
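The contract above (passing documents are yielded, dropped documents optionally go to a writer) can be sketched without Datatrove. Everything in this block is a hypothetical stand-in: `Doc` replaces Document, the `dropped` list plays the role of exclusion_writer, and the length-based predicate replaces the log-probability check. It is not how BaseFilter is implemented.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for datatrove's Document."""
    text: str


def run_filter(
    docs: Iterable[Doc],
    predicate: Callable[[Doc], bool],
    dropped: Optional[List[Doc]] = None,
) -> Iterator[Doc]:
    """Yield documents that pass; optionally collect the rest."""
    for doc in docs:
        if predicate(doc):
            yield doc
        elif dropped is not None:
            # Plays the role of exclusion_writer in this sketch.
            dropped.append(doc)


rejected: List[Doc] = []
docs = [Doc("plain english text"), Doc("zx qv"), Doc("another ordinary sentence")]
# Hypothetical predicate standing in for the log-probability check.
kept = list(run_filter(docs, lambda d: len(d.text) > 5, rejected))
print([d.text for d in kept])
print([d.text for d in rejected])
```

Because the output is a generator, nothing is filtered until a downstream step consumes it, which is why the sketch wraps the call in `list(...)`.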

Usage Examples

Basic Usage

from datatrove.pipeline.filters.unigram_log_probs import UnigramLogProbFilter

# Use default threshold of -10
quality_filter = UnigramLogProbFilter()

# Use a stricter threshold
strict_filter = UnigramLogProbFilter(logprobs_threshold=-8)
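To get a feel for what moving the threshold from -10 to -8 means, the following back-of-the-envelope sketch estimates how large a share of out-of-vocabulary words a document can contain before it is dropped. It assumes the natural logarithm and a single hypothetical frequency of 1e-3 for every known word; `max_unknown_fraction` is an illustrative helper, not part of Datatrove.

```python
import math

# Log of the 1e-9 fallback frequency, about -20.72 with the natural log.
UNKNOWN_LOG_PROB = math.log(1e-9)


def max_unknown_fraction(threshold, known_log_prob):
    """Share of unknown words at which a document's average log
    probability reaches `threshold`, if every known word has log
    probability `known_log_prob`."""
    if known_log_prob <= threshold:
        return 0.0  # even a fully in-vocabulary document would not pass
    fraction = (threshold - known_log_prob) / (UNKNOWN_LOG_PROB - known_log_prob)
    return min(1.0, fraction)


known = math.log(1e-3)  # hypothetical typical frequency for known words
for threshold in (-10, -8):
    print(threshold, round(max_unknown_fraction(threshold, known), 3))
```

Under these assumptions the default threshold tolerates roughly 22% unknown words, while -8 tolerates roughly 8%, so the stricter setting is considerably less forgiving of rare vocabulary such as names or technical terms.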
