Principle:Huggingface Datatrove Unigram Log Probability Filtering
| Knowledge Sources | |
|---|---|
| Domains | Statistical NLP, Data Quality, Text Filtering |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Unigram Log Probability Filtering is a statistical technique for assessing document quality by computing the average log probability of its words under a unigram language model derived from word frequency data.
Description
The fundamental insight behind unigram log probability filtering is that well-formed text in a given language will predominantly consist of words that appear frequently in that language. By computing the average log probability of a document's words under a unigram (single-word) frequency distribution, one obtains a simple but effective quality score. Documents with high average log probabilities are composed of common words and are likely legitimate text, while documents with low scores are likely garbled, heavily misspelled, or in the wrong language.
This technique was popularized in the context of large-scale scientific document filtering by Allen AI's peS2o dataset, where it served as a computationally cheap pre-filter before more expensive quality assessments. The simplicity of the unigram model (no context, no word order, just individual word frequencies) makes it extremely fast to compute, suitable for filtering millions of documents.
The word frequency data typically comes from large reference corpora such as the Google 1T Web corpus, which provides frequency counts for millions of English words. Each word's probability is estimated as its count divided by the total corpus count. Working with log probabilities keeps the computation numerically stable, because a product of many small probabilities becomes a sum, and dividing that sum by the number of words makes scores comparable across documents of different lengths.
Usage
Apply unigram log probability filtering as an early-stage quality gate in text processing pipelines to remove obviously bad documents (encoding artifacts, random characters, heavily non-English text) before more computationally expensive filtering stages.
Theoretical Basis
Unigram Language Model: A unigram language model assigns probability to text by treating each word as independent. The probability of a document is the product of individual word probabilities, and the log probability is the sum of individual word log probabilities. The average log probability normalizes by document length, making scores comparable across documents of different sizes.
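The relationship between the product of word probabilities and the sum of their logs can be illustrated with a minimal sketch. The per-word probability of 1e-4 and the 400-word document length are hypothetical values chosen for illustration, and natural logarithms are assumed.

```python
import math

# Hypothetical per-word probabilities for a 400-word document.
word_probs = [1e-4] * 400

# The raw product of probabilities underflows to 0.0 in double precision.
product = math.prod(word_probs)

# The sum of log probabilities stays finite and carries the same information.
log_prob = sum(math.log(p) for p in word_probs)

# Dividing by the number of words gives a length-normalized score.
avg_log_prob = log_prob / len(word_probs)

print(product)       # 0.0
print(avg_log_prob)  # close to math.log(1e-4)
```

The raw product is unusable for any document of realistic length, whereas the average log probability remains a well-behaved per-word quantity.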
Log Probability Scoring: For a document with words w_1, w_2, ..., w_n, the score is computed as:
score = (1/n) * sum(log(P(w_i))) for i = 1 to n
where P(w_i) is the unigram probability of word w_i from the reference frequency table. Words not found in the frequency table receive a small default probability (e.g., 1e-9) to avoid undefined log values while heavily penalizing out-of-vocabulary words.
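The scoring rule above can be sketched in a few lines. This is an illustrative sketch rather than the library's implementation: the function name `unigram_score`, the whitespace tokenization, the lowercasing, the neutral score for empty documents, the use of natural logarithms, and the toy probability table are all assumptions.

```python
import math

DEFAULT_PROB = 1e-9  # fallback for out-of-vocabulary words, as in the text


def unigram_score(text, probs, default=DEFAULT_PROB):
    """Average per-word log probability of `text` under the table `probs`."""
    # Assumption: lowercase and split on whitespace; real pipelines
    # typically use a proper word tokenizer.
    words = text.lower().split()
    if not words:
        return 0.0  # assumption: empty documents get a neutral score
    return sum(math.log(probs.get(w, default)) for w in words) / len(words)


# Toy probability table (hypothetical values, not from a real corpus).
toy_probs = {"the": 0.05, "cat": 1e-4, "sat": 5e-5, "on": 0.01, "mat": 2e-5}

clean = unigram_score("The cat sat on the mat", toy_probs)
garbled = unigram_score("xq zzv qpl on the", toy_probs)

print(round(clean, 2), round(garbled, 2))
```

With these toy values the clean sentence scores about -6.8, while the garbled one scores about -14.0 because three of its five tokens fall back to the default probability.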
Threshold Selection: The threshold parameter (default: -10) controls the strictness of the filter. Documents whose average log probability falls below the threshold are removed. Lower (more negative) thresholds are more permissive, allowing documents with rarer words. Higher thresholds (closer to zero) are stricter, requiring documents to consist predominantly of common words. The optimal threshold depends on the corpus and the downstream task.
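The effect of moving the threshold can be shown on a handful of precomputed scores. The document ids and scores below are hypothetical, the helper `keep` is an illustrative name, and the strict `>` comparison at the boundary is an assumption.

```python
# Hypothetical (doc_id, score) pairs; scores are average log probabilities.
scored = [("a", -6.8), ("b", -9.5), ("c", -11.2), ("d", -14.0)]


def keep(scored_docs, threshold=-10.0):
    """Return ids of documents whose score is above `threshold`."""
    # Assumption: a document scoring exactly at the threshold is dropped.
    return [doc_id for doc_id, score in scored_docs if score > threshold]


default_kept = keep(scored)            # ['a', 'b']
permissive_kept = keep(scored, -12.0)  # ['a', 'b', 'c']
strict_kept = keep(scored, -8.0)       # ['a']

print(default_kept, permissive_kept, strict_kept)
```

Lowering the threshold to -12 admits document "c", while raising it to -8 removes "b"; in practice the value would be tuned by inspecting what each setting removes on a sample of the corpus.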
Frequency Estimation: Word frequencies are derived from large reference corpora (such as the Google 1T dataset containing word frequencies from approximately 1 trillion tokens of web text). The relative frequency of each word (count / total count) serves as the maximum likelihood estimate of the unigram probability.
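The maximum likelihood estimate described above is simply a normalization of raw counts. The following sketch uses a tiny in-memory corpus in place of a real frequency table; the function name `estimate_unigram_probs` and the sample sentence are illustrative assumptions.

```python
from collections import Counter


def estimate_unigram_probs(counts):
    """Convert raw word counts into relative frequencies (count / total)."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Stand-in for a reference corpus; a real table would hold counts
# gathered from a very large body of text.
counts = Counter("the cat sat on the mat the end".split())
probs = estimate_unigram_probs(counts)

print(probs["the"])  # 0.375, since "the" is 3 of the 8 tokens
```

The resulting values sum to one over the vocabulary, so any word absent from the table has an estimated probability of zero, which is why a small default probability is needed at scoring time.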
Limitations: The unigram model has no awareness of word order, grammar, or semantic coherence. It also assumes a single reference language (typically English). Documents in other languages will score poorly regardless of their quality. For multilingual pipelines, language detection should precede or replace unigram log probability filtering.
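Both limitations can be demonstrated with a short sketch. The helper `unigram_score`, the toy English probability table, and the sample sentences are hypothetical, and natural logarithms are assumed.

```python
import math


def unigram_score(words, probs, default=1e-9):
    """Average log probability of a word list; `default` covers unseen words."""
    return math.fsum(math.log(probs.get(w, default)) for w in words) / len(words)


# Toy English probabilities (hypothetical values).
toy_probs = {"the": 0.05, "dog": 1e-4, "bit": 5e-5, "man": 2e-4}

sentence = ["the", "dog", "bit", "the", "man"]
scrambled = ["man", "the", "the", "bit", "dog"]

# The same words in a different order receive the same score.
same_score = math.isclose(
    unigram_score(sentence, toy_probs), unigram_score(scrambled, toy_probs)
)

# A well-formed sentence in another language is entirely out of vocabulary.
french = ["le", "chien", "a", "mordu", "l'homme"]
french_score = unigram_score(french, toy_probs)

print(same_score, french_score)
```

The scrambled word list is indistinguishable from the grammatical sentence, and the French sentence lands at the out-of-vocabulary floor of roughly -20.7, well below the default threshold of -10.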