Principle:Huggingface Datatrove Document Level Statistics
Overview
Computing document-level character composition metrics for overall text quality assessment.
Description
Document-level statistics measure the character-level composition of entire documents. Unlike word-level and line-level statistics that analyze structural units, these metrics operate directly on the raw character stream, providing a high-level view of document composition. The following metrics are computed per document:
- Length (length): The total number of characters in the document. This is the most basic document-level metric, useful for filtering extremely short or extremely long documents.
- Whitespace ratio (white_space_ratio): The fraction of characters that are whitespace (spaces, tabs, newlines, etc.). Abnormally high whitespace ratios may indicate sparse or padding-heavy content; very low ratios may indicate minified code or concatenated text.
- Digit ratio (digit_ratio): The fraction of characters that are digits (0-9). High digit ratios may indicate numerical data, tables, financial content, or phone number lists.
- Uppercase ratio (uppercase_ratio): The fraction of characters that are uppercase letters. High uppercase ratios may indicate all-caps text, headers-only content, or acronym-heavy documents.
- Ellipsis ratio (elipsis_ratio): The fraction of characters belonging to ellipsis sequences (... or the Unicode ellipsis character). High ellipsis ratios may indicate truncated content, clickbait, or informally written text.
- Punctuation ratio (punctuation_ratio): The fraction of characters that are punctuation marks. This captures the density of punctuation in the text, which varies by content type (e.g., code has different punctuation patterns than prose).
- Non-alpha-digit ratio (non_alpha_digit_ratio): The fraction of characters that are neither alphabetic nor numeric. This is a broad measure of "special character" density that captures whitespace, punctuation, symbols, and control characters collectively.
These metrics complement word-level and line-level analysis by exposing composition anomalies that are only visible at the character level.
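The metrics listed above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the helper name doc_stats is hypothetical, string.punctuation (ASCII only) is assumed as a stand-in for the PUNCTUATION set, and returning zeros for an empty document is a choice made for this sketch.

```python
import re
import string

# Assumption: string.punctuation (ASCII only) stands in for the PUNCTUATION set.
PUNCT_REGEX = re.compile("[" + re.escape(string.punctuation) + "]")
# Three ASCII dots, or the single Unicode ellipsis character (U+2026).
ELLIPSIS_REGEX = re.compile("|".join(re.escape(e) for e in ("...", "\u2026")))


def doc_stats(text: str) -> dict:
    """Return the seven document-level metrics for one document (hypothetical helper)."""
    # Sketch-level choice: an empty document yields 0.0 ratios instead of a ZeroDivisionError.
    denom = max(len(text), 1)
    return {
        "length": len(text),
        "white_space_ratio": sum(c.isspace() for c in text) / denom,
        # Note: str.isdigit() also accepts some non-ASCII digit characters.
        "digit_ratio": sum(c.isdigit() for c in text) / denom,
        "uppercase_ratio": sum(c.isupper() for c in text) / denom,
        # Key spelled as in the metric list above; counts characters, not matches.
        "elipsis_ratio": sum(len(m) for m in ELLIPSIS_REGEX.findall(text)) / denom,
        "punctuation_ratio": sum(len(m) for m in PUNCT_REGEX.findall(text)) / denom,
        "non_alpha_digit_ratio": sum(
            not (c.isalpha() or c.isdigit()) for c in text
        ) / denom,
    }


stats = doc_stats("Hi 42... OK!")
```

For the 12-character sample string, the sketch counts 2 whitespace characters, 2 digits, 3 uppercase letters, 3 ellipsis characters, 4 punctuation characters (the three dots plus the exclamation mark), and 6 characters that are neither alphabetic nor numeric.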
Usage
Document-level statistics are computed as part of the summary statistics pipeline for dataset quality analysis. They are typically run alongside word-level and line-level statistics to build a comprehensive profile of dataset composition.
Typical use cases include:
- Filtering documents with abnormal character composition (e.g., mostly digits, mostly uppercase)
- Detecting non-textual content such as encoded data, numerical tables, or formatted data dumps
- Characterizing dataset composition across character-level dimensions
- Identifying documents with excessive punctuation or special characters that may indicate low-quality web scrapes
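As an illustration of the filtering use case, the sketch below flags documents whose ratios exceed per-metric upper bounds. The threshold values, the violations helper, and the three sample stat records are all hypothetical; in practice, bounds would be chosen from the distribution observed in the dataset being profiled.

```python
# Hypothetical upper bounds, chosen only for illustration.
MAX_RATIOS = {
    "digit_ratio": 0.5,
    "uppercase_ratio": 0.6,
    "punctuation_ratio": 0.3,
}


def violations(stats: dict, max_ratios: dict = MAX_RATIOS) -> list:
    """Return the names of the metrics whose value exceeds its upper bound."""
    return [
        name
        for name, limit in max_ratios.items()
        if stats.get(name, 0.0) > limit
    ]


# Made-up per-document stats standing in for the output of a stats pass.
docs = [
    {"id": "prose", "digit_ratio": 0.01, "uppercase_ratio": 0.03, "punctuation_ratio": 0.02},
    {"id": "table", "digit_ratio": 0.72, "uppercase_ratio": 0.02, "punctuation_ratio": 0.10},
    {"id": "shouting", "digit_ratio": 0.00, "uppercase_ratio": 0.85, "punctuation_ratio": 0.05},
]

# Keep only the documents with no violated bound.
kept = [d["id"] for d in docs if not violations(d)]
```

Returning the list of violated metrics, rather than a bare boolean, makes it easier to see which character-level dimension caused a document to be dropped.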
Theoretical Basis
Character composition ratios provide a simple, language-agnostic quality signal. Each ratio is computed as:
ratio = count_of_matching_characters / total_characters
For ellipsis and punctuation, the matching is performed via regular expressions compiled at initialization time. The ellipsis pattern matches both the three-dot sequence (...) and the Unicode ellipsis character. The punctuation pattern matches characters from a predefined PUNCTUATION set. In both cases, the ratio counts the total characters consumed by matches (not the number of matches), so a three-character ... contributes 3 to the numerator.
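The difference between counting matches and counting matched characters can be seen in a short sketch. The pattern below is an assumption that mirrors the description above (three dots or the Unicode ellipsis character, U+2026); it is not copied from the library.

```python
import re

# Assumed pattern: the three-dot sequence or the Unicode ellipsis character.
ellipsis_regex = re.compile("|".join(re.escape(e) for e in ("...", "\u2026")))

text = "Wait... what\u2026"
matches = ellipsis_regex.findall(text)

num_matches = len(matches)                     # number of ellipsis occurrences
chars_consumed = sum(len(m) for m in matches)  # characters covered by those occurrences
ratio = chars_consumed / len(text)             # numerator uses characters, not matches
```

Here the text contains two ellipses, but they cover four characters (three dots plus one Unicode character), so the numerator is 4 out of 13 characters rather than 2.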
These ratios are simple and fast to compute, and they work well as coarse quality signals. Natural-language text in most languages falls within characteristic ranges for these metrics, so documents far outside those ranges are easy to identify as outliers.