Principle:Huggingface Datatrove Line Level Statistics

Overview

Computing line-level statistical metrics for documents to assess structural text quality.

Description

Line-level statistics measure properties of the line structure within documents. While word-level statistics capture vocabulary characteristics, line-level metrics reveal the structural organization of text, which is a strong signal for content type and quality. The following metrics are computed per document:

Line count (n_lines): The total number of lines in the document, including empty lines. This provides a basic measure of document structure.
Average line length (avg_line_length): The mean character length of lines. Very short average line lengths may indicate list-heavy or fragmented content, while very long averages may indicate unformatted prose or concatenated text.
Short line ratio (short_line_ratio_chars_{chars}): The fraction of lines whose character count is at or below a configurable threshold, where {chars} in the metric name is the threshold value. High short line ratios can indicate navigation menus, lists, or boilerplate content.
Long line ratio (long_line_ratio_chars_{chars}): The fraction of lines whose character count is at or above a configurable threshold, where {chars} in the metric name is the threshold value. High long line ratios may indicate minified code, very long paragraphs, or improperly parsed content.
Terminal punctuation ratio (lines_ending_with_terminal_mark_ratio): The fraction of lines ending with a terminal punctuation mark (period, exclamation mark, question mark, etc.). Lines of natural prose typically end with terminal punctuation; low ratios suggest structural content like headers, lists, or tables.
Bullet point ratio (bullet_point_lines_ratio): The fraction of lines beginning with bullet point characters (-, *, or the bullet character). High ratios indicate list-dominated content.
Line duplicates (line_duplicates): The fraction of lines that appear more than once in the document. High duplication ratios indicate boilerplate, repeated headers/footers, or low-quality content.
Line character duplicates (line_char_duplicates): The fraction of total characters that belong to duplicated lines. This weights duplication by line length, giving more importance to repeated long lines than repeated short ones.
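
As a rough illustration of the length-based and bullet metrics listed above, the following sketch computes them for a single string. It is not the Datatrove implementation: the function name, the default thresholds, the bullet tuple, and the choice to strip leading whitespace before the bullet check are all assumptions made for the example.

```python
def length_and_bullet_stats(text, short_threshold=30, long_threshold=2000,
                            bullets=("-", "*", "\u2022")):
    """Sketch of per-document line metrics (hypothetical helper, not library code)."""
    # str.split always returns at least one element, so n is never zero.
    lines = text.split("\n")
    n = len(lines)
    lengths = [len(line) for line in lines]
    return {
        "n_lines": n,
        "avg_line_length": sum(lengths) / n,
        # "at or below" and "at or above" follow the metric descriptions.
        f"short_line_ratio_chars_{short_threshold}":
            sum(length <= short_threshold for length in lengths) / n,
        f"long_line_ratio_chars_{long_threshold}":
            sum(length >= long_threshold for length in lengths) / n,
        # Assumption: leading whitespace is ignored when looking for a bullet.
        "bullet_point_lines_ratio":
            sum(line.lstrip().startswith(bullets) for line in lines) / n,
    }


example = "- apples\n- pears\n\nA full sentence that is longer than the rest."
stats = length_and_bullet_stats(example, short_threshold=10, long_threshold=40)
```

Here the empty line counts toward n_lines and toward the short line ratio, which is why a list-like document with blank separators scores high on that metric.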

These metrics are particularly useful for detecting list-heavy, boilerplate, or structurally degenerate content that may not be caught by word-level analysis alone.

Usage

Line-level statistics are computed as part of the summary statistics pipeline for dataset quality analysis. They complement word-level and document-level statistics to provide a comprehensive picture of text structure.

Typical use cases include:

Detecting boilerplate-heavy documents through line duplication metrics
Identifying list-dominated or navigation-heavy web pages via bullet point and short line ratios
Filtering structurally degenerate content where most lines lack terminal punctuation
Establishing structural quality thresholds for large web-crawl datasets
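
One way these use cases might be combined is a small rule table applied to a document's computed metrics. The sketch below assumes the metrics are already available as a plain dictionary; the limit values and the helper name are invented for illustration and are not recommended settings.

```python
# Hypothetical upper limits, chosen only to make the example concrete.
HYPOTHETICAL_MAX = {
    "line_duplicates": 0.3,
    "bullet_point_lines_ratio": 0.9,
    "short_line_ratio_chars_30": 0.67,
}
# Hypothetical lower limit for the terminal punctuation ratio.
HYPOTHETICAL_MIN_TERMINAL = 0.12


def structural_flags(stats):
    """Return the names of metrics that breach the illustrative limits."""
    flags = [name for name, limit in HYPOTHETICAL_MAX.items()
             if stats.get(name, 0.0) > limit]
    if stats.get("lines_ending_with_terminal_mark_ratio", 1.0) < HYPOTHETICAL_MIN_TERMINAL:
        flags.append("lines_ending_with_terminal_mark_ratio")
    return flags


doc_stats = {
    "line_duplicates": 0.5,
    "bullet_point_lines_ratio": 0.1,
    "short_line_ratio_chars_30": 0.2,
    "lines_ending_with_terminal_mark_ratio": 0.05,
}
flags = structural_flags(doc_stats)
```

In practice the limits would be chosen by inspecting the metric distributions of the dataset at hand rather than fixed in advance.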

Theoretical Basis

Line duplication detection reuses the duplicate-finding algorithm from the Gopher repetition filter (find_duplicates), which identifies exact duplicate lines within a document. The two duplication metrics capture different aspects:

Line duplicates counts the fraction of lines that are duplicated (unweighted by length).
Line character duplicates weights by the character content of duplicated lines, so repeated long lines contribute more than repeated short lines.
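
The difference between the two metrics can be seen in a short sketch. It assumes that only repeats after the first occurrence of a line count as duplicates; the actual find_duplicates function may count differently, and the helper name here is hypothetical.

```python
def duplicate_line_stats(lines):
    """Sketch of unweighted and character-weighted line duplication."""
    seen = set()
    dup_lines = 0
    dup_chars = 0
    for line in lines:
        if line in seen:
            # Assumption: a line counts as a duplicate only when seen again.
            dup_lines += 1
            dup_chars += len(line)
        else:
            seen.add(line)
    total_chars = sum(len(line) for line in lines)
    return {
        "line_duplicates": dup_lines / len(lines) if lines else 0.0,
        "line_char_duplicates": dup_chars / total_chars if total_chars else 0.0,
    }


# Two short repeats out of four lines: half the lines, but few of the characters.
dup_stats = duplicate_line_stats(["ok", "ok", "ok", "A long paragraph line."])
```

With this input, line duplicates is 0.5 while line character duplicates is 4/28, showing how repeated short lines weigh less in the character-weighted metric.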

Terminal punctuation detection uses the END_PUNCTUATION set from the C4 filter module, which includes standard sentence-ending characters appropriate for English and many other languages.
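
A minimal sketch of the terminal punctuation check follows. The tuple below is an illustrative stand-in, not the contents of END_PUNCTUATION, which may include additional characters; the sketch also tests the raw end of each line without stripping trailing whitespace, which is an assumption.

```python
# Illustrative stand-in for the real END_PUNCTUATION set (assumption).
ILLUSTRATIVE_END_MARKS = (".", "?", "!")


def terminal_mark_ratio(lines, end_marks=ILLUSTRATIVE_END_MARKS):
    """Fraction of lines whose last character is one of end_marks."""
    if not lines:
        return 0.0
    # str.endswith accepts a tuple and matches any of its members.
    return sum(line.endswith(end_marks) for line in lines) / len(lines)


ratio = terminal_mark_ratio(
    ["Contents", "Chapter 1", "It was a dark night.", "Who goes there?"]
)
```

The two heading-like lines lower the ratio to 0.5, which is the kind of signal that separates prose from headers and lists.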

The optional ignore_empty_lines parameter controls whether empty lines are excluded from ratio computations (though they are always counted in n_lines), allowing the metrics to focus on substantive content lines.
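
The effect of this parameter can be sketched as follows. The helper names are hypothetical, and treating whitespace-only lines as empty is an assumption rather than something stated above.

```python
def split_for_stats(text, ignore_empty_lines=False):
    """Return (n_lines, lines used for ratio computations)."""
    all_lines = text.split("\n")
    # n_lines always counts every line, including empty ones.
    n_lines = len(all_lines)
    if ignore_empty_lines:
        # Assumption: whitespace-only lines are treated as empty.
        ratio_lines = [line for line in all_lines if line.strip()]
    else:
        ratio_lines = all_lines
    return n_lines, ratio_lines


def short_line_ratio(text, threshold, ignore_empty_lines=False):
    """Short line ratio computed over the lines selected by the flag."""
    _, lines = split_for_stats(text, ignore_empty_lines)
    if not lines:
        return 0.0
    return sum(len(line) <= threshold for line in lines) / len(lines)


sample = "Title\n\nBody text here.\n"
n_lines, _ = split_for_stats(sample, ignore_empty_lines=True)
with_empty = short_line_ratio(sample, threshold=5)
without_empty = short_line_ratio(sample, threshold=5, ignore_empty_lines=True)
```

For this sample, n_lines stays at 4 in both modes, while the short line ratio drops from 0.75 to 0.5 once the two empty lines are excluded.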
