Implementation:Huggingface Datatrove LineStats

Overview

LineStats is a pipeline step that computes line-level statistical metrics for each document in a DocumentsPipeline. It extends BaseStats and produces per-rank JSON output files containing aggregated line-level statistics.

Signature

class LineStats(BaseStats):
    def __init__(
        self,
        output_folder: DataFolderLike,
        max_k_chars_per_line_tresholds: list[int] | None = None,
        min_k_chars_per_line_thresholds: list[int] | None = None,
        groups_to_compute: list[GROUP] = list(get_args(GROUP)),
        ignore_empty_lines: bool = False,
        histogram_round_digits: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None:

Import

from datatrove.pipeline.stats import LineStats

Parameters

Parameter	Type	Default	Description
`output_folder`	`DataFolderLike`	(required)	Path or data folder where per-rank statistics JSON files will be written.
`max_k_chars_per_line_tresholds`	`list[int] | None`	`None` (uses `[10, 30]`)	Character length thresholds for short line ratio computation. A line with length <= threshold is counted as short. Note: the parameter name is misspelled ("tresholds") in the source code and must be passed with that spelling.
`min_k_chars_per_line_thresholds`	`list[int] | None`	`None` (uses `[2000, 10000]`)	Character length thresholds for long line ratio computation. A line with length >= threshold is counted as long.
`groups_to_compute`	`list[GROUP]`	`["summary", "histogram", "fqdn", "suffix"]`	List of grouping strategies for aggregating statistics.
`ignore_empty_lines`	`bool`	`False`	Whether to exclude empty lines from ratio computations. Empty lines are always included in `n_lines` regardless of this setting.
`histogram_round_digits`	`int`	`3`	Number of decimal digits to round histogram bin values to, controlling histogram granularity.
`top_k_config`	`TopKConfig`	`DEFAULT_TOP_K_CONFIG`	Configuration for top-k truncation of high-cardinality groups (fqdn, suffix). Default retains top 100,000 keys.

Available Stats

The following statistics are extracted per document via the extract_stats method:

Stat Name	Description
`n_lines`	Number of lines in the document (always includes empty lines).
`avg_line_length`	Average character length of lines: `sum(len(line) for line in lines) / len(lines)`.
`short_line_ratio_chars_{chars}`	Ratio of lines with length <= `{chars}` characters. One stat per threshold in `max_k_chars_per_line_tresholds`.
`long_line_ratio_chars_{chars}`	Ratio of lines with length >= `{chars}` characters. One stat per threshold in `min_k_chars_per_line_thresholds`.
`lines_ending_with_terminal_mark_ratio`	Ratio of lines ending with terminal punctuation (`END_PUNCTUATION` from the C4 filter).
`bullet_point_lines_ratio`	Ratio of lines starting with bullet characters (`-`, `*`, or the Unicode bullet character).
`line_duplicates`	Ratio of lines that are exact duplicates of other lines in the document.
`line_char_duplicates`	Ratio of characters belonging to duplicated lines, relative to total character count.
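
The length-based stats in the table can be illustrated with a minimal sketch in plain Python. This is not the library's code: `line_stats_sketch` is a hypothetical helper written only from the descriptions above, and it assumes that `ignore_empty_lines` removes whitespace-only lines before the average and the ratios are computed.

```python
def line_stats_sketch(
    text: str,
    short_thresholds: tuple[int, ...] = (10, 30),
    long_thresholds: tuple[int, ...] = (2000, 10000),
    ignore_empty_lines: bool = False,
) -> dict[str, float]:
    """Hypothetical re-creation of the length-based stats described above."""
    lines = text.split("\n")
    # n_lines is taken before any filtering, so empty lines are always counted.
    n_lines = len(lines)

    if ignore_empty_lines:
        # Assumption: a line is "empty" when it contains only whitespace.
        lines = [line for line in lines if line.strip()]

    stats: dict[str, float] = {
        "n_lines": n_lines,
        "avg_line_length": sum(len(line) for line in lines) / len(lines),
    }
    for chars in short_thresholds:
        # Short lines: length <= threshold.
        stats[f"short_line_ratio_chars_{chars}"] = (
            sum(1 for line in lines if len(line) <= chars) / len(lines)
        )
    for chars in long_thresholds:
        # Long lines: length >= threshold.
        stats[f"long_line_ratio_chars_{chars}"] = (
            sum(1 for line in lines if len(line) >= chars) / len(lines)
        )
    return stats


example = line_stats_sketch("short\n\n" + "x" * 40)
filtered = line_stats_sketch("short\n\n" + "x" * 40, ignore_empty_lines=True)
```

Note that in this sketch a document consisting only of empty lines would divide by zero when `ignore_empty_lines=True`; how the real step handles that case is not covered here.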

I/O

Input: DocumentsPipeline -- a stream of Document objects. Each document must have a text attribute. Lines can optionally be pre-computed in doc.metadata["lines"]; otherwise, the text is split on "\n".

Output: Per-rank JSON files written to output_folder/{group}/{stat_name}/{rank:05d}.json. Each JSON file contains a MetricStatsDict mapping keys to MetricStats objects. Documents are yielded downstream with extracted stats added to doc.metadata.
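
As an illustration of the output layout, the relative path for one group, stat, and rank can be built as follows. `stat_output_path` is a hypothetical helper, not part of datatrove; it only mirrors the `{group}/{stat_name}/{rank:05d}.json` pattern stated above.

```python
def stat_output_path(group: str, stat_name: str, rank: int) -> str:
    """Build the path of a per-rank stats file, relative to output_folder."""
    # The rank is zero-padded to five digits, e.g. rank 3 -> "00003".
    return f"{group}/{stat_name}/{rank:05d}.json"


summary_path = stat_output_path("summary", "n_lines", 3)
histogram_path = stat_output_path("histogram", "avg_line_length", 12)
```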

Key Implementation Details

Lines are obtained from doc.metadata.get("lines") if present, otherwise by splitting doc.text on "\n". This allows upstream steps to pre-compute line splits.
The n_lines stat always includes empty lines, even when ignore_empty_lines=True. When enabled, the filter removes empty lines before avg_line_length, the ratios, and the duplicate counts are computed, so it affects those stats but not n_lines.
Line duplication detection uses find_duplicates from the Gopher repetition filter module, which returns both line-level and character-level duplicate counts.
Bullet point detection checks the first non-whitespace character of each line against the set {"-", "*", "•"}. Empty lines (after stripping) are not counted as bullet lines.
Terminal punctuation is detected using the END_PUNCTUATION tuple from datatrove.pipeline.filters.c4_filters, checked via line.endswith(END_PUNCTUATION).
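
The bullet, terminal-punctuation, and duplicate checks described above can be sketched in plain Python. Everything here is illustrative: `END_PUNCTUATION_ASSUMED` is a stand-in for the C4 filter's tuple (its exact contents are an assumption), and `count_duplicates` assumes that only the second and later occurrences of a line count as duplicates, which may differ from the behavior of `find_duplicates`.

```python
# Assumed stand-in for END_PUNCTUATION; the real tuple lives in the C4 filter module.
END_PUNCTUATION_ASSUMED = (".", "?", "!", '"', "'")
BULLET_CHARS = {"-", "*", "•"}


def is_bullet_line(line: str) -> bool:
    """True if the first non-whitespace character is a bullet character."""
    stripped = line.strip()
    # Lines that are empty after stripping are never bullet lines.
    return bool(stripped) and stripped[0] in BULLET_CHARS


def count_duplicates(lines: list[str]) -> tuple[int, int]:
    """Return (duplicate line count, duplicate character count)."""
    seen: set[str] = set()
    dup_lines = 0
    dup_chars = 0
    for line in lines:
        if line in seen:
            # Assumption: only repeat occurrences are counted, not the first one.
            dup_lines += 1
            dup_chars += len(line)
        else:
            seen.add(line)
    return dup_lines, dup_chars


def repetition_stats_sketch(lines: list[str]) -> dict[str, float]:
    """Hypothetical re-creation of the four ratio stats described above."""
    dup_lines, dup_chars = count_duplicates(lines)
    return {
        # str.endswith accepts a tuple and matches any of its suffixes.
        "lines_ending_with_terminal_mark_ratio": sum(
            1 for line in lines if line.endswith(END_PUNCTUATION_ASSUMED)
        ) / len(lines),
        "bullet_point_lines_ratio": sum(1 for line in lines if is_bullet_line(line))
        / len(lines),
        "line_duplicates": dup_lines / len(lines),
        "line_char_duplicates": dup_chars / sum(len(line) for line in lines),
    }


ratios = repetition_stats_sketch(["- item", "  * nested", "Done.", "Done.", ""])
```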

Example Usage

from datatrove.pipeline.stats import LineStats

# Basic usage with default thresholds
line_stats = LineStats(
    output_folder="s3://my-bucket/stats/line_stats",
)

# Custom thresholds, ignoring empty lines for ratios
line_stats = LineStats(
    output_folder="/data/stats/line_stats",
    max_k_chars_per_line_tresholds=[5, 10, 20],
    min_k_chars_per_line_thresholds=[500, 1000, 5000],
    ignore_empty_lines=True,
    groups_to_compute=["summary", "histogram"],
)
