Implementation:Huggingface Datatrove LineStats

Overview

LineStats is a pipeline step that computes line-level statistical metrics for each document in a DocumentsPipeline. It extends BaseStats and produces per-rank JSON output files containing aggregated line-level statistics.

Signature

class LineStats(BaseStats):
    def __init__(
        self,
        output_folder: DataFolderLike,
        max_k_chars_per_line_tresholds: list[int] | None = None,
        min_k_chars_per_line_thresholds: list[int] | None = None,
        groups_to_compute: list[GROUP] = list(get_args(GROUP)),
        ignore_empty_lines: bool = False,
        histogram_round_digits: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None:

Import

from datatrove.pipeline.stats import LineStats

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| output_folder | DataFolderLike | (required) | Path or data folder where per-rank statistics JSON files will be written. |
| max_k_chars_per_line_tresholds | list[int] \| None | None (resolves to [10, 30]) | Character-length thresholds for the short-line ratio. A line with length <= threshold counts as short. Note: the parameter name is spelled "tresholds" (sic) in the source code, so it must be passed with that spelling. |
| min_k_chars_per_line_thresholds | list[int] \| None | None (resolves to [2000, 10000]) | Character-length thresholds for the long-line ratio. A line with length >= threshold counts as long. |
| groups_to_compute | list[GROUP] | ["summary", "histogram", "fqdn", "suffix"] | List of grouping strategies for aggregating statistics. |
| ignore_empty_lines | bool | False | Whether to exclude empty lines from ratio computations. Empty lines are always included in n_lines regardless of this setting. |
| histogram_round_digits | int | 3 | Number of decimal digits to round histogram bin values to, controlling histogram granularity. |
| top_k_config | TopKConfig | DEFAULT_TOP_K_CONFIG | Configuration for top-k truncation of high-cardinality groups (fqdn, suffix). The default retains the top 100,000 keys. |
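
To make the effect of histogram_round_digits concrete, the following is a minimal sketch in plain Python. It assumes that histogram bins are keyed by the rounded stat value; histogram_bins is a hypothetical helper written for illustration and is not part of datatrove.

```python
from collections import Counter


def histogram_bins(values: list[float], round_digits: int = 3) -> Counter:
    """Hypothetical helper: count how many values fall into each rounded bin."""
    bins: Counter = Counter()
    for value in values:
        # Assumption: the bin key is the value rounded to `round_digits` decimals.
        # Fewer digits merge nearby values into coarser bins.
        bins[round(value, round_digits)] += 1
    return bins


ratios = [0.1234, 0.1236, 0.1301, 0.5]
fine = histogram_bins(ratios, round_digits=3)    # four distinct bins
coarse = histogram_bins(ratios, round_digits=1)  # two bins: 0.1 and 0.5
```

With three digits each sample value keeps its own bin, while one digit collapses the first three values into a single bin, which is the granularity trade-off the parameter controls.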

Available Stats

The following statistics are extracted per document via the extract_stats method:

| Stat Name | Description |
| --- | --- |
| n_lines | Number of lines in the document (always includes empty lines). |
| avg_line_length | Average character length of lines: sum(len(line) for line in lines) / len(lines). |
| short_line_ratio_chars_{chars} | Ratio of lines with length <= {chars} characters. One stat per threshold in max_k_chars_per_line_tresholds. |
| long_line_ratio_chars_{chars} | Ratio of lines with length >= {chars} characters. One stat per threshold in min_k_chars_per_line_thresholds. |
| lines_ending_with_terminal_mark_ratio | Ratio of lines ending with terminal punctuation (END_PUNCTUATION from the C4 filter). |
| bullet_point_lines_ratio | Ratio of lines starting with bullet characters (-, *, or the Unicode bullet character). |
| line_duplicates | Ratio of lines that are exact duplicates of other lines in the document. |
| line_char_duplicates | Ratio of characters belonging to duplicated lines, relative to total character count. |
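
The length-based stats above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the library's code: the function names (short_line_ratio, long_line_ratio, threshold_stats) are hypothetical, and only the stat names, the <= and >= comparisons, and the default thresholds are taken from this page.

```python
def short_line_ratio(lines: list[str], max_chars: int) -> float:
    # A line counts as short when its length is <= the threshold.
    return sum(1 for line in lines if len(line) <= max_chars) / len(lines)


def long_line_ratio(lines: list[str], min_chars: int) -> float:
    # A line counts as long when its length is >= the threshold.
    return sum(1 for line in lines if len(line) >= min_chars) / len(lines)


def threshold_stats(
    lines: list[str],
    short_thresholds: tuple[int, ...] = (10, 30),
    long_thresholds: tuple[int, ...] = (2000, 10000),
) -> dict[str, float]:
    """Hypothetical helper: one ratio stat per configured threshold."""
    stats: dict[str, float] = {
        "n_lines": len(lines),
        "avg_line_length": sum(len(line) for line in lines) / len(lines),
    }
    for chars in short_thresholds:
        stats[f"short_line_ratio_chars_{chars}"] = short_line_ratio(lines, chars)
    for chars in long_thresholds:
        stats[f"long_line_ratio_chars_{chars}"] = long_line_ratio(lines, chars)
    return stats


example_lines = ["short", "a" * 25, "b" * 2500, ""]
example_stats = threshold_stats(example_lines)
```

Each threshold yields its own stat key, so the default configuration produces short_line_ratio_chars_10, short_line_ratio_chars_30, long_line_ratio_chars_2000, and long_line_ratio_chars_10000.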

I/O

Input: DocumentsPipeline -- a stream of Document objects. Each document must have a text attribute. Lines can optionally be pre-computed in doc.metadata["lines"]; otherwise, the text is split on "\n".

Output: Per-rank JSON files written to output_folder/{group}/{stat_name}/{rank:05d}.json. Each JSON file contains a MetricStatsDict mapping keys to MetricStats objects. Documents are yielded downstream with extracted stats added to doc.metadata.
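
As a small illustration of the output layout, the path pattern above can be reproduced with an f-string. The helper stat_output_path is hypothetical and only mirrors the pattern stated on this page; it does not call into datatrove or write any files.

```python
def stat_output_path(output_folder: str, group: str, stat_name: str, rank: int) -> str:
    """Hypothetical helper mirroring output_folder/{group}/{stat_name}/{rank:05d}.json."""
    # {rank:05d} zero-pads the rank to five digits, e.g. 7 -> "00007".
    return f"{output_folder.rstrip('/')}/{group}/{stat_name}/{rank:05d}.json"


# Files that rank 7 would write for two groups and two stats.
paths = [
    stat_output_path("/data/stats/line_stats", group, stat, rank=7)
    for group in ("summary", "histogram")
    for stat in ("n_lines", "avg_line_length")
]
```

The zero-padded rank keeps the per-rank files in lexicographic order inside each stat directory.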

Key Implementation Details

  • Lines are obtained from doc.metadata.get("lines") if present, otherwise by splitting doc.text on "\n". This allows upstream steps to pre-compute line splits.
  • The n_lines stat always includes empty lines, even when ignore_empty_lines=True. The setting only changes which lines the remaining stats are computed over.
  • Line duplication detection uses find_duplicates from the Gopher repetition filter module, which returns both line-level and character-level duplicate counts.
  • Bullet point detection checks the first non-whitespace character of each line against the set {"-", "*", "•"}. Empty lines (after stripping) are not counted as bullet lines.
  • Terminal punctuation is detected using the END_PUNCTUATION tuple from datatrove.pipeline.filters.c4_filters, checked via line.endswith(END_PUNCTUATION).
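
The details listed above can be put together in a plain-Python sketch. Several parts are assumptions made for illustration: TERMINAL_MARKS is a stand-in for END_PUNCTUATION (the real tuple lives in datatrove.pipeline.filters.c4_filters), count_duplicates assumes a line counts as a duplicate from its second occurrence onward, and all function names are hypothetical rather than datatrove APIs.

```python
from typing import Optional

# Stand-in for END_PUNCTUATION; the real tuple is defined in datatrove's C4 filter module.
TERMINAL_MARKS = (".", "?", "!")
BULLETS = {"-", "*", "•"}


def is_bullet_line(line: str) -> bool:
    # Check the first non-whitespace character; blank lines are never bullets.
    stripped = line.strip()
    return bool(stripped) and stripped[0] in BULLETS


def count_duplicates(lines: list[str]) -> tuple[int, int]:
    """Assumption: a line is a duplicate from its second occurrence onward."""
    seen: set[str] = set()
    dup_lines = dup_chars = 0
    for line in lines:
        if line in seen:
            dup_lines += 1
            dup_chars += len(line)
        else:
            seen.add(line)
    return dup_lines, dup_chars


def sketch_line_details(
    text: str, metadata: Optional[dict] = None, ignore_empty_lines: bool = False
) -> dict[str, float]:
    # Prefer pre-computed lines from metadata, otherwise split the text on "\n".
    lines = (metadata or {}).get("lines")
    if lines is None:
        lines = text.split("\n")
    n_lines = len(lines)  # counted before any empty-line filtering
    if ignore_empty_lines:
        lines = [line for line in lines if line.strip()]
    dup_lines, dup_chars = count_duplicates(lines)
    return {
        "n_lines": n_lines,
        "lines_ending_with_terminal_mark_ratio": sum(
            line.endswith(TERMINAL_MARKS) for line in lines
        ) / len(lines),
        "bullet_point_lines_ratio": sum(is_bullet_line(line) for line in lines) / len(lines),
        "line_duplicates": dup_lines / len(lines),
        "line_char_duplicates": dup_chars / sum(len(line) for line in lines),
    }


sample = "- item one\n\nHello world.\nHello world.\n  * item two"
details = sketch_line_details(sample, ignore_empty_lines=True)
```

On the sample text, n_lines stays at 5 while the ratios are computed over the 4 non-empty lines. The sketch does not guard against documents with no characters, where the final division would fail.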

Example Usage

from datatrove.pipeline.stats import LineStats

# Basic usage with default thresholds
line_stats = LineStats(
    output_folder="s3://my-bucket/stats/line_stats",
)

# Custom thresholds, ignoring empty lines for ratios
line_stats = LineStats(
    output_folder="/data/stats/line_stats",
    max_k_chars_per_line_tresholds=[5, 10, 20],
    min_k_chars_per_line_thresholds=[500, 1000, 5000],
    ignore_empty_lines=True,
    groups_to_compute=["summary", "histogram"],
)
