Implementation:Huggingface Datatrove LineStats
Overview
LineStats is a pipeline step that computes line-level statistical metrics for each document in a DocumentsPipeline. It extends BaseStats and produces per-rank JSON output files containing aggregated line-level statistics.
Signature
class LineStats(BaseStats):
    def __init__(
        self,
        output_folder: DataFolderLike,
        max_k_chars_per_line_tresholds: list[int] | None = None,
        min_k_chars_per_line_thresholds: list[int] | None = None,
        groups_to_compute: list[GROUP] = list(get_args(GROUP)),
        ignore_empty_lines: bool = False,
        histogram_round_digits: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None:
Import
from datatrove.pipeline.stats import LineStats
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| output_folder | DataFolderLike | (required) | Path or data folder where per-rank statistics JSON files will be written. |
| max_k_chars_per_line_tresholds | list[int] \| None | None (resolves to [10, 30]) | Character length thresholds for short line ratio computation. A line with length <= threshold is counted as short. Note: the parameter name is spelled "tresholds" (without the first "h") in the source code and must be written that way. |
| min_k_chars_per_line_thresholds | list[int] \| None | None (resolves to [2000, 10000]) | Character length thresholds for long line ratio computation. A line with length >= threshold is counted as long. |
| groups_to_compute | list[GROUP] | ["summary", "histogram", "fqdn", "suffix"] | List of grouping strategies for aggregating statistics. |
| ignore_empty_lines | bool | False | Whether to exclude empty lines from ratio computations. Empty lines are always included in n_lines regardless of this setting. |
| histogram_round_digits | int | 3 | Number of decimal digits to round histogram bin values to, controlling histogram granularity. |
| top_k_config | TopKConfig | DEFAULT_TOP_K_CONFIG | Configuration for top-k truncation of high-cardinality groups (fqdn, suffix). The default retains the top 100,000 keys. |
Available Stats
The following statistics are extracted per document via the extract_stats method:
| Stat Name | Description |
|---|---|
| n_lines | Number of lines in the document (always includes empty lines). |
| avg_line_length | Average character length of lines: sum(len(line) for line in lines) / len(lines). |
| short_line_ratio_chars_{chars} | Ratio of lines with length <= {chars} characters. One stat per threshold in max_k_chars_per_line_tresholds. |
| long_line_ratio_chars_{chars} | Ratio of lines with length >= {chars} characters. One stat per threshold in min_k_chars_per_line_thresholds. |
| lines_ending_with_terminal_mark_ratio | Ratio of lines ending with terminal punctuation (END_PUNCTUATION from the C4 filter). |
| bullet_point_lines_ratio | Ratio of lines starting with bullet characters (-, *, or the Unicode bullet character). |
| line_duplicates | Ratio of lines that are exact duplicates of other lines in the document. |
| line_char_duplicates | Ratio of characters belonging to duplicated lines, relative to total character count. |
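To make the length-based stats concrete, here is a minimal standalone sketch of how n_lines, avg_line_length, and the short/long ratios could be computed from a list of lines. It is not the datatrove implementation: the function name line_length_stats is hypothetical, and treating whitespace-only lines as empty is an assumption.

```python
def line_length_stats(
    lines: list[str],
    short_thresholds: tuple[int, ...] = (10, 30),
    long_thresholds: tuple[int, ...] = (2000, 10000),
    ignore_empty_lines: bool = False,
) -> dict[str, float]:
    """Hypothetical helper mirroring the length-based stats in the table above."""
    # n_lines is taken before any filtering, so it always includes empty lines.
    n_lines = len(lines)

    if ignore_empty_lines:
        # Assumption: a line counts as empty if nothing remains after stripping.
        lines = [line for line in lines if line.strip()]

    stats: dict[str, float] = {"n_lines": n_lines}
    if not lines:
        # Guard added for this sketch only, to avoid dividing by zero.
        return stats

    total = len(lines)
    stats["avg_line_length"] = sum(len(line) for line in lines) / total
    for chars in short_thresholds:
        # Short lines: length <= threshold.
        stats[f"short_line_ratio_chars_{chars}"] = (
            sum(len(line) <= chars for line in lines) / total
        )
    for chars in long_thresholds:
        # Long lines: length >= threshold.
        stats[f"long_line_ratio_chars_{chars}"] = (
            sum(len(line) >= chars for line in lines) / total
        )
    return stats
```

With ignore_empty_lines=True the denominators shrink to the non-empty lines, while n_lines still reports the full count.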
I/O
Input: DocumentsPipeline -- a stream of Document objects. Each document must have a text attribute. Lines can optionally be pre-computed in doc.metadata["lines"]; otherwise, the text is split on "\n".
Output: Per-rank JSON files written to output_folder/{group}/{stat_name}/{rank:05d}.json. Each JSON file contains a MetricStatsDict mapping keys to MetricStats objects. Documents are yielded downstream with extracted stats added to doc.metadata.
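The output layout described above can be illustrated with a small sketch that builds the path of one per-rank file relative to output_folder. The helper name stat_output_path is hypothetical and exists only for illustration; it is not part of datatrove.

```python
def stat_output_path(group: str, stat_name: str, rank: int) -> str:
    """Hypothetical helper: relative path of one per-rank stats file.

    Follows the {group}/{stat_name}/{rank:05d}.json layout, where the
    rank is zero-padded to five digits.
    """
    return f"{group}/{stat_name}/{rank:05d}.json"


# Example: the summary-group file for the n_lines stat written by rank 7.
example_path = stat_output_path("summary", "n_lines", 7)
```

Because each rank writes to its own file, parallel workers do not contend for the same output path.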
Key Implementation Details
- Lines are obtained from doc.metadata.get("lines") if present, otherwise by splitting doc.text on "\n". This allows upstream steps to pre-compute line splits.
- The n_lines stat always includes empty lines, even when ignore_empty_lines=True. The filtering only affects the denominator for ratio computations.
- Line duplication detection uses find_duplicates from the Gopher repetition filter module, which returns both line-level and character-level duplicate counts.
- Bullet point detection checks the first non-whitespace character of each line against the set {"-", "*", "•"}. Empty lines (after stripping) are not counted as bullet lines.
- Terminal punctuation is detected using the END_PUNCTUATION tuple from datatrove.pipeline.filters.c4_filters, checked via line.endswith(END_PUNCTUATION).
Example Usage
from datatrove.pipeline.stats import LineStats
# Basic usage with default thresholds
line_stats = LineStats(
    output_folder="s3://my-bucket/stats/line_stats",
)
# Custom thresholds, ignoring empty lines for ratios
line_stats = LineStats(
    output_folder="/data/stats/line_stats",
    max_k_chars_per_line_tresholds=[5, 10, 20],
    min_k_chars_per_line_thresholds=[500, 1000, 5000],
    ignore_empty_lines=True,
    groups_to_compute=["summary", "histogram"],
)