Implementation: Huggingface Datatrove TokenStats
| Knowledge Sources | |
|---|---|
| Domains | Data Quality, Natural Language Processing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
TokenStats is a statistics pipeline step that counts the number of tokens in each document using a configurable subword tokenizer.
Description
TokenStats extends both BaseStats and PipelineStepWithTokenizer through multiple inheritance to provide token counting capabilities within the statistics framework. It uses a Hugging Face tokenizer (defaulting to GPT-2's tokenizer) to encode document text and counts the resulting tokens.
The class implements an efficient caching strategy: it first checks whether the document's metadata already contains a token_count field (which may have been set by a prior pipeline step). If present, the existing count is reused without re-tokenizing, avoiding redundant computation. If not present, the text is tokenized using self.tokenizer.encode(doc.text).tokens and the length of the resulting token list is used.
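The caching logic can be sketched in isolation with a stand-in `Document` and any tokenizer exposing an `encode(...)` method whose result has a `.tokens` list (the interface TokenStats relies on). The `WhitespaceTokenizer` below is purely illustrative, not part of datatrove:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    # stand-in for datatrove's Document: text plus a metadata dict
    text: str
    metadata: dict = field(default_factory=dict)

class WhitespaceTokenizer:
    # illustrative stand-in for a subword tokenizer; encode() returns an
    # object with a .tokens list, matching the interface TokenStats uses
    class _Encoding:
        def __init__(self, tokens):
            self.tokens = tokens

    def encode(self, text):
        return self._Encoding(text.split())

def extract_stats(doc, tokenizer):
    # reuse a cached count set by an earlier pipeline step, if present
    cached = doc.metadata.get("token_count")
    if cached is not None:
        return {"token_count": cached}
    # otherwise tokenize and count the resulting tokens
    return {"token_count": len(tokenizer.encode(doc.text).tokens)}

tok = WhitespaceTokenizer()
fresh = extract_stats(Document(text="hello world again"), tok)   # tokenizes: 3 tokens
cached = extract_stats(Document(text="ignored", metadata={"token_count": 42}), tok)
```

Note that a cached count short-circuits tokenization entirely, so the second call never touches the tokenizer.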
The output statistic token_count is an integer representing the number of subword tokens in the document. This metric is particularly valuable because token count is the fundamental unit of measurement for language model training budgets and is used in histogram grouping (the base class uses token_count metadata when generating histogram statistics).
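Conceptually, a histogram group is just a tally of how many documents share each value. A minimal sketch with hypothetical token counts (the real implementation aggregates across shards and grouping keys):

```python
from collections import Counter

# hypothetical token counts recorded by TokenStats for a small corpus
token_counts = [512, 480, 512, 1024, 256, 512]

# a histogram group tallies how many documents share each token count;
# in datatrove these values come from each document's token_count metadata
histogram = Counter(token_counts)
```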
TokenStats requires the tokenizers library in addition to the tldextract dependency inherited from BaseStats. Because both base classes derive from PipelineStep, the hierarchy forms a diamond; calling BaseStats.__init__ and PipelineStepWithTokenizer.__init__ explicitly handles this cleanly.
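The diamond can be illustrated with stub classes (the empty bodies are placeholders for the real datatrove implementations, and the layout shown is an assumption about the hierarchy):

```python
# stub hierarchy: both stats bases derive from PipelineStep, forming a diamond
class PipelineStep: ...
class BaseStats(PipelineStep): ...
class PipelineStepWithTokenizer(PipelineStep): ...

class TokenStats(BaseStats, PipelineStepWithTokenizer):
    def __init__(self):
        # call each base initializer explicitly instead of relying on
        # cooperative super() calls up the MRO
        BaseStats.__init__(self)
        PipelineStepWithTokenizer.__init__(self)

# Python's C3 linearization puts PipelineStep in the MRO only once
mro_names = [cls.__name__ for cls in TokenStats.__mro__]
```

Here `mro_names` is `["TokenStats", "BaseStats", "PipelineStepWithTokenizer", "PipelineStep", "object"]`: the shared base appears exactly once, which is why the explicit per-base `__init__` calls are a reasonable way to initialize both branches.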
Usage
Use TokenStats when you need to measure the token-level size of documents in your dataset. This is essential for estimating training costs, balancing dataset composition by token count, or providing token-weighted statistics in other stats computations.
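As a sketch of the budgeting use case, with entirely hypothetical per-document counts (the kind of values TokenStats would report):

```python
# hypothetical per-document token counts, as TokenStats would report them
doc_token_counts = [512, 1024, 256, 2048]

total_tokens = sum(doc_token_counts)       # dataset size in tokens
epochs = 3
training_tokens = total_tokens * epochs    # tokens seen over a full training run

# token-weighted composition: each document's share of the training mix
shares = [n / total_tokens for n in doc_token_counts]
```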
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/stats/token_stats.py
- Lines: 1-38
Signature
```python
class TokenStats(BaseStats, PipelineStepWithTokenizer):
    name = "🔗 Token counter"
    _requires_dependencies = ["tokenizers"] + BaseStats._requires_dependencies

    def __init__(
        self,
        output_folder: DataFolderLike,
        tokenizer_name_or_path: str = "gpt2",
        groups_to_compute: list[GROUP] = ["fqdn", "suffix", "summary", "histogram"],
        histogram_rounding: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None:
        ...

    def extract_stats(self, doc: Document) -> dict[str, int | float]:
        ...
```
Import
```python
from datatrove.pipeline.stats.token_stats import TokenStats
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output_folder | DataFolderLike | Yes | Folder where computed statistics will be saved |
| tokenizer_name_or_path | str | No | Hugging Face tokenizer identifier or path (default: "gpt2") |
| groups_to_compute | list[GROUP] | No | Grouping strategies for statistics (default: all four groups) |
| histogram_rounding | int | No | Decimal digits for histogram rounding (default: 3) |
| top_k_config | TopKConfig | No | Top-K configuration for high-cardinality groups |
Outputs
| Name | Type | Description |
|---|---|---|
| token_count | int | Number of subword tokens in the document as determined by the specified tokenizer |
Usage Examples
Basic Usage
```python
from datatrove.pipeline.stats.token_stats import TokenStats

# Count tokens using the default GPT-2 tokenizer
stats = TokenStats(
    output_folder="output/stats/",
)
```
Custom Tokenizer
```python
from datatrove.pipeline.stats.token_stats import TokenStats

# Count tokens using a specific model's tokenizer
stats = TokenStats(
    output_folder="output/stats/",
    tokenizer_name_or_path="meta-llama/Llama-2-7b-hf",
    groups_to_compute=["summary", "histogram"],
)
```