
Implementation:Huggingface Datatrove TokenStats

Domains Data Quality, Natural Language Processing
Last Updated 2026-02-14 17:00 GMT

Overview

TokenStats is a statistics pipeline step that counts the number of tokens in each document using a configurable subword tokenizer.

Description

TokenStats extends both BaseStats and PipelineStepWithTokenizer through multiple inheritance to provide token counting capabilities within the statistics framework. It uses a Hugging Face tokenizer (defaulting to GPT-2's tokenizer) to encode document text and counts the resulting tokens.

The class implements an efficient caching strategy: it first checks whether the document's metadata already contains a token_count field (which may have been set by a prior pipeline step). If present, the existing count is reused without re-tokenizing, avoiding redundant computation. If not present, the text is tokenized using self.tokenizer.encode(doc.text).tokens and the length of the resulting token list is used.
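
This logic can be sketched as follows (an illustrative paraphrase, not the verbatim source; doc.metadata is the standard datatrove Document metadata dict):

def extract_stats(self, doc: Document) -> dict[str, int | float]:
    # Reuse a count set by an earlier pipeline step, if present
    token_count = doc.metadata.get("token_count")
    if token_count is None:
        # self.tokenizer is the lazily loaded Hugging Face tokenizer
        # provided by PipelineStepWithTokenizer
        token_count = len(self.tokenizer.encode(doc.text).tokens)
    return {"token_count": token_count}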

The output statistic token_count is an integer giving the number of subword tokens in the document. This metric is particularly valuable because token counts are the fundamental unit of language model training budgets, and the base class reuses the token_count metadata when generating histogram statistics.

TokenStats requires the tokenizers library in addition to the dependencies declared by BaseStats (which include tldextract). Its initializer calls BaseStats.__init__ and PipelineStepWithTokenizer.__init__ explicitly, which handles the diamond inheritance pattern cleanly.
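
How the two parent initializers might be wired, assuming BaseStats takes the grouping and histogram arguments shown in the signature below (a sketch under those assumptions, not the verbatim source):

class TokenStats(BaseStats, PipelineStepWithTokenizer):
    def __init__(
        self,
        output_folder: DataFolderLike,
        tokenizer_name_or_path: str = "gpt2",
        groups_to_compute: list[GROUP] = ["fqdn", "suffix", "summary", "histogram"],
        histogram_rounding: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None:
        # Each parent is initialized explicitly because they expect
        # different arguments; this resolves the diamond over the shared
        # PipelineStep base without relying on cooperative super() calls.
        BaseStats.__init__(self, output_folder, groups_to_compute, histogram_rounding, top_k_config)
        PipelineStepWithTokenizer.__init__(self)
        self.tokenizer_name_or_path = tokenizer_name_or_path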

Usage

Use TokenStats when you need to measure the token-level size of documents in your dataset. This is essential for estimating training costs, balancing dataset composition by token count, or providing token-weighted statistics in other stats computations.

Code Reference

Source Location

src/datatrove/pipeline/stats/token_stats.py in the huggingface/datatrove repository

Signature

class TokenStats(BaseStats, PipelineStepWithTokenizer):
    name = "🔗 Token counter"
    _requires_dependencies = ["tokenizers"] + BaseStats._requires_dependencies

    def __init__(
        self,
        output_folder: DataFolderLike,
        tokenizer_name_or_path: str = "gpt2",
        groups_to_compute: list[GROUP] = ["fqdn", "suffix", "summary", "histogram"],
        histogram_rounding: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None

    def extract_stats(self, doc: Document) -> dict[str, int | float]:
        ...

Import

from datatrove.pipeline.stats.token_stats import TokenStats

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| output_folder | DataFolderLike | Yes | Folder where computed statistics will be saved |
| tokenizer_name_or_path | str | No | Hugging Face tokenizer identifier or path (default: "gpt2") |
| groups_to_compute | list[GROUP] | No | Grouping strategies for statistics (default: all four groups) |
| histogram_rounding | int | No | Decimal digits for histogram rounding (default: 3) |
| top_k_config | TopKConfig | No | Top-K configuration for high-cardinality groups (default: DEFAULT_TOP_K_CONFIG) |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| token_count | int | Number of subword tokens in the document, as determined by the specified tokenizer |
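
Because extract_stats checks metadata first, an upstream step can pre-populate token_count and TokenStats will report it without re-tokenizing. An illustrative sketch (the hard-coded count here is purely for demonstration):

from datatrove.data import Document

doc = Document(text="Hello world", id="0")
doc.metadata["token_count"] = 2  # e.g. written by an earlier tokenization step
# TokenStats would now emit {"token_count": 2} for this document
# without calling the tokenizer again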

Usage Examples

Basic Usage

from datatrove.pipeline.stats.token_stats import TokenStats

# Count tokens using the default GPT-2 tokenizer
stats = TokenStats(
    output_folder="output/stats/",
)

Custom Tokenizer

from datatrove.pipeline.stats.token_stats import TokenStats

# Count tokens using a specific model's tokenizer
stats = TokenStats(
    output_folder="output/stats/",
    tokenizer_name_or_path="meta-llama/Llama-2-7b-hf",
    groups_to_compute=["summary", "histogram"],
)
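
Pipeline Integration

A hedged end-to-end sketch: the JsonlReader input path, task count, and choice of local executor are illustrative assumptions, not part of the TokenStats API.

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.stats.token_stats import TokenStats

# Read JSONL shards, compute per-document token counts, and write
# grouped statistics under output/stats/
executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("data/input/"),
        TokenStats(output_folder="output/stats/"),
    ],
    tasks=4,
)
executor.run()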
