Implementation: Huggingface Datatrove TokenStats
| Knowledge Sources | |
|---|---|
| Domains | Data Quality, Natural Language Processing |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
TokenStats is a statistics pipeline step that counts the number of tokens in each document using a configurable subword tokenizer.
Description
TokenStats extends both BaseStats and PipelineStepWithTokenizer through multiple inheritance to provide token counting capabilities within the statistics framework. It uses a Hugging Face tokenizer (defaulting to GPT-2's tokenizer) to encode document text and counts the resulting tokens.
The class implements an efficient caching strategy: it first checks whether the document's metadata already contains a token_count field (which may have been set by a prior pipeline step). If present, the existing count is reused without re-tokenizing, avoiding redundant computation. If not present, the text is tokenized using self.tokenizer.encode(doc.text).tokens and the length of the resulting token list is used.
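The caching logic can be sketched in isolation with a stand-in `Document` and any tokenizer exposing an `encode(...)` method whose result has a `.tokens` list (the interface TokenStats relies on). The `WhitespaceTokenizer` below is purely illustrative, not part of datatrove:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    # stand-in for datatrove's Document: text plus a metadata dict
    text: str
    metadata: dict = field(default_factory=dict)

class WhitespaceTokenizer:
    # illustrative stand-in for a subword tokenizer; encode() returns an
    # object with a .tokens list, matching the interface TokenStats uses
    class _Encoding:
        def __init__(self, tokens):
            self.tokens = tokens

    def encode(self, text):
        return self._Encoding(text.split())

def extract_stats(doc, tokenizer):
    # reuse a cached count set by an earlier pipeline step, if present
    cached = doc.metadata.get("token_count")
    if cached is not None:
        return {"token_count": cached}
    # otherwise tokenize and count the resulting tokens
    return {"token_count": len(tokenizer.encode(doc.text).tokens)}

tok = WhitespaceTokenizer()
fresh = extract_stats(Document(text="hello world again"), tok)   # tokenizes: 3 tokens
cached = extract_stats(Document(text="ignored", metadata={"token_count": 42}), tok)
```

Note that a cached count short-circuits tokenization entirely, so the second call never touches the tokenizer.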
The output statistic token_count is an integer representing the number of subword tokens in the document. This metric is particularly valuable because token count is the fundamental unit of measurement for language model training budgets and is used in histogram grouping (the base class uses token_count metadata when generating histogram statistics).
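Conceptually, a histogram group is just a tally of how many documents share each value. A minimal sketch with hypothetical token counts (the real implementation aggregates across shards and grouping keys):

```python
from collections import Counter

# hypothetical token counts recorded by TokenStats for a small corpus
token_counts = [512, 480, 512, 1024, 256, 512]

# a histogram group tallies how many documents share each token count;
# in datatrove these values come from each document's token_count metadata
histogram = Counter(token_counts)
```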
TokenStats requires the tokenizers library in addition to the tldextract dependency inherited from BaseStats. Because both base classes derive from PipelineStep, the hierarchy forms a diamond; calling BaseStats.__init__ and PipelineStepWithTokenizer.__init__ explicitly handles this cleanly.
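The diamond can be illustrated with stub classes (the empty bodies are placeholders for the real datatrove implementations, and the layout shown is an assumption about the hierarchy):

```python
# stub hierarchy: both stats bases derive from PipelineStep, forming a diamond
class PipelineStep: ...
class BaseStats(PipelineStep): ...
class PipelineStepWithTokenizer(PipelineStep): ...

class TokenStats(BaseStats, PipelineStepWithTokenizer):
    def __init__(self):
        # call each base initializer explicitly instead of relying on
        # cooperative super() calls up the MRO
        BaseStats.__init__(self)
        PipelineStepWithTokenizer.__init__(self)

# Python's C3 linearization puts PipelineStep in the MRO only once
mro_names = [cls.__name__ for cls in TokenStats.__mro__]
```

Here `mro_names` is `["TokenStats", "BaseStats", "PipelineStepWithTokenizer", "PipelineStep", "object"]`: the shared base appears exactly once, which is why the explicit per-base `__init__` calls are a reasonable way to initialize both branches.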
Usage
Use TokenStats when you need to measure the token-level size of documents in your dataset. This is essential for estimating training costs, balancing dataset composition by token count, or providing token-weighted statistics in other stats computations.
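As a sketch of the budgeting use case, with entirely hypothetical per-document counts (the kind of values TokenStats would report):

```python
# hypothetical per-document token counts, as TokenStats would report them
doc_token_counts = [512, 1024, 256, 2048]

total_tokens = sum(doc_token_counts)       # dataset size in tokens
epochs = 3
training_tokens = total_tokens * epochs    # tokens seen over a full training run

# token-weighted composition: each document's share of the training mix
shares = [n / total_tokens for n in doc_token_counts]
```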
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/stats/token_stats.py
- Lines: 1-38
Signature
```python
class TokenStats(BaseStats, PipelineStepWithTokenizer):
    name = "🔗 Token counter"
    _requires_dependencies = ["tokenizers"] + BaseStats._requires_dependencies

    def __init__(
        self,
        output_folder: DataFolderLike,
        tokenizer_name_or_path: str = "gpt2",
        groups_to_compute: list[GROUP] = ["fqdn", "suffix", "summary", "histogram"],
        histogram_rounding: int = 3,
        top_k_config: TopKConfig = DEFAULT_TOP_K_CONFIG,
    ) -> None:
        ...

    def extract_stats(self, doc: Document) -> dict[str, int | float]:
        ...
```
Import
```python
from datatrove.pipeline.stats.token_stats import TokenStats
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output_folder | DataFolderLike | Yes | Folder where computed statistics will be saved |
| tokenizer_name_or_path | str | No | Hugging Face tokenizer identifier or path (default: "gpt2") |
| groups_to_compute | list[GROUP] | No | Grouping strategies for statistics (default: all four groups) |
| histogram_rounding | int | No | Decimal digits for histogram rounding (default: 3) |
| top_k_config | TopKConfig | No | Top-K configuration for high-cardinality groups |
Outputs
| Name | Type | Description |
|---|---|---|
| token_count | int | Number of subword tokens in the document as determined by the specified tokenizer |
Usage Examples
Basic Usage
```python
from datatrove.pipeline.stats.token_stats import TokenStats

# Count tokens using the default GPT-2 tokenizer
stats = TokenStats(
    output_folder="output/stats/",
)
```
Custom Tokenizer
```python
from datatrove.pipeline.stats.token_stats import TokenStats

# Count tokens using a specific model's tokenizer
stats = TokenStats(
    output_folder="output/stats/",
    tokenizer_name_or_path="meta-llama/Llama-2-7b-hf",
    groups_to_compute=["summary", "histogram"],
)
```