Implementation: Huggingface Datatrove TokensCounter
| Sources | Domains | Last Updated |
|---|---|---|
| Huggingface Datatrove | Tokenization, Statistics | 2026-02-14 |
Overview
Pipeline step that counts tokens in each document using a HuggingFace fast tokenizer and stores the count in document metadata without saving tokenized output.
Description
TokensCounter extends PipelineStepWithTokenizer and provides a lightweight token counting pass. It tokenizes document text in batches using tokenizer.encode_batch, extracts the length of each encoding's ids list as the token count, optionally adds 1 for the EOS token, and stores the result in document.metadata["token_count"]. The document is then yielded downstream unchanged (except for the added metadata field).
The class tracks cumulative token statistics via stat_update("tokens", value=count), which populates the pipeline's stats dictionary for reporting. Time tracking is performed at the batch level via track_time(unit="batch").
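The counting pass described above can be sketched in isolation. The following is a minimal illustration of the same pattern, not the datatrove implementation: a trivial whitespace "tokenizer" stands in for a HuggingFace fast tokenizer (the FakeTokenizer/FakeEncoding names and the count_tokens helper are assumptions introduced here for the sketch), mimicking the encode_batch call, the len(encoding.ids) count, the optional EOS increment, and the metadata write-through.

```python
class FakeEncoding:
    """Stand-in for a tokenizers Encoding; only exposes .ids."""
    def __init__(self, ids):
        self.ids = ids


class FakeTokenizer:
    """Stand-in for a HuggingFace fast tokenizer (assumption: whitespace split
    replaces real subword tokenization, purely for illustration)."""
    def encode_batch(self, texts):
        return [FakeEncoding(text.split()) for text in texts]


def count_tokens(documents, tokenizer, count_eos_token=False, batch_size=10000):
    """Yield documents unchanged except for an added token_count metadata field."""
    for i in range(0, len(documents), batch_size):
        batch = documents[i : i + batch_size]
        encodings = tokenizer.encode_batch([doc["text"] for doc in batch])
        for doc, enc in zip(batch, encodings):
            # Token count = number of ids, plus 1 if counting the EOS token
            count = len(enc.ids) + (1 if count_eos_token else 0)
            doc["metadata"]["token_count"] = count
            yield doc


docs = [
    {"text": "hello world", "metadata": {}},
    {"text": "one two three", "metadata": {}},
]
out = list(count_tokens(docs, FakeTokenizer(), count_eos_token=True))
# out[0]["metadata"]["token_count"] == 3 (2 "tokens" + EOS)
```

The real class inherits tokenizer loading from PipelineStepWithTokenizer; the sketch only shows the per-batch counting shape.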
Usage
Add to a pipeline at any point where token count statistics are needed. Commonly used after deduplication to report the final token-level dataset size, or before and after filtering stages to measure their impact.
Code Reference
Source Location: Repository: huggingface/datatrove, File: src/datatrove/pipeline/tokens/counter.py (L7-55)
Signature:
class TokensCounter(PipelineStepWithTokenizer):
    def __init__(
        self,
        tokenizer_name_or_path: str = "gpt2",
        count_eos_token: bool = False,
        batch_size: int = 10000,
    ):
Import:
from datatrove.pipeline.tokens import TokensCounter
I/O Contract
Inputs:
| Parameter | Type | Required | Description |
|---|---|---|---|
| tokenizer_name_or_path | str | No | HuggingFace tokenizer name or local file path (default "gpt2") |
| count_eos_token | bool | No | Add 1 to count for EOS token per document (default False) |
| batch_size | int | No | Documents per tokenization batch (default 10000) |
Pipeline I/O:
- Input: DocumentsPipeline -- stream of Document objects with text
- Output: DocumentsPipeline -- same documents with token_count added to each document's metadata dictionary
Usage Examples
Example 1 -- Basic token counting with GPT-2:
from datatrove.pipeline.tokens import TokensCounter
counter = TokensCounter(
    tokenizer_name_or_path="gpt2",
)
# After running, each document.metadata["token_count"] contains the count
Example 2 -- Token counting with EOS and custom tokenizer:
from datatrove.pipeline.tokens import TokensCounter
counter = TokensCounter(
    tokenizer_name_or_path="meta-llama/Llama-2-7b-hf",
    count_eos_token=True,
    batch_size=5000,
)
Example 3 -- Using in a pipeline for before/after statistics:
from datatrove.pipeline.tokens import TokensCounter
from datatrove.pipeline.filters import GopherQualityFilter
pipeline = [
    reader,
    TokensCounter(tokenizer_name_or_path="gpt2"),  # count before filtering
    GopherQualityFilter(),
    TokensCounter(tokenizer_name_or_path="gpt2"),  # count after filtering
    writer,
]