
Implementation:Huggingface Datatrove TokensCounter

From Leeroopedia
Sources: Huggingface Datatrove
Domains: Tokenization, Statistics
Last Updated: 2026-02-14

Overview

Pipeline step that counts tokens in each document using a HuggingFace fast tokenizer and stores the count in document metadata without saving tokenized output.

Description

TokensCounter extends PipelineStepWithTokenizer and provides a lightweight token counting pass. It tokenizes document text in batches using tokenizer.encode_batch, extracts the length of each encoding's ids list as the token count, optionally adds 1 for the EOS token, and stores the result in document.metadata["token_count"]. The document is then yielded downstream unchanged (except for the added metadata field).

The class tracks cumulative token statistics via stat_update("tokens", value=count), which populates the pipeline's stats dictionary for reporting. Time tracking is performed at the batch level via track_time(unit="batch").

Usage

Add to a pipeline at any point where token count statistics are needed. Commonly used after deduplication to report the final token-level dataset size, or before and after filtering stages to measure their impact.

Code Reference

Source Location: Repository: huggingface/datatrove, File: src/datatrove/pipeline/tokens/counter.py (L7-55)

Signature:

class TokensCounter(PipelineStepWithTokenizer):
    def __init__(
        self,
        tokenizer_name_or_path: str = "gpt2",
        count_eos_token: bool = False,
        batch_size: int = 10000,
    ):

Import:

from datatrove.pipeline.tokens import TokensCounter

I/O Contract

Inputs:

Parameter Type Required Description
tokenizer_name_or_path str No HuggingFace tokenizer name or local file path (default "gpt2")
count_eos_token bool No Add 1 to count for EOS token per document (default False)
batch_size int No Documents per tokenization batch (default 10000)

Pipeline I/O:

  • Input: DocumentsPipeline -- stream of Document objects with text
  • Output: DocumentsPipeline -- same documents with token_count added to each document's metadata dictionary
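The contract above can be illustrated with a minimal stand-in: documents stream through unchanged except for the added metadata key. The `Doc` class and whitespace tokenization below are placeholders (real datatrove Documents carry more fields, and the real step uses a fast tokenizer):

```python
# Hedged illustration of the I/O contract with a stand-in Document class.
from dataclasses import dataclass, field


@dataclass
class Doc:
    text: str
    metadata: dict = field(default_factory=dict)


def counting_pass(docs, tokenize=lambda t: t.split()):
    # Stand-in for the TokensCounter step: annotate each document's metadata
    # with its token count, then yield the document downstream unchanged.
    for doc in docs:
        doc.metadata["token_count"] = len(tokenize(doc.text))
        yield doc


docs = [Doc("one two three"), Doc("four five")]
out = list(counting_pass(docs))
# out[0].metadata["token_count"] == 3, out[1].metadata["token_count"] == 2
```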

Usage Examples

Example 1 -- Basic token counting with GPT-2:

from datatrove.pipeline.tokens import TokensCounter

counter = TokensCounter(
    tokenizer_name_or_path="gpt2",
)
# After running, each document.metadata["token_count"] contains the count

Example 2 -- Token counting with EOS and custom tokenizer:

from datatrove.pipeline.tokens import TokensCounter

counter = TokensCounter(
    tokenizer_name_or_path="meta-llama/Llama-2-7b-hf",
    count_eos_token=True,
    batch_size=5000,
)

Example 3 -- Using in a pipeline for before/after statistics:

from datatrove.pipeline.tokens import TokensCounter
from datatrove.pipeline.filters import GopherQualityFilter

pipeline = [
    reader,
    TokensCounter(tokenizer_name_or_path="gpt2"),  # count before filtering
    GopherQualityFilter(),
    TokensCounter(tokenizer_name_or_path="gpt2"),  # count after filtering
    writer,
]
