Implementation:Datajuicer Data juicer TokenNumFilter
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Filtering |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
Concrete tool, provided by Data-Juicer, for filtering data samples by token count.
Description
TokenNumFilter is a filter operator that keeps samples whose token count, as produced by a HuggingFace tokenizer, falls within a specified range. It extends Filter and follows the two-phase compute_stats/process pattern: compute_stats tokenizes the text with the configured HuggingFace tokenizer (default: EleutherAI/pythia-6.9b-deduped) via get_words_from_document, counts the resulting tokens, and caches the count under num_token; process then keeps or drops the sample based on that cached count. Token counts are a more precise measure than character length because they correspond directly to model input size, provided the tokenizer matches the one used by the target model. This makes the filter useful for ensuring that training samples fit within a context window limit.
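The two-phase pattern described above can be sketched in plain Python. This is only an illustration, not Data-Juicer's code: the whitespace tokenizer stands in for a HuggingFace tokenizer, the `stats` field name and the helper `filter_samples` are hypothetical, and treating both bounds as inclusive is an assumption.

```python
import sys
from typing import Callable, Dict, List

STATS_FIELD = "stats"        # hypothetical stand-in for the sample's stats field
NUM_TOKEN_KEY = "num_token"  # stat name taken from the description above


def whitespace_tokenize(text: str) -> List[str]:
    """Toy tokenizer: splits on whitespace (not a HuggingFace tokenizer)."""
    return text.split()


def compute_stats(
    sample: Dict,
    tokenize: Callable[[str], List[str]] = whitespace_tokenize,
) -> Dict:
    """Phase 1: count tokens once and cache the count in the sample's stats."""
    stats = sample.setdefault(STATS_FIELD, {})
    if NUM_TOKEN_KEY not in stats:  # reuse a cached count if one exists
        stats[NUM_TOKEN_KEY] = len(tokenize(sample["text"]))
    return sample


def process(sample: Dict, min_num: int = 10, max_num: int = sys.maxsize) -> bool:
    """Phase 2: keep the sample if its cached count is within [min_num, max_num].

    Inclusive bounds are an assumption of this sketch.
    """
    return min_num <= sample[STATS_FIELD][NUM_TOKEN_KEY] <= max_num


def filter_samples(
    samples: List[Dict], min_num: int = 10, max_num: int = sys.maxsize
) -> List[Dict]:
    """Hypothetical helper: run both phases and return the kept samples."""
    return [s for s in samples if process(compute_stats(s), min_num, max_num)]
```

Separating the two phases means the token count is computed once and stays attached to the sample, so it can be inspected or reused without tokenizing the text again.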
Usage
Import this operator when a pipeline needs to filter samples by token count. It can be configured in a YAML recipe or instantiated directly in Python.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/ops/filter/token_num_filter.py
Signature
@OPERATORS.register_module("token_num_filter")
class TokenNumFilter(Filter):
    def __init__(
        self,
        hf_tokenizer: str = "EleutherAI/pythia-6.9b-deduped",
        min_num: int = 10,
        max_num: int = sys.maxsize,
        *args,
        **kwargs,
    ):
Import
from data_juicer.ops.filter.token_num_filter import TokenNumFilter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hf_tokenizer | str | No | HuggingFace tokenizer name (default: "EleutherAI/pythia-6.9b-deduped") |
| min_num | int | No | Minimum token count to keep samples (default: 10) |
| max_num | int | No | Maximum token count to keep samples (default: sys.maxsize) |
Outputs
| Name | Type | Description |
|---|---|---|
| samples | Dict | Filtered samples with num_token stat computed |
Usage Examples
YAML Configuration
process:
  - token_num_filter:
      hf_tokenizer: "EleutherAI/pythia-6.9b-deduped"
      min_num: 10
      max_num: 8192
Python API
from data_juicer.ops.filter.token_num_filter import TokenNumFilter

# Keep samples with 10 to 8192 tokens, using the default tokenizer
op = TokenNumFilter(min_num=10, max_num=8192)