Implementation:Turboderp org Exllamav2 Tokenize Calibration
| Knowledge Sources | |
|---|---|
| Domains | Quantization, NLP, Data_Preprocessing |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
A concrete exllamav2 tool that tokenizes calibration text into a fixed-shape tensor of token IDs.
Description
The tokenize function reads a calibration dataset (either a user-supplied Parquet file or the built-in multi-domain standard calibration set), encodes the text using the model's tokenizer, and saves the resulting input_ids tensor as a safetensors file. When using the standard calibration set, it blends text from C4, Wikipedia, code, TinyStories, multilingual, technical, and random-token sources into a single matrix. Two tokenization passes are supported: one for measurement (fewer rows, shorter context) and one for the full quantization pass.
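The reshaping step can be illustrated with a minimal sketch. It uses plain Python lists rather than torch tensors so that it runs without dependencies. The helper name `reshape_token_stream` is hypothetical, and raising an error on a too-short stream is an assumption, not necessarily how exllamav2 behaves.

```python
from typing import List


def reshape_token_stream(token_ids: List[int], num_rows: int, length: int) -> List[List[int]]:
    """Cut a flat stream of token IDs into num_rows rows of exactly `length` tokens.

    Hypothetical helper for illustration; surplus tokens past num_rows * length
    are dropped, and a stream that is too short is rejected (an assumption).
    """
    needed = num_rows * length
    if len(token_ids) < needed:
        raise ValueError(f"need {needed} tokens, got {len(token_ids)}")
    trimmed = token_ids[:needed]
    return [trimmed[r * length:(r + 1) * length] for r in range(num_rows)]


# Ten fake token IDs reshaped into a 2 x 4 matrix; the last two IDs are discarded.
stream = list(range(10))
rows = reshape_token_stream(stream, num_rows=2, length=4)
print(rows)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

With torch, the equivalent of the final list comprehension would be a view of the trimmed 1-D tensor as `(num_rows, length)`.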
Usage
Call tokenize as the first step of the EXL2 conversion pipeline, once with measure=True to generate measurement-phase calibration data, and once with measure=False to generate quantization-phase calibration data.
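How the `measure` flag selects the calibration shape can be sketched as follows. The helper `calibration_shape` is hypothetical and only mirrors the key names listed in the I/O Contract; it is not part of the exllamav2 API.

```python
from typing import Tuple


def calibration_shape(job: dict, measure: bool = False) -> Tuple[int, int]:
    """Return (rows, length) for the requested phase.

    Hypothetical helper: measure=True reads the measurement_* keys,
    measure=False reads dataset_rows and length.
    """
    if measure:
        return job["measurement_rows"], job["measurement_length"]
    return job["dataset_rows"], job["length"]


job = {
    "dataset_rows": 100,
    "measurement_rows": 16,
    "length": 2048,
    "measurement_length": 2048,
}
print(calibration_shape(job, measure=True))   # (16, 2048)
print(calibration_shape(job, measure=False))  # (100, 2048)
```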
Code Reference
Source Location
- Repository: exllamav2
- File: exllamav2/conversion/tokenize.py
- Lines: L39-62 (main tokenize function), L8-36 (get_tokens helper), L64-207 (get_standard_calibration)
Signature
def tokenize(job, save_fn, tokenizer, measure=False, noise_rows=None):
Supporting Functions
def get_tokens(num_rows, length, filename, tokenizer):
    """Read a Parquet file, tokenize all text, reshape into (num_rows, length)."""

def get_standard_calibration(job, measure, tokenizer, noise_rows=None):
    """Build a multi-domain calibration tensor from built-in UTF-8 text files."""
Import
from exllamav2.conversion.tokenize import tokenize
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| job | dict | Yes | Conversion job state dictionary. Key fields: cal_dataset (path to Parquet file or None for default), dataset_rows (int), measurement_rows (int), length (int), measurement_length (int), out_dir (str) |
| save_fn | callable | Yes | Callback to persist job state (called externally, not within tokenize itself) |
| tokenizer | ExLlamaV2Tokenizer | Yes | The model's tokenizer instance used to encode text into token IDs |
| measure | bool | No | If True, use measurement_rows and measurement_length; if False (default), use dataset_rows and length |
| noise_rows | tuple(int, int) or None | No | Optional pair (measure_noise, quant_noise) specifying the number of noise rows to append in each phase |
Outputs
| Name | Type | Description |
|---|---|---|
| cal_data.safetensors | File (safetensors) | Saved to job["out_dir"]/cal_data.safetensors containing tensor input_ids of shape (num_rows, length) with dtype torch.long |
| job["cal_filename"] | str (side effect) | The path to the saved calibration file is written back into the job dict |
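To inspect the output after a run, something like the sketch below could be used. Both helper names are hypothetical. `load_file` is the standard loader from `safetensors.torch` and is imported lazily so the path helper works even where torch is not installed.

```python
import os


def calibration_path(job: dict) -> str:
    """Expected location of the calibration file, per the Outputs table."""
    return os.path.join(job["out_dir"], "cal_data.safetensors")


def load_calibration_ids(job: dict):
    """Load the input_ids tensor written by tokenize (hypothetical helper).

    Prefers job["cal_filename"] when present, otherwise falls back to the
    expected path. Requires the safetensors and torch packages.
    """
    from safetensors.torch import load_file

    tensors = load_file(job.get("cal_filename", calibration_path(job)))
    return tensors["input_ids"]


job = {"out_dir": "/tmp/exl2_work"}
print(calibration_path(job))
```

After a successful run, `load_calibration_ids(job).shape` should match the `(num_rows, length)` shape described in the table above.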
Standard Calibration Composition
When job["cal_dataset"] is None, the function assembles rows from built-in files:
| Source | File | Rows (measure) | Rows (quantize) | Notes |
|---|---|---|---|---|
| C4 | c4.utf8 | 2 | 10 | General web text, contiguous tokenization |
| Wikipedia | wiki.utf8 | 4 | 48 | Half without BOS, half with BOS prefix |
| Code | code.utf8 | 3 | 15 | Programming language samples |
| TinyStories | tiny.utf8 | 2 | 10 | Simple stories, half without BOS, half with BOS |
| Multilingual | multilingual.utf8 | 3 | 15 | Contiguous multilingual text |
| Multilingual (shuffled) | multilingual.utf8 | 1 | 5 | 128-token segments randomly shuffled |
| Random tokens | (generated) | 2 | 2 | Uniform random token IDs from full vocabulary |
| Technical | technical.utf8 | 2 | 10 | Scientific and mathematical text |
| Noise | (generated) | 0 | 0 | Optional negative-ones rows if noise_rows is set |
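The "Multilingual (shuffled)" row can be illustrated with a short sketch: a token sequence is cut into fixed-size segments, the segments are permuted, and the result is flattened again. This is an illustration of the idea only. The function name `shuffle_segments`, the `seed` parameter, and the use of Python's `random` module are assumptions rather than the library's actual code.

```python
import random
from typing import List, Optional


def shuffle_segments(token_ids: List[int], segment_len: int = 128,
                     seed: Optional[int] = None) -> List[int]:
    """Permute fixed-size segments of a token sequence.

    Token order inside each segment is preserved; only the order of the
    segments changes. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    segments = [token_ids[i:i + segment_len]
                for i in range(0, len(token_ids), segment_len)]
    rng.shuffle(segments)
    return [tok for seg in segments for tok in seg]


# A fake 512-token row: four 128-token segments end up in a random order.
row = list(range(512))
shuffled = shuffle_segments(row, segment_len=128, seed=0)
print(len(shuffled), [shuffled[i] for i in range(0, 512, 128)])
```

Shuffling at segment granularity keeps local context intact within each 128-token window while breaking long-range coherence across the row.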
Usage Examples
Basic Example
from exllamav2 import ExLlamaV2Tokenizer
from exllamav2.conversion.tokenize import tokenize
# Assume job dict and tokenizer are already configured
job = {
    "cal_dataset": None,  # Use built-in standard calibration
    "dataset_rows": 100,
    "measurement_rows": 16,
    "length": 2048,
    "measurement_length": 2048,
    "out_dir": "/tmp/exl2_work",
}

def save_fn():
    pass  # Persist job state to disk
# Tokenize for measurement phase
tokenize(job, save_fn, tokenizer, measure=True)
# Tokenize for quantization phase
tokenize(job, save_fn, tokenizer, measure=False)
# Result: job["cal_filename"] points to the saved safetensors file
print(job["cal_filename"])
# /tmp/exl2_work/cal_data.safetensors
Custom Dataset Example
# Reuses save_fn and tokenizer from the basic example
job = {
    "cal_dataset": "/data/my_calibration.parquet",  # User-supplied Parquet file
    "dataset_rows": 200,
    "measurement_rows": 32,
    "length": 4096,
    "measurement_length": 2048,
    "out_dir": "/tmp/exl2_work",
}
tokenize(job, save_fn, tokenizer, measure=False)
Dependencies
- torch -- tensor creation and manipulation
- pandas / fastparquet -- Parquet file reading for custom datasets
- safetensors.torch.save_file -- serialization of the input_ids tensor
- ExLlamaV2Tokenizer -- model-specific text-to-token encoding