Implementation:Turboderp org Exllamav2 Tokenize Calibration

Knowledge Sources	ExLlamaV2
Domains	Quantization, NLP, Data_Preprocessing
Last Updated	2026-02-15 00:00 GMT

Overview

A concrete tool, provided by exllamav2, for tokenizing calibration text into a fixed-shape tensor of token IDs.

Description

The tokenize function reads a calibration dataset (either a user-supplied Parquet file or the built-in multi-domain standard calibration set), encodes the text using the model's tokenizer, and saves the resulting input_ids tensor as a safetensors file. When using the standard calibration set, it blends text from C4, Wikipedia, code, TinyStories, multilingual, technical, and random-token sources into a single matrix. Two tokenization passes are supported: one for measurement (sized by measurement_rows and measurement_length, typically fewer rows) and one for the full quantization pass (sized by dataset_rows and length).

Usage

Call tokenize at the start of each phase of the EXL2 conversion pipeline: once with measure=True to generate measurement-phase calibration data, and once with measure=False to generate quantization-phase calibration data.

Code Reference

Source Location

Repository: exllamav2
File: exllamav2/conversion/tokenize.py
Lines: L39-62 (main tokenize function), L8-36 (get_tokens helper), L64-207 (get_standard_calibration)

Signature

def tokenize(job, save_fn, tokenizer, measure=False, noise_rows=None):

Supporting Functions

def get_tokens(num_rows, length, filename, tokenizer):
    """Read a Parquet file, tokenize all text, reshape into (num_rows, length)."""

def get_standard_calibration(job, measure, tokenizer, noise_rows=None):
    """Build a multi-domain calibration tensor from built-in UTF-8 text files."""

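The reshaping step named in the get_tokens docstring can be illustrated without any library. The sketch below is not the library's code: plain Python lists stand in for torch tensors, `pack_rows` is a hypothetical helper name, and raising an error on too-short input is an assumption about how that case might be handled.

```python
def pack_rows(token_ids, num_rows, length):
    """Cut a flat token stream into num_rows rows of exactly `length` tokens.

    Hypothetical helper for illustration; plain lists stand in for tensors.
    """
    needed = num_rows * length
    if len(token_ids) < needed:
        # Assumption: treat insufficient text as an error rather than padding.
        raise ValueError(f"need {needed} tokens, got {len(token_ids)}")
    # Tokens beyond num_rows * length are dropped.
    return [token_ids[r * length:(r + 1) * length] for r in range(num_rows)]


stream = list(range(10))  # stand-in for an encoded document
rows = pack_rows(stream, num_rows=3, length=3)
print(rows)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```

Every row has the same length, which is what lets the result be stored as a single two-dimensional tensor.
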
Import

from exllamav2.conversion.tokenize import tokenize

I/O Contract

Inputs

Name	Type	Required	Description
job	dict	Yes	Conversion job state dictionary. Key fields: `cal_dataset` (path to Parquet file or None for default), `dataset_rows` (int), `measurement_rows` (int), `length` (int), `measurement_length` (int), `out_dir` (str)
save_fn	callable	Yes	Callback to persist job state (called externally, not within tokenize itself)
tokenizer	ExLlamaV2Tokenizer	Yes	The model's tokenizer instance used to encode text into token IDs
measure	bool	No	If `True`, use `measurement_rows` and `measurement_length`; if `False` (default), use `dataset_rows` and `length`
noise_rows	tuple(int, int) or None	No	Optional pair `(measure_noise, quant_noise)` specifying number of noise rows to append
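
As a rough illustration of the `measure` row in the table above, the flag decides which pair of job fields sets the tensor shape. `select_shape` is a hypothetical name used only for this sketch; it is not part of exllamav2.

```python
def select_shape(job, measure=False):
    """Return (rows, length) for the requested pass, per the inputs table."""
    if measure:
        return job["measurement_rows"], job["measurement_length"]
    return job["dataset_rows"], job["length"]


# Example job fields (values chosen for illustration)
job = {
    "dataset_rows": 100,
    "measurement_rows": 16,
    "length": 2048,
    "measurement_length": 2048,
}
measure_shape = select_shape(job, measure=True)
quant_shape = select_shape(job, measure=False)
print(measure_shape, quant_shape)  # (16, 2048) (100, 2048)
```
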

Outputs

Name	Type	Description
cal_data.safetensors	File (safetensors)	Saved to `job["out_dir"]/cal_data.safetensors` containing tensor `input_ids` of shape `(num_rows, length)` with dtype `torch.long`
job["cal_filename"]	str (side effect)	The path to the saved calibration file is written back into the job dict
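
The output contract can be sketched as a path computation plus a write-back into the job dict. This is an illustration under the assumption that the path is built exactly as the table states; `calibration_path` is a hypothetical helper, not a library function.

```python
import os


def calibration_path(job):
    """Build the output path described in the outputs table."""
    return os.path.join(job["out_dir"], "cal_data.safetensors")


job = {"out_dir": "/tmp/exl2_work"}
# Side effect described above: the path is recorded in the job dict.
job["cal_filename"] = calibration_path(job)
print(os.path.basename(job["cal_filename"]))  # cal_data.safetensors
```

Because the path as documented does not depend on `measure`, a measurement pass and a quantization pass that share an `out_dir` resolve to the same file.
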

Standard Calibration Composition

When job["cal_dataset"] is None, the function assembles rows from built-in files:

Source	File	Rows (measure)	Rows (quantize)	Notes
C4	c4.utf8	2	10	General web text, contiguous tokenization
Wikipedia	wiki.utf8	4	48	Half without BOS, half with BOS prefix
Code	code.utf8	3	15	Programming language samples
TinyStories	tiny.utf8	2	10	Simple stories, half without BOS, half with BOS
Multilingual	multilingual.utf8	3	15	Contiguous multilingual text
Multilingual (shuffled)	multilingual.utf8	1	5	128-token segments randomly shuffled
Random tokens	(generated)	2	2	Uniform random token IDs from full vocabulary
Technical	technical.utf8	2	10	Scientific and mathematical text
Noise	(generated)	0	0	Optional negative-ones rows if `noise_rows` is set
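
Two of the row treatments in the table, segment shuffling and BOS prefixing, can be sketched on plain lists of token IDs. This is a minimal illustration, not the library's implementation: `shuffle_segments` and `with_bos` are hypothetical names, the fixed seed is only there to make the sketch repeatable, and dropping the last token to keep the row length constant after adding BOS is an assumption.

```python
import random


def shuffle_segments(row, segment_len=128, seed=0):
    """Shuffle a row in fixed-size segments, keeping token order inside each one."""
    segments = [row[i:i + segment_len] for i in range(0, len(row), segment_len)]
    random.Random(seed).shuffle(segments)
    return [tok for seg in segments for tok in seg]


def with_bos(row, bos_id):
    """Prepend BOS and drop the last token so the row keeps its length (assumption)."""
    return [bos_id] + row[:-1]


row = list(range(512))              # stand-in for one tokenized row
shuffled = shuffle_segments(row)    # same tokens, segment order permuted
prefixed = with_bos(row, bos_id=1)  # bos_id=1 is a placeholder value
print(len(shuffled), len(prefixed))  # 512 512
```

Shuffling at segment granularity keeps local context intact within each 128-token window while breaking the long-range order of the row.
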

Usage Examples

Basic Example

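from exllamav2 import ExLlamaV2Tokenizer
from exllamav2.conversion.tokenize import tokenize

# Assume `tokenizer` is an already-constructed ExLlamaV2Tokenizer for the
# model being converted. The job dict below shows only the key fields
# listed in the I/O contract.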
job = {
    "cal_dataset": None,  # Use built-in standard calibration
    "dataset_rows": 100,
    "measurement_rows": 16,
    "length": 2048,
    "measurement_length": 2048,
    "out_dir": "/tmp/exl2_work",
}

def save_fn():
    pass  # Persist job state to disk

# Tokenize for measurement phase
tokenize(job, save_fn, tokenizer, measure=True)

# Tokenize for quantization phase
tokenize(job, save_fn, tokenizer, measure=False)

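# Result: job["cal_filename"] points to the saved safetensors file.
# Both passes save to out_dir/cal_data.safetensors, so after the second
# call the file holds the quantization-phase data.
print(job["cal_filename"])
# /tmp/exl2_work/cal_data.safetensors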

Custom Dataset Example

job = {
    "cal_dataset": "/data/my_calibration.parquet",
    "dataset_rows": 200,
    "measurement_rows": 32,
    "length": 4096,
    "measurement_length": 2048,
    "out_dir": "/tmp/exl2_work",
}

# Reuses save_fn and tokenizer from the basic example
tokenize(job, save_fn, tokenizer, measure=False)

Dependencies

torch -- tensor creation and manipulation
pandas / fastparquet -- Parquet file reading for custom datasets
safetensors.torch.save_file -- serialization of the input_ids tensor
ExLlamaV2Tokenizer -- model-specific text-to-token encoding

Related Pages

Implements Principle

Principle:Turboderp_org_Exllamav2_Calibration_Tokenization
