Implementation:Turboderp org Exllamav2 Tokenize Calibration

From Leeroopedia
Revision as of 14:02, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Turboderp_org_Exllamav2_Tokenize_Calibration.md)
Knowledge Sources
Domains: Quantization, NLP, Data_Preprocessing
Last Updated: 2026-02-15 00:00 GMT

Overview

Concrete tool, provided by exllamav2, for tokenizing calibration text into a fixed-shape tensor of token IDs.

Description

The tokenize function reads a calibration dataset, either a user-supplied Parquet file or the built-in multi-domain standard calibration set, encodes the text with the model's tokenizer, and saves the resulting input_ids tensor as a safetensors file. With the standard calibration set, it blends text from C4, Wikipedia, code, TinyStories, multilingual, technical, and random-token sources into a single matrix of token IDs. Two tokenization passes are supported: one for measurement (typically fewer rows and a shorter context, set by measurement_rows and measurement_length) and one for the full quantization pass (set by dataset_rows and length).

Usage

Call tokenize as the first step of the EXL2 conversion pipeline, once with measure=True to generate measurement-phase calibration data, and once with measure=False to generate quantization-phase calibration data.

Code Reference

Source Location

  • Repository: exllamav2
  • File: exllamav2/conversion/tokenize.py
  • Lines: L39-62 (main tokenize function), L8-36 (get_tokens helper), L64-207 (get_standard_calibration)

Signature

def tokenize(job, save_fn, tokenizer, measure=False, noise_rows=None):

Supporting Functions

def get_tokens(num_rows, length, filename, tokenizer):
    """Read a Parquet file, tokenize all text, reshape into (num_rows, length)."""

def get_standard_calibration(job, measure, tokenizer, noise_rows=None):
    """Build a multi-domain calibration tensor from built-in UTF-8 text files."""
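The packing step that the get_tokens docstring describes (tokenize all text, then reshape into num_rows by length) can be illustrated without the real tokenizer or a Parquet reader. The following is a minimal sketch, not the library's code: toy_encode and pack_rows are hypothetical names, the whitespace tokenizer is a stand-in for ExLlamaV2Tokenizer, and raising an error when the text is too short is an assumption about how that case might be handled.

```python
from typing import Callable, Iterable, List


def toy_encode(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one ID per whitespace-separated word."""
    return [sum(ord(ch) for ch in word) % 1000 for word in text.split()]


def pack_rows(texts: Iterable[str],
              encode: Callable[[str], List[int]],
              num_rows: int,
              length: int) -> List[List[int]]:
    """Concatenate token IDs from all texts and cut them into num_rows rows of `length` tokens."""
    needed = num_rows * length
    stream: List[int] = []
    for text in texts:
        stream.extend(encode(text))
        if len(stream) >= needed:
            break  # enough tokens collected, the remaining texts are not needed
    if len(stream) < needed:
        # Assumption: fail loudly rather than pad when the dataset is too small
        raise ValueError(f"need {needed} tokens, got {len(stream)}")
    stream = stream[:needed]  # drop the surplus so the matrix is exactly rectangular
    return [stream[r * length:(r + 1) * length] for r in range(num_rows)]


docs = ["the quick brown fox", "jumps over the lazy dog", "and keeps running far away"]
matrix = pack_rows(docs, toy_encode, num_rows=3, length=4)
print(len(matrix), len(matrix[0]))  # 3 4
```

Because the rows are cut from one contiguous stream, a row may start in the middle of a document; that is the usual trade-off of fixed-shape packing and keeps every row fully populated with real tokens.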

Import

from exllamav2.conversion.tokenize import tokenize

I/O Contract

Inputs

Name | Type | Required | Description
job | dict | Yes | Conversion job state dictionary. Key fields: cal_dataset (path to Parquet file or None for default), dataset_rows (int), measurement_rows (int), length (int), measurement_length (int), out_dir (str)
save_fn | callable | Yes | Callback to persist job state (called externally, not within tokenize itself)
tokenizer | ExLlamaV2Tokenizer | Yes | The model's tokenizer instance used to encode text into token IDs
measure | bool | No | If True, use measurement_rows and measurement_length; if False (default), use dataset_rows and length
noise_rows | tuple(int, int) or None | No | Optional pair (measure_noise, quant_noise) specifying the number of noise rows to append
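The effect of the measure flag on the output shape can be written down in a few lines. This is a sketch of the selection rule stated in the inputs table, not the function's actual source; calibration_shape is a hypothetical helper name, and the job values are illustrative.

```python
from typing import Any, Dict, Tuple


def calibration_shape(job: Dict[str, Any], measure: bool = False) -> Tuple[int, int]:
    """Return (num_rows, length) for the requested pass, following the inputs table."""
    if measure:
        return int(job["measurement_rows"]), int(job["measurement_length"])
    return int(job["dataset_rows"]), int(job["length"])


# Illustrative job values, not library defaults
job = {
    "dataset_rows": 200,
    "measurement_rows": 32,
    "length": 4096,
    "measurement_length": 2048,
}

print(calibration_shape(job, measure=True))   # (32, 2048)
print(calibration_shape(job, measure=False))  # (200, 4096)
```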

Outputs

Name | Type | Description
cal_data.safetensors | File (safetensors) | Saved to job["out_dir"]/cal_data.safetensors, containing tensor input_ids of shape (num_rows, length) with dtype torch.long
job["cal_filename"] | str (side effect) | The path to the saved calibration file is written back into the job dict

Standard Calibration Composition

When job["cal_dataset"] is None, the function assembles rows from built-in files:

Source | File | Rows (measure) | Rows (quantize) | Notes
C4 | c4.utf8 | 2 | 10 | General web text, contiguous tokenization
Wikipedia | wiki.utf8 | 4 | 48 | Half without BOS, half with BOS prefix
Code | code.utf8 | 3 | 15 | Programming language samples
TinyStories | tiny.utf8 | 2 | 10 | Simple stories, half without BOS, half with BOS
Multilingual | multilingual.utf8 | 3 | 15 | Contiguous multilingual text
Multilingual (shuffled) | multilingual.utf8 | 1 | 5 | 128-token segments randomly shuffled
Random tokens | (generated) | 2 | 2 | Uniform random token IDs from full vocabulary
Technical | technical.utf8 | 2 | 10 | Scientific and mathematical text
Noise | (generated) | 0 | 0 | Optional negative-ones rows if noise_rows is set
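The notes column mentions three row-building techniques: a BOS prefix on half of the rows, shuffling of 128-token segments, and uniform random token IDs. The sketch below shows one way each could be done on plain Python lists. It is not taken from the exllamav2 source: all function names are hypothetical, the BOS ID and vocabulary size are made-up values, and dropping the last token to keep the row length after adding BOS is an assumption.

```python
import random
from typing import List


def with_bos_on_second_half(rows: List[List[int]], bos_id: int) -> List[List[int]]:
    """Leave the first half of the rows as-is; prefix BOS on the second half.

    Assumption: the last token is dropped so every row keeps its length.
    """
    half = len(rows) // 2
    plain = [list(r) for r in rows[:half]]
    prefixed = [[bos_id] + list(r[:-1]) for r in rows[half:]]
    return plain + prefixed


def shuffle_segments(row: List[int], segment_len: int, rng: random.Random) -> List[int]:
    """Cut a row into fixed-size segments and reorder the segments randomly."""
    segments = [row[i:i + segment_len] for i in range(0, len(row), segment_len)]
    rng.shuffle(segments)
    return [tok for seg in segments for tok in seg]


def random_rows(num_rows: int, length: int, vocab_size: int,
                rng: random.Random) -> List[List[int]]:
    """Rows of token IDs drawn uniformly from [0, vocab_size)."""
    return [[rng.randrange(vocab_size) for _ in range(length)] for _ in range(num_rows)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

rows = [list(range(10, 18)) for _ in range(4)]
mixed = with_bos_on_second_half(rows, bos_id=1)       # bos_id=1 is a made-up value
shuffled = shuffle_segments(list(range(512)), 128, rng)
noise = random_rows(2, 8, vocab_size=32000, rng=rng)  # vocab size is illustrative

print(mixed[0][0], mixed[-1][0])  # 10 1
print(len(shuffled), len(noise), len(noise[0]))  # 512 2 8
```

Shuffling whole segments keeps local token statistics intact within each 128-token window while breaking long-range coherence, which is presumably why the table lists it as a separate source from the contiguous multilingual rows.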

Usage Examples

Basic Example

from exllamav2 import ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.conversion.tokenize import tokenize

# Load the tokenizer of the model being converted (placeholder path)
config = ExLlamaV2Config()
config.model_dir = "/path/to/model"
config.prepare()
tokenizer = ExLlamaV2Tokenizer(config)

# Conversion job state
job = {
    "cal_dataset": None,  # Use built-in standard calibration
    "dataset_rows": 100,
    "measurement_rows": 16,
    "length": 2048,
    "measurement_length": 2048,
    "out_dir": "/tmp/exl2_work",
}

def save_fn():
    pass  # Persist job state to disk

# Tokenize for measurement phase
tokenize(job, save_fn, tokenizer, measure=True)

# Each call writes out_dir/cal_data.safetensors (see Outputs), so the
# measurement phase should consume the file before it is regenerated.

# Tokenize for quantization phase
tokenize(job, save_fn, tokenizer, measure=False)

# Result: job["cal_filename"] points to the saved safetensors file
print(job["cal_filename"])
# /tmp/exl2_work/cal_data.safetensors

Custom Dataset Example

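# Reuses save_fn and tokenizer from the basic example
job = {
    "cal_dataset": "/data/my_calibration.parquet",  # User-supplied Parquet file
    "dataset_rows": 200,
    "measurement_rows": 32,
    "length": 4096,
    "measurement_length": 2048,
    "out_dir": "/tmp/exl2_work",
}

# Quantization pass: produces a (200, 4096) input_ids tensor
tokenize(job, save_fn, tokenizer, measure=False)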

Dependencies

  • torch -- tensor creation and manipulation
  • pandas / fastparquet -- Parquet file reading for custom datasets
  • safetensors.torch.save_file -- serialization of the input_ids tensor
  • ExLlamaV2Tokenizer -- model-specific text-to-token encoding

Related Pages

Implements Principle
