
Principle:Shiyu coder Kronos Tokenizer Loading

From Leeroopedia


Field Value
principle_name Tokenizer_Loading
repo Shiyu_coder_Kronos
domains NLP, Time_Series, Quantization
last_updated 2026-02-09 14:00 GMT
implemented_by Implementation:Shiyu_coder_Kronos_KronosTokenizer_From_Pretrained

Summary

Loading a pre-trained VQ-VAE tokenizer that converts continuous financial candlestick data into hierarchical discrete tokens using Binary Spherical Quantization (BSQ).

Concept

The KronosTokenizer is a learned VQ-VAE model that transforms continuous multivariate financial time series (OHLCV candlestick data) into a compact sequence of discrete tokens. This tokenization bridges the gap between continuous price data and the discrete token sequences that autoregressive Transformer models can predict.

The tokenizer uses a two-level hierarchical token system:

  • s1 tokens (coarse level): Capture the broad structure of the price movement.
  • s2 tokens (fine level): Capture residual details conditioned on the coarse tokens.

Loading a pre-trained tokenizer ensures that the encoder, quantizer, and decoder weights are initialized from a checkpoint that has already been trained on financial data, providing robust discretization without requiring end-user training.

Theory

VQ-VAE (Vector Quantized Variational Autoencoder) tokenization with Binary Spherical Quantization (BSQ) uses a learned pipeline:

Input (continuous OHLCV) --> Encoder (Transformer blocks) --> Quantizer (BSQ) --> Decoder (Transformer blocks) --> Reconstruction

The BSQ approach maps continuous encoder outputs to binary codes on a unit hypersphere, producing discrete indices. The two-level hierarchy (s1 + s2) enables the autoregressive model to predict coarse structure first, then refine with fine details.
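The mapping described above can be sketched in a few lines. This is an illustrative NumPy sketch, not the repository's actual quantizer (which also applies a straight-through estimator and the BSQ loss terms during training); the function name `bsq_quantize` is hypothetical.

```python
import numpy as np

def bsq_quantize(z: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Binary Spherical Quantization sketch: project a latent vector onto
    the unit hypersphere, then snap each coordinate to +/- 1/sqrt(d) so the
    quantized code also lies on the hypersphere surface."""
    d = z.shape[-1]
    u = z / np.linalg.norm(z, axis=-1, keepdims=True)  # unit hypersphere
    codes = np.sign(u) / np.sqrt(d)                    # binary code, still unit norm
    bits = (u > 0).astype(np.int64)                    # one bit per dimension
    # Pack the bits into a single integer token index (vocabulary size 2^d).
    index = int(bits.dot(1 << np.arange(d)))
    return codes, index
```

Because each coordinate contributes exactly one bit, a d-dimensional code maps to one of 2^d discrete indices without storing an explicit codebook.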

Key aspects of BSQ:

  • Binary codes are scaled to the [-1, 1] hypersphere surface.
  • The codebook dimension equals s1_bits + s2_bits, so the total vocabulary size is 2^(s1_bits + s2_bits).
  • Parameters beta, gamma0, gamma, and zeta control the quantization loss terms and commitment costs.
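The vocabulary arithmetic from the list above can be made concrete. The bit-widths here are hypothetical values chosen for illustration; the real values come from the checkpoint's configuration.

```python
# Hypothetical bit-widths for illustration; the actual checkpoint config
# stores the real s1_bits / s2_bits values.
s1_bits, s2_bits = 8, 4
codebook_dim = s1_bits + s2_bits   # dimension of the binary code
vocab_size = 2 ** codebook_dim     # total vocabulary: 2^(s1_bits + s2_bits)
s1_vocab = 2 ** s1_bits            # coarse-level (s1) vocabulary
s2_vocab = 2 ** s2_bits            # fine-level (s2) vocabulary
```

Note that the two-level split factorizes the vocabulary: s1_vocab * s2_vocab equals the full vocab_size, which is what lets the autoregressive model predict the coarse token before the fine one.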

Loading from a pre-trained checkpoint uses the PyTorchModelHubMixin interface from HuggingFace Hub, which handles downloading model weights and configuration from a Hub repository or local path.
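A minimal loading sketch, assuming the `KronosTokenizer` class is importable from the repository's `model` module and that a tokenizer checkpoint is published on the Hub; the repo id below is a placeholder to substitute with the checkpoint you actually use.

```python
from model import KronosTokenizer  # class defined in the Kronos repository

# PyTorchModelHubMixin provides from_pretrained(): it downloads (or reads
# locally) the config and weights, instantiates the model, and loads the
# state dict. The Hub repo id here is illustrative.
tokenizer = KronosTokenizer.from_pretrained("<hub-org>/<tokenizer-checkpoint>")
tokenizer.eval()  # inference mode: disable dropout, freeze norm statistics
```

A local directory path containing the saved config and weights can be passed in place of a Hub repo id.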

Source

  • Paper reference: "Language Models are Unsupervised Multitask Learners" for general tokenization concepts.
  • BSQ approach for binary spherical quantization of continuous signals.
  • Repository: Kronos on GitHub

Domains

  • NLP: Tokenization concepts adapted from language modeling.
  • Time_Series: Applied to multivariate financial time series data.
  • Quantization: Binary Spherical Quantization for discrete codebook construction.

Related Principles
