Principle:Shiyu_coder_Kronos_Tokenizer_Loading
| Field | Value |
|---|---|
| principle_name | Tokenizer_Loading |
| repo | Shiyu_coder_Kronos |
| domains | NLP, Time_Series, Quantization |
| last_updated | 2026-02-09 14:00 GMT |
| implemented_by | Implementation:Shiyu_coder_Kronos_KronosTokenizer_From_Pretrained |
Summary
Loading a pre-trained VQ-VAE tokenizer that converts continuous financial candlestick data into hierarchical discrete tokens using Binary Spherical Quantization (BSQ).
Concept
The KronosTokenizer is a learned VQ-VAE model that transforms continuous multivariate financial time series (OHLCV candlestick data) into a compact sequence of discrete tokens. This tokenization bridges the gap between continuous price data and the discrete token sequences that autoregressive Transformer models can predict.
The tokenizer uses a two-level hierarchical token system:
- s1 tokens (coarse level): Capture the broad structure of the price movement.
- s2 tokens (fine level): Capture residual details conditioned on the coarse tokens.
Loading a pre-trained tokenizer ensures that the encoder, quantizer, and decoder weights are initialized from a checkpoint that has already been trained on financial data, providing robust discretization without requiring end-user training.
Theory
VQ-VAE (Vector Quantized Variational Autoencoder) tokenization with Binary Spherical Quantization (BSQ) uses a learned pipeline:
Input (continuous OHLCV) --> Encoder (Transformer blocks) --> Quantizer (BSQ) --> Decoder (Transformer blocks) --> Reconstruction
The BSQ approach maps continuous encoder outputs to binary codes on a unit hypersphere, producing discrete indices. The two-level hierarchy (s1 + s2) enables the autoregressive model to predict coarse structure first, then refine with fine details.
Key aspects of BSQ:
- Binary codes are scaled to the [-1, 1] hypersphere surface.
- The codebook dimension equals s1_bits + s2_bits, so the total vocabulary size is 2^(s1_bits + s2_bits).
- Parameters beta, gamma0, gamma, and zeta control the quantization loss terms and commitment costs.
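The bit-packing arithmetic above can be illustrated with a minimal sketch. This is not the repository's implementation: the bit widths are small example values chosen for readability, and the index layout (s1 in the high-order bits, s2 in the low-order bits) is an assumption for illustration. It shows the two properties the text describes: binary codes scaled so they lie on the unit hypersphere, and a vocabulary of 2^(s1_bits + s2_bits) codes split into a coarse/fine token pair.

```python
import math

# Illustrative bit widths, not Kronos's actual configuration.
S1_BITS = 4
S2_BITS = 4
CODEBOOK_DIM = S1_BITS + S2_BITS   # quantizer operates on this many dimensions
VOCAB_SIZE = 2 ** CODEBOOK_DIM     # total number of discrete codes

def bsq_quantize(z):
    """Map a continuous latent vector to a binary code on the unit hypersphere."""
    assert len(z) == CODEBOOK_DIM
    scale = 1.0 / math.sqrt(CODEBOOK_DIM)
    # Each component becomes +scale or -scale, so the code has unit L2 norm.
    return [scale if v >= 0 else -scale for v in z]

def code_to_tokens(code):
    """Pack the sign pattern into an integer index, then split it into (s1, s2).
    The high-bits/low-bits split is an assumed layout for illustration."""
    index = 0
    for v in code:
        index = (index << 1) | (1 if v > 0 else 0)
    s1 = index >> S2_BITS               # coarse token: high-order bits
    s2 = index & ((1 << S2_BITS) - 1)   # fine token: low-order bits
    return s1, s2

z = [0.3, -1.2, 0.7, 0.1, -0.4, 0.9, -0.2, 0.5]
code = bsq_quantize(z)
s1, s2 = code_to_tokens(code)
print(VOCAB_SIZE, s1, s2)  # 2^8 = 256 codes; one (coarse, fine) token pair
```

Note that the code's unit norm (8 components of magnitude 1/sqrt(8)) is what makes the quantizer "spherical": every codebook entry sits on the surface of the unit hypersphere.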
Loading from a pre-trained checkpoint uses the PyTorchModelHubMixin interface from HuggingFace Hub, which handles downloading model weights and configuration from a Hub repository or local path.
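A minimal sketch of that loading pattern follows. The `from_pretrained` method itself comes from PyTorchModelHubMixin; the import path and the Hub repository id shown here are assumptions for illustration, so check the Kronos repository for the actual module layout and published checkpoint names.

```python
def load_tokenizer(repo_or_path="NeoQuasar/Kronos-Tokenizer-base"):
    """Load a pre-trained KronosTokenizer from the HuggingFace Hub or from a
    local checkpoint directory. Both the import path and the default repo id
    are assumed for this sketch."""
    from model import KronosTokenizer  # assumed import path within the repo

    # PyTorchModelHubMixin.from_pretrained downloads (or reads locally) the
    # model configuration and trained weights, then instantiates the module.
    tokenizer = KronosTokenizer.from_pretrained(repo_or_path)
    tokenizer.eval()  # inference only: the VQ-VAE is not trained further
    return tokenizer
```

Because `from_pretrained` accepts either a Hub repo id or a local path, the same call serves both the download-from-Hub and load-from-disk workflows described above.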
Source
- Paper reference: "Language Models are Unsupervised Multitask Learners" for general tokenization concepts.
- BSQ approach for binary spherical quantization of continuous signals.
- Repository: Kronos on GitHub
Domains
- NLP: Tokenization concepts adapted from language modeling.
- Time_Series: Applied to multivariate financial time series data.
- Quantization: Binary Spherical Quantization for discrete codebook construction.
Related Principles
- Principle:Shiyu_coder_Kronos_Tokenizer_Encoding - The encoding operation performed by the loaded tokenizer.
- Principle:Shiyu_coder_Kronos_Model_Loading - Loading the autoregressive model that consumes tokens produced by this tokenizer.