Principle:Shiyu_coder_Kronos_Tokenizer_Encoding
| Field | Value |
|---|---|
| principle_name | Tokenizer_Encoding |
| repo | Shiyu_coder_Kronos |
| domains | Quantization, Tokenization, Time_Series |
| last_updated | 2026-02-09 14:00 GMT |
| implemented_by | Implementation:Shiyu_coder_Kronos_KronosTokenizer_Encode |
Summary
Encodes continuous OHLCV financial data into hierarchical discrete token indices via learned encoder Transformer layers and Binary Spherical Quantization.
Concept
The encoding process transforms continuous multivariate financial time series data into discrete token indices that can be consumed by the autoregressive Kronos Transformer. This discretization is the fundamental bridge between continuous price data and the discrete sequence modeling paradigm.
The encoder operates on normalized OHLCV features (open, high, low, close, volume, amount) and produces either:
- A single combined index per timestep (when half=False), representing the full codebook entry.
- A pair of hierarchical indices (s1, s2) per timestep (when half=True), representing coarse and fine quantization levels separately.
Theory
The encoding pipeline follows four stages:
```
Input x: (batch, seq_len, d_in)
          |
          v
Linear Embedding: nn.Linear(d_in -> d_model)
          |
          v
Encoder Transformer Blocks: (n_enc_layers - 1) TransformerBlock layers
          |
          v
Quantization Embedding: nn.Linear(d_model -> codebook_dim)
                        where codebook_dim = s1_bits + s2_bits
          |
          v
BSQuantizer: Binary Spherical Quantization
          |
          v
Output: z_indices (discrete token indices)
```
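The four stages above can be traced end to end with a minimal NumPy sketch. All dimensions are illustrative (not the repo's actual config), the weights are random, and the Transformer blocks are stubbed out as an identity step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
batch, seq_len, d_in, d_model = 2, 16, 6, 32
s1_bits, s2_bits = 4, 4
codebook_dim = s1_bits + s2_bits

# Stage 1: linear embedding d_in -> d_model (random weights stand in
# for the learned nn.Linear).
x = rng.standard_normal((batch, seq_len, d_in))  # normalized OHLCV features
W_embed = rng.standard_normal((d_in, d_model))
h = x @ W_embed

# Stage 2: encoder Transformer blocks, stubbed as identity here; the
# real model applies (n_enc_layers - 1) TransformerBlock layers.
h = h

# Stage 3: quantization embedding d_model -> codebook_dim.
W_quant = rng.standard_normal((d_model, codebook_dim))
z = h @ W_quant

# Stage 4: BSQ sign quantization onto the unit hypersphere.
z_q = np.sign(z) / np.sqrt(codebook_dim)
```

Every quantized vector z_q lands on the unit hypersphere, since each of its codebook_dim components has magnitude 1 / sqrt(codebook_dim).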
Binary Spherical Quantization (BSQ)
The BSQuantizer converts continuous vectors into binary codes on a hypersphere:
- The input vector (of dimension codebook_dim = s1_bits + s2_bits) is quantized to binary values {-1, +1}.
- Each bit position contributes to a binary code that indexes into an implicit codebook.
- The binary code is scaled by 1 / sqrt(codebook_dim) to project onto the unit hypersphere surface.
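These three steps can be sketched as a single function. This is a hypothetical illustration, not the repo's BSQuantizer: the function name and the LSB-first bit order for index derivation are assumptions:

```python
import numpy as np

def bsq_quantize(z):
    """Hypothetical sketch of Binary Spherical Quantization (BSQ).

    Quantizes each dimension of z to {-1, +1}, scales the result onto
    the unit hypersphere, and derives an integer codebook index from
    the sign bits (LSB-first bit order assumed for illustration).
    """
    codebook_dim = z.shape[-1]
    codes = np.where(z >= 0.0, 1.0, -1.0)      # binary values {-1, +1}
    quantized = codes / np.sqrt(codebook_dim)  # unit-norm vector
    bits = (codes > 0).astype(np.int64)        # {0, 1} per bit position
    index = (bits * (1 << np.arange(codebook_dim))).sum(axis=-1)
    return quantized, index
```

For example, z = [0.5, -0.2, 0.1] quantizes to signs [+1, -1, +1], i.e. bits [1, 0, 1], which under LSB-first weighting map to index 1 + 4 = 5.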
Hierarchical Indices (half=True)
When half=True, the BSQuantizer splits the codebook dimensions into an s1 part and an s2 part:
- s1_indices: Index computed from the first s1_bits dimensions. Represents the coarse quantization.
- s2_indices: Index computed from the last s2_bits dimensions. Represents the fine quantization.
The s1 vocabulary size is 2^s1_bits and the s2 vocabulary size is 2^s2_bits.
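A minimal sketch of the split, under the same assumptions as above (hypothetical function name, LSB-first bit order):

```python
import numpy as np

def split_hierarchical_indices(bits, s1_bits, s2_bits):
    """Hypothetical sketch: derive (s1, s2) indices from one bit vector.

    `bits` holds s1_bits + s2_bits binary values in {0, 1}; the first
    s1_bits form the coarse index, the remaining s2_bits the fine index
    (LSB-first bit order assumed for illustration).
    """
    assert bits.shape[-1] == s1_bits + s2_bits
    s1 = (bits[..., :s1_bits] * (1 << np.arange(s1_bits))).sum(axis=-1)
    s2 = (bits[..., s1_bits:] * (1 << np.arange(s2_bits))).sum(axis=-1)
    return s1, s2
```

With s1_bits=2 and s2_bits=3, the bit vector [1, 0, 1, 1, 0] splits into coarse bits [1, 0] (index 1, vocabulary size 2^2 = 4) and fine bits [1, 1, 0] (index 3, vocabulary size 2^3 = 8).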
Combined Index (half=False)
When half=False, a single index is computed from all s1_bits + s2_bits dimensions, giving a vocabulary size of 2^(s1_bits + s2_bits).
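The two modes address the same codebook: the flat vocabulary is exactly the product of the two hierarchical vocabularies. A quick check with hypothetical bit widths (illustrative values, not the repo's config):

```python
# Hypothetical bit widths, chosen only to show the vocabulary arithmetic.
s1_bits, s2_bits = 4, 4

s1_vocab = 2 ** s1_bits                    # coarse vocabulary: 16
s2_vocab = 2 ** s2_bits                    # fine vocabulary: 16
combined_vocab = 2 ** (s1_bits + s2_bits)  # flat vocabulary: 256

# 2^(s1_bits + s2_bits) == 2^s1_bits * 2^s2_bits
assert combined_vocab == s1_vocab * s2_vocab
```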
Training vs Inference Use
- half=True is used during the autoregressive pipeline (both training and inference) because the Kronos model predicts s1 and s2 separately via the DualHead.
- half=False is used for evaluation or when a single flat codebook index is sufficient.
Source
- Repository: Kronos on GitHub
- Binary Spherical Quantization for learned discrete representations.
Domains
- Quantization: Binary spherical codebook for discrete representation.
- Tokenization: Converting continuous signals to discrete tokens.
- Time_Series: Applied to multivariate financial time series data.
Related Principles
- Principle:Shiyu_coder_Kronos_Tokenizer_Loading - Loading the tokenizer that performs this encoding.
- Principle:Shiyu_coder_Kronos_Autoregressive_Token_Generation - The generation loop that consumes the encoded tokens.
- Principle:Shiyu_coder_Kronos_Single_Series_Forecasting - The full pipeline that uses encoding as a preprocessing step.