Principle:Shiyu_coder_Kronos_Model_Loading
| Field | Value |
|---|---|
| principle_name | Model_Loading |
| repo | Shiyu_coder_Kronos |
| domains | Deep_Learning, Time_Series, Autoregressive_Models |
| last_updated | 2026-02-09 14:00 GMT |
| implemented_by | Implementation:Shiyu_coder_Kronos_Kronos_From_Pretrained |
Summary
Loading a pre-trained autoregressive Transformer model that predicts discrete token sequences for financial time series forecasting.
Concept
The Kronos model is a decoder-only Transformer that operates on hierarchical discrete tokens produced by the KronosTokenizer. It predicts future tokens in an autoregressive manner: given a sequence of historical tokens, it generates the next token at each step.
The model uses a two-stage prediction approach:
- Stage 1 (s1): Predict the coarse token using the main Transformer output.
- Stage 2 (s2): Predict the fine token, conditioned on the sampled s1 token via a DependencyAwareLayer.
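The two-stage step above can be sketched in pure Python. This is an illustration of the sampling order only, not the repo's actual code: the softmax, the categorical sampler, and the per-s1 conditioning table are stand-ins for the real network components.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng):
    """Draw an index from a categorical distribution."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def predict_step(s1_logits, cond_s2_logits, rng):
    """One autoregressive step: sample s1 first, then s2 conditioned on it."""
    s1 = sample(softmax(s1_logits), rng)
    # Stage 2 sees the *sampled* s1 token (stand-in for the DependencyAwareLayer).
    s2 = sample(softmax(cond_s2_logits[s1]), rng)
    return s1, s2

rng = random.Random(0)
s1_logits = [2.0, 0.1, -1.0]                           # coarse-token scores
cond_s2_logits = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # fine-token scores per s1
print(predict_step(s1_logits, cond_s2_logits, rng))
```

The key point the sketch preserves: s2 is never sampled from an unconditional distribution; its logits depend on which s1 was actually drawn.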
Loading a pre-trained Kronos model initializes all weights from a checkpoint that has been trained on large-scale financial time series data.
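The general `from_pretrained` pattern can be sketched with a toy class. The `TinyModel` class, file names, and checkpoint layout below are invented for illustration; the real Kronos class builds its architecture from a checkpoint's config and then restores the trained weights, but its actual format differs.

```python
import json
import tempfile
from pathlib import Path

class TinyModel:
    """Minimal stand-in for a from_pretrained-style loader: read a config,
    build the model with those hyperparameters, then load the weights."""

    def __init__(self, d_model, vocab_s1, vocab_s2):
        self.d_model = d_model
        self.vocab_s1 = vocab_s1
        self.vocab_s2 = vocab_s2
        self.weights = None  # populated by from_pretrained

    @classmethod
    def from_pretrained(cls, checkpoint_dir):
        ckpt = Path(checkpoint_dir)
        config = json.loads((ckpt / "config.json").read_text())
        model = cls(config["d_model"], config["vocab_s1"], config["vocab_s2"])
        model.weights = json.loads((ckpt / "weights.json").read_text())
        return model

# Write a toy checkpoint, then load it back.
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "config.json").write_text(
        json.dumps({"d_model": 8, "vocab_s1": 4, "vocab_s2": 4}))
    (Path(d) / "weights.json").write_text(json.dumps({"head.bias": [0.0] * 4}))
    model = TinyModel.from_pretrained(d)
    print(model.d_model, len(model.weights["head.bias"]))
```

The design point: the config travels with the checkpoint, so callers never have to know the architecture's hyperparameters to restore the model.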
Theory
The Kronos architecture combines several specialized components:
- HierarchicalEmbedding: Fuses s1 (coarse) and s2 (fine) token embeddings into a single representation vector. This enables the model to attend to both levels of the token hierarchy simultaneously.
- TemporalEmbedding: Encodes timestamp features (minute, hour, weekday, day, month) into the model's hidden space, providing temporal context for financial data patterns (e.g., market hours, day-of-week effects).
- DualHead: A two-stage prediction head:
  - First produces s1 logits (coarse token prediction).
  - Then uses `cond_forward` to produce s2 logits conditioned on the s1 prediction.
- DependencyAwareLayer: Conditions the s2 prediction on the sampled s1 token embedding, ensuring that fine-grained predictions are consistent with the coarse-level structure.
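The interaction between the DualHead and the dependency-aware conditioning can be sketched with toy dense layers. The class below is illustrative only: the method names `forward_s1` and the additive fusion inside `cond_forward` are assumptions, and the real layers are learned PyTorch modules rather than random matrices.

```python
import random

def matvec(W, x):
    """Dense layer without bias: y = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class ToyDualHead:
    """Illustrative two-stage head: s1 logits come from the hidden state;
    s2 logits come from the hidden state fused with the sampled s1 token's
    embedding (a stand-in for the DependencyAwareLayer)."""

    def __init__(self, d, v1, v2, rng):
        self.W1 = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(v1)]
        self.W2 = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(v2)]
        self.s1_embed = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(v1)]

    def forward_s1(self, h):
        return matvec(self.W1, h)

    def cond_forward(self, h, s1_token):
        # Fuse the hidden state with the sampled coarse token's embedding.
        fused = [hi + ei for hi, ei in zip(h, self.s1_embed[s1_token])]
        return matvec(self.W2, fused)

rng = random.Random(42)
head = ToyDualHead(d=8, v1=4, v2=6, rng=rng)
h = [rng.gauss(0, 1.0) for _ in range(8)]
s1_logits = head.forward_s1(h)
s1 = max(range(len(s1_logits)), key=s1_logits.__getitem__)  # greedy "sample"
s2_logits = head.cond_forward(h, s1)
print(len(s1_logits), len(s2_logits))  # 4 6
```

Note that `cond_forward` cannot run until an s1 token has been chosen, which is exactly why generation must interleave sampling between the two stages.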
The autoregressive factorization is:
P(tokens) = ∏_t P(s1_t | context_&lt;t) · P(s2_t | s1_t, context_&lt;t)
This hierarchical decomposition reduces the effective vocabulary size at each prediction step compared to predicting a single combined token.
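The vocabulary-size reduction can be checked with a quick calculation. The codebook sizes below are illustrative, not Kronos's actual configuration:

```python
# Illustrative codebook sizes (not the actual Kronos configuration).
K1, K2 = 256, 256

# A flat scheme needs one softmax over every (s1, s2) combination.
flat_vocab = K1 * K2

# The hierarchical scheme needs two softmaxes, one per stage.
hierarchical_sizes = (K1, K2)

print(flat_vocab)               # 65536
print(max(hierarchical_sizes))  # 256: largest softmax in the hierarchical scheme
```

So each prediction step scores at most K1 or K2 candidates instead of K1 * K2, at the cost of an extra conditional prediction per step.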
Source
- Repository: Kronos on GitHub
- Decoder-only Transformer architecture inspired by GPT-style language models.
Domains
- Deep_Learning: Transformer-based neural architecture.
- Time_Series: Applied to sequential financial data forecasting.
- Autoregressive_Models: Token-by-token generation with conditional dependencies.
Related Principles
- Principle:Shiyu_coder_Kronos_Tokenizer_Loading - Loading the tokenizer that produces the tokens this model consumes.
- Principle:Shiyu_coder_Kronos_Predictor_Initialization - Wrapping the loaded model and tokenizer into a prediction interface.
- Principle:Shiyu_coder_Kronos_Autoregressive_Token_Generation - The generation loop that uses this model for inference.