Principle:Shiyu_coder_Kronos_Autoregressive_Token_Generation
| Field | Value |
|---|---|
| principle_name | Autoregressive_Token_Generation |
| repo | Shiyu_coder_Kronos |
| domains | Autoregressive_Models, Token_Generation, Sampling |
| last_updated | 2026-02-09 14:00 GMT |
| implemented_by | Implementation:Shiyu_coder_Kronos_Auto_Regressive_Inference |
Summary
Step-by-step generation of hierarchical discrete tokens (s1 coarse, s2 fine) using a sliding context window, temperature-controlled sampling, and multi-sample averaging.
Concept
Autoregressive token generation is the core inference mechanism of the Kronos system. Given an initial sequence of encoded tokens representing historical financial data, the model generates future tokens one at a time. Each newly generated token is appended to the context, and the process repeats until the desired prediction length is reached.
The generation operates on hierarchical tokens: at each timestep, a coarse (s1) token is generated first, then a fine (s2) token is generated conditioned on the s1 prediction. This two-stage process enables the model to first capture broad price movement patterns and then refine the details.
Theory
Hierarchical Two-Stage Sampling
At each generation step t, the process follows this sequence:
1. Run Transformer on current token buffer -> get context representation
2. Extract s1 logits from DualHead for position t
3. Apply temperature scaling: logits_s1 = logits_s1 / T
4. Apply top-k or top-p (nucleus) filtering
5. Sample s1 token from filtered distribution
6. Condition s2 prediction on sampled s1 via DependencyAwareLayer
7. Extract s2 logits from DualHead.cond_forward()
8. Apply temperature scaling and filtering to s2 logits
9. Sample s2 token
10. Append (s1, s2) to the token buffer
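The ten steps above can be sketched as a single generation step. This is an illustrative sketch, not the actual Kronos code: the `model_s1_logits` and `model_s2_logits` callables are hypothetical stubs standing in for the DualHead forward and cond_forward calls (via the DependencyAwareLayer), and only top-k filtering is shown for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def sample_token(logits, temperature=1.0, top_k=None):
    # Steps 3/8: temperature scaling
    logits = logits / temperature
    # Steps 4/8: top-k filtering (mask everything below the k-th largest logit)
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = softmax(logits)
    return int(rng.choice(len(probs), p=probs))

def generate_step(context, model_s1_logits, model_s2_logits, T=1.0, top_k=5):
    """One autoregressive step: sample s1 first, then s2 conditioned on s1."""
    s1_logits = model_s1_logits(context)      # steps 1-2
    s1 = sample_token(s1_logits, T, top_k)    # steps 3-5
    s2_logits = model_s2_logits(context, s1)  # steps 6-7
    s2 = sample_token(s2_logits, T, top_k)    # steps 8-9
    return s1, s2                             # step 10: caller appends the pair
```

In the real loop, the returned (s1, s2) pair is appended to the token buffer and the step repeats until the prediction horizon is reached.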
Sliding Context Window
The Transformer has a fixed maximum context length (max_context). When the token sequence exceeds this limit, a sliding window is used:
- A fixed-size buffer of length max_context is maintained.
- When the buffer is full, tokens are shifted left (the oldest token is dropped) and the new token is appended at the end.
- The corresponding temporal features are also windowed to match.
This ensures constant memory usage regardless of prediction length, while still providing the model with the most recent context.
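A minimal sketch of the sliding-window append, assuming the buffer is a Python list of (s1, s2) pairs and using an illustrative `MAX_CONTEXT` of 4 (the real limit is the model's max_context):

```python
MAX_CONTEXT = 4  # illustrative; Kronos uses the model's max_context

def append_with_window(buffer, new_token, max_context=MAX_CONTEXT):
    """Append a token, shifting left (dropping the oldest) when full.

    Memory use stays bounded at max_context regardless of how many
    tokens are generated.
    """
    if len(buffer) >= max_context:
        buffer = buffer[1:]  # drop the oldest token
    return buffer + [new_token]

buf = []
for t in range(6):
    buf = append_with_window(buf, (t, t))
# With max_context=4, only the 4 newest token pairs survive.
```

The same windowing would be applied in parallel to the temporal-feature buffer so both stay aligned.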
Temperature-Controlled Sampling
The temperature parameter T controls the entropy of the sampling distribution:
- T < 1.0: Sharper distribution, more deterministic predictions.
- T = 1.0: Unmodified model probabilities.
- T > 1.0: Flatter distribution, more diverse/random predictions.
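The effect of T on the distribution can be seen directly by dividing a fixed logit vector by different temperatures before the softmax (a generic illustration, not Kronos-specific code):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.0])

sharp = softmax(logits / 0.5)  # T < 1: sharper, more deterministic
base  = softmax(logits / 1.0)  # T = 1: unmodified probabilities
flat  = softmax(logits / 2.0)  # T > 1: flatter, more diverse
# The top token's probability grows as T shrinks.
```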
Top-k and Top-p (Nucleus) Filtering
- Top-k: Only the k most probable tokens are considered. All others have their probability set to zero.
- Top-p (nucleus): The smallest set of tokens whose cumulative probability exceeds p is kept. This dynamically adjusts the number of candidates based on the distribution shape.
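Both filters can be expressed as masking a probability vector and renormalizing; the sketch below works directly on probabilities for clarity (in practice the masking is usually done on logits before the softmax, as in the generation loop above):

```python
import numpy as np

def top_k_filter(probs, k):
    """Zero out all but the k most probable tokens, then renormalize."""
    keep = np.argsort(probs)[-k:]
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability
    exceeds p (nucleus sampling), then renormalize."""
    order = np.argsort(probs)[::-1]          # most probable first
    cum = np.cumsum(probs[order])
    n = np.searchsorted(cum, p) + 1          # tokens needed to exceed p
    out = np.zeros_like(probs)
    out[order[:n]] = probs[order[:n]]
    return out / out.sum()

probs = np.array([0.5, 0.3, 0.1, 0.1])
```

Note how top-p adapts: for a peaked distribution the nucleus may be a single token, while for a flat one it can span most of the vocabulary.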
Multi-Sample Averaging
The input sequence is replicated sample_count times along the batch dimension. Each replica runs the generation loop independently, producing different token sequences due to stochastic sampling. After generation, the decoded continuous values are averaged across samples:
final_prediction = mean(sample_1, sample_2, ..., sample_N)
This reduces the variance inherent in stochastic token sampling and produces more stable predictions.
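A sketch of the averaging step, assuming the decoded per-replica predictions are stacked into an array of shape (sample_count, pred_len); the random values here merely stand in for decoded price trajectories:

```python
import numpy as np

rng = np.random.default_rng(7)
sample_count, pred_len = 5, 3

# Stand-in for the decoded continuous predictions of each replica.
# In Kronos these come from decoding the sampled token sequences.
samples = rng.normal(loc=100.0, scale=1.0, size=(sample_count, pred_len))

# Average across the replica (batch) dimension to get the final forecast.
final_prediction = samples.mean(axis=0)
```

Averaging N independent samples reduces the variance of the estimate by roughly a factor of N, which is the stabilization the section describes.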
Source
- Nucleus sampling: Holtzman et al., "The Curious Case of Neural Text Degeneration" (http://arxiv.org/abs/1904.09751)
- Repository: Kronos on GitHub
Domains
- Autoregressive_Models: Sequential token-by-token generation.
- Token_Generation: Discrete token sampling from predicted distributions.
- Sampling: Temperature, top-k, and nucleus sampling strategies.
Related Principles
- Principle:Shiyu_coder_Kronos_Model_Loading - The Transformer model used for token prediction.
- Principle:Shiyu_coder_Kronos_Tokenizer_Encoding - Encoding the initial context into tokens.
- Principle:Shiyu_coder_Kronos_Single_Series_Forecasting - The full pipeline that invokes this generation loop.