Implementation:Mlc ai Mlc llm Orion Model
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Large Language Models, Model Architecture |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Implements the Orion-14B transformer-based large language model architecture for deployment through the MLC-LLM compilation pipeline using TVM Relax.
Description
This module defines the complete Orion-14B model architecture, including configuration management, attention mechanism with grouped-query attention (GQA), feed-forward network (FFN) with SiLU-gated linear units, and the full decoder-only causal language model. The implementation is built on top of TVM's Relax frontend neural network API and supports PagedKVCache for efficient inference, tensor parallelism for multi-GPU deployment, and rotary position embeddings (RoPE).
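As an illustration of the grouped-query attention mentioned above, the following is a minimal NumPy sketch, not the TVM Relax code from this module. The function name `gqa_attention` and the toy shapes are assumptions made for the example; the actual module delegates the attention computation to PagedKVCache.

```python
import numpy as np

def gqa_attention(q, k, v):
    """Causal grouped-query attention for a single sequence (illustrative only).

    q:    (seq_len, num_q_heads, head_dim)
    k, v: (seq_len, num_kv_heads, head_dim), with num_q_heads % num_kv_heads == 0
    """
    seq_len, num_q_heads, head_dim = q.shape
    num_kv_heads = k.shape[1]
    assert num_q_heads % num_kv_heads == 0
    group = num_q_heads // num_kv_heads
    # Each key/value head is shared by `group` consecutive query heads.
    k = np.repeat(k, group, axis=1)
    v = np.repeat(v, group, axis=1)
    # Scaled dot-product scores with shape (heads, query_pos, key_pos).
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(head_dim)
    # Causal mask: a query position may not attend to later key positions.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over key positions.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,khd->qhd", weights, v)

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8, 16))   # 8 query heads
k = rng.standard_normal((4, 2, 16))   # 2 key/value heads
v = rng.standard_normal((4, 2, 16))
out = gqa_attention(q, k, v)
print(out.shape)
```

When num_key_value_heads equals num_attention_heads, the group size is 1 and this reduces to ordinary multi-head attention.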
The module contains the following key classes:
- OrionConfig -- A dataclass-based configuration that reads model hyperparameters such as hidden size, number of attention heads, number of hidden layers, and RoPE theta. It handles fallback logic for context window size and prefill chunk size.
- OrionFFN -- A feed-forward network using a fused gate-up projection with SiLU activation, followed by a down projection. It supports tensor-parallel splitting of the intermediate dimension.
- OrionAttention -- Self-attention with grouped-query attention (GQA) support, using a fused QKV projection and PagedKVCache-based attention computation.
- OrionDecoderLayer -- A single transformer decoder layer combining LayerNorm, self-attention, and FFN with residual connections and optional all-reduce for tensor parallelism.
- OrionModel -- The transformer backbone consisting of an embedding layer, a stack of decoder layers, and a final LayerNorm.
- OrionForCasualLM -- The top-level causal language model that wraps OrionModel with a linear LM head and provides the standard inference interface (embed, prefill, decode, batch_prefill, batch_decode, batch_verify).
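The zero-valued defaults in OrionConfig act as "not set" markers that are resolved after construction. The sketch below shows the general pattern with a hypothetical `SketchConfig` class; it is not the module's code. The head_dim and prefill_chunk_size rules are assumptions for illustration, and the context-window fallback is left out because this page does not state its rule.

```python
import dataclasses

@dataclasses.dataclass
class SketchConfig:
    """Hypothetical stand-in for OrionConfig; a value of 0 means 'not set'."""
    hidden_size: int
    num_attention_heads: int
    context_window_size: int  # required here; the real fallback is not modeled
    prefill_chunk_size: int = 0
    num_key_value_heads: int = 0
    head_dim: int = 0
    position_embedding_base: int = 0

    def __post_init__(self):
        if self.num_key_value_heads == 0:
            # Documented default: as many KV heads as query heads.
            self.num_key_value_heads = self.num_attention_heads
        if self.head_dim == 0:
            # Assumed rule: split the hidden size evenly across query heads.
            self.head_dim = self.hidden_size // self.num_attention_heads
        if self.position_embedding_base == 0:
            # Documented default RoPE base.
            self.position_embedding_base = 10000
        if self.prefill_chunk_size == 0:
            # Assumed rule: default the chunk size to the context window.
            self.prefill_chunk_size = self.context_window_size

cfg = SketchConfig(hidden_size=5120, num_attention_heads=40, context_window_size=4096)
print(cfg.num_key_value_heads, cfg.head_dim, cfg.prefill_chunk_size)
```

Resolving defaults in `__post_init__` keeps the rest of the model code free of "is this field set?" checks.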
Usage
Use this module when compiling and deploying Orion-14B models through the MLC-LLM framework. It is automatically selected by the model loading pipeline when the model configuration matches the Orion architecture. The module supports both single-GPU and multi-GPU tensor-parallel inference with paged KV cache management.
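To make the tensor-parallel behaviour of the feed-forward network concrete, here is a small NumPy sketch under stated assumptions: the gate and up projections are kept as separate matrices (the module fuses them into one projection), SiLU is assumed to be applied to the gate half, and a plain Python `sum` stands in for the all-reduce across GPUs. All function names are hypothetical.

```python
import numpy as np

def silu(x):
    # SiLU activation: x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def gated_ffn(x, w_gate, w_up, w_down):
    """SiLU-gated FFN: down(silu(gate(x)) * up(x))."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def sharded_gated_ffn(x, w_gate, w_up, w_down, num_shards):
    """Split the intermediate dimension across shards and sum the partial outputs."""
    gate_shards = np.split(w_gate, num_shards, axis=1)   # column slices
    up_shards = np.split(w_up, num_shards, axis=1)       # matching column slices
    down_shards = np.split(w_down, num_shards, axis=0)   # matching row slices
    partials = [
        gated_ffn(x, g, u, d)
        for g, u, d in zip(gate_shards, up_shards, down_shards)
    ]
    # The sum plays the role of the all-reduce in a multi-GPU run.
    return sum(partials)

rng = np.random.default_rng(0)
hidden, intermediate = 8, 24
x = rng.standard_normal((3, hidden))
w_gate = rng.standard_normal((hidden, intermediate))
w_up = rng.standard_normal((hidden, intermediate))
w_down = rng.standard_normal((intermediate, hidden))

full = gated_ffn(x, w_gate, w_up, w_down)
sharded = sharded_gated_ffn(x, w_gate, w_up, w_down, num_shards=4)
print(np.allclose(full, sharded))
```

Because the activation and the gating product are elementwise over the intermediate dimension, each shard can work on its slice independently, and only the final down projection needs a cross-shard reduction.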
Code Reference
Source Location
- Repository: Mlc_ai_Mlc_llm
- File: python/mlc_llm/model/orion/orion_model.py
Signature
@dataclasses.dataclass
class OrionConfig(ConfigBase):
    hidden_size: int
    intermediate_size: int
    num_attention_heads: int
    num_hidden_layers: int
    rms_norm_eps: float
    vocab_size: int
    position_embedding_base: int = 0
    context_window_size: int = 0
    prefill_chunk_size: int = 0
    num_key_value_heads: int = 0
    head_dim: int = 0
    tensor_parallel_shards: int = 1
    max_batch_size: int = 1
    ...

class OrionFFN(nn.Module):
    def __init__(self, config: OrionConfig): ...
    def forward(self, x: Tensor): ...

class OrionAttention(nn.Module):
    def __init__(self, config: OrionConfig): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class OrionDecoderLayer(nn.Module):
    def __init__(self, config: OrionConfig): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class OrionModel(nn.Module):
    def __init__(self, config: OrionConfig): ...
    def forward(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...

class OrionForCasualLM(nn.Module):
    def __init__(self, config: OrionConfig): ...
    def embed(self, input_ids: Tensor): ...
    def prefill(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def decode(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def batch_forward(self, input_embeds: Tensor, paged_kv_cache: PagedKVCache, logit_positions: Optional[Tensor] = None): ...
    def batch_prefill(self, input_embeds: Tensor, logit_positions: Tensor, paged_kv_cache: PagedKVCache): ...
    def batch_decode(self, input_embeds: Tensor, paged_kv_cache: PagedKVCache): ...
    def batch_verify(self, input_embeds: Tensor, paged_kv_cache: PagedKVCache): ...
    def create_paged_kv_cache(self, max_batch_size, max_total_seq_len, prefill_chunk_size, page_size, support_sliding_window) -> PagedKVCache: ...
    def get_default_spec(self): ...
Import
from mlc_llm.model.orion.orion_model import OrionConfig, OrionForCasualLM
I/O Contract
| Method | Input | Output | Description |
|---|---|---|---|
| embed | input_ids: Tensor[seq_len] (int32) | Tensor[1, seq_len, hidden_size] | Converts token IDs to embeddings; broadcasts in tensor-parallel mode |
| prefill | input_embed: Tensor[1, seq_len, hidden_size], paged_kv_cache: PagedKVCache | (logits: Tensor[1, 1, vocab_size], paged_kv_cache) | Processes a full prompt, returns logits for the last token |
| decode | input_embed: Tensor[1, 1, hidden_size], paged_kv_cache: PagedKVCache | (logits: Tensor[1, 1, vocab_size], paged_kv_cache) | Decodes one token at a time using cached KV state |
| batch_prefill | input_embeds: Tensor[1, seq_len, hidden_size], logit_positions: Tensor[batch_size], paged_kv_cache | (logits, paged_kv_cache) | Batched prefill with selective logit extraction |
| batch_decode | input_embeds: Tensor[batch_size, 1, hidden_size], paged_kv_cache | (logits, paged_kv_cache) | Batched single-token decoding |
| batch_verify | input_embeds: Tensor[1, seq_len, hidden_size], paged_kv_cache | (logits, paged_kv_cache) | Batched speculative verification |
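The "selective logit extraction" in batch_prefill can be pictured with the NumPy sketch below. It is an illustration under assumptions, not the module's Relax code: the prompt lengths, the random `lm_head` matrix, and the resulting logits shape of (1, batch_size, vocab_size) are chosen for the example and are not stated in the table above.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden_size, vocab_size = 10, 16, 32

# Hidden states for three prompts of lengths 3, 5, and 2, packed along one axis.
hidden_states = rng.standard_normal((1, seq_len, hidden_size))
lm_head = rng.standard_normal((hidden_size, vocab_size))

# Index of the last token of each prompt within the packed sequence.
logit_positions = np.array([2, 7, 9], dtype=np.int32)

# Gather only the rows that need logits, then apply the LM head.
selected = np.take(hidden_states, logit_positions, axis=1)  # (1, 3, hidden_size)
logits = selected @ lm_head                                  # (1, 3, vocab_size)

# Same values as projecting every position and indexing afterwards,
# but the matrix multiply touches 3 rows instead of 10.
all_logits = hidden_states @ lm_head
assert np.allclose(logits, all_logits[:, logit_positions, :])
print(logits.shape)
```

Gathering before the projection matters because the LM head is the largest single matrix multiply per token when the vocabulary is large.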

| Configuration Field | Type | Description |
|---|---|---|
| hidden_size | int | Dimensionality of the hidden representations |
| intermediate_size | int | Dimensionality of the FFN intermediate layer |
| num_attention_heads | int | Number of query attention heads |
| num_hidden_layers | int | Number of transformer decoder layers |
| rms_norm_eps | float | Epsilon used by the LayerNorm layers, despite the RMS-style field name |
| vocab_size | int | Size of the token vocabulary |
| num_key_value_heads | int | Number of key/value heads for GQA (the declared default of 0 falls back to num_attention_heads) |
| position_embedding_base | int | Base frequency (theta) for RoPE (the declared default of 0 is resolved to 10000 when no value is supplied) |
| tensor_parallel_shards | int | Number of GPUs for tensor parallelism |
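Several of these fields together determine how much memory the KV cache needs. The following back-of-the-envelope sketch uses the values from the usage example on this page and assumes float16 storage (2 bytes per element) and a head dimension of hidden_size // num_attention_heads; it ignores any paging overhead, and the helper name is hypothetical.

```python
def kv_cache_bytes_per_token(num_hidden_layers, num_key_value_heads, head_dim,
                             bytes_per_element=2):
    # Factor 2: one key vector and one value vector per head, per layer.
    return 2 * num_hidden_layers * num_key_value_heads * head_dim * bytes_per_element

hidden_size, num_attention_heads = 5120, 40
head_dim = hidden_size // num_attention_heads  # assumed derivation of head_dim

per_token = kv_cache_bytes_per_token(
    num_hidden_layers=40, num_key_value_heads=40, head_dim=head_dim
)
full_context_gib = per_token * 4096 / 2**30

print(per_token)         # 819200 bytes, i.e. 800 KiB per token
print(full_context_gib)  # 3.125 GiB for a full 4096-token context
```

Lowering num_key_value_heads shrinks this figure proportionally, which is the main practical benefit of GQA at inference time.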
Usage Examples
from mlc_llm.model.orion.orion_model import OrionConfig, OrionForCasualLM

# Instantiate the Orion model configuration
config = OrionConfig(
    hidden_size=5120,
    intermediate_size=15360,
    num_attention_heads=40,
    num_hidden_layers=40,
    rms_norm_eps=1e-5,
    vocab_size=84608,
    num_key_value_heads=40,
    context_window_size=4096,
)

# Create the causal LM model
model = OrionForCasualLM(config)

# Convert the model parameters to a specific dtype
model.to("float16")

# Get the default module specification for compilation
spec = model.get_default_spec()