Implementation:Mlc ai Mlc llm Orion Model

Knowledge Sources	Mlc_ai_Mlc_llm
Domains	Machine Learning, Large Language Models, Model Architecture
Last Updated	2026-02-09 19:00 GMT

Overview

Implements the Orion-14B transformer-based large language model architecture for deployment through the MLC-LLM compilation pipeline using TVM Relax.

Description

This module defines the complete Orion-14B model architecture, including configuration management, attention mechanism with grouped-query attention (GQA), feed-forward network (FFN) with SiLU-gated linear units, and the full decoder-only causal language model. The implementation is built on top of TVM's Relax frontend neural network API and supports PagedKVCache for efficient inference, tensor parallelism for multi-GPU deployment, and rotary position embeddings (RoPE).
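
The SiLU-gated feed-forward path mentioned above can be illustrated with a minimal pure-Python sketch. This is not the module's code: the helper names (silu, matvec, gated_ffn), the toy weights, and the assumption that the gate half precedes the up half in the fused projection are all illustrative.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(weight, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def gated_ffn(x, gate_up_weight, down_weight):
    """Fused gate-up projection, SiLU gating, then down projection.

    gate_up_weight has 2 * intermediate rows. Assumption: the first half
    holds the gate projection and the second half the up projection.
    """
    fused = matvec(gate_up_weight, x)
    intermediate = len(fused) // 2
    gate, up = fused[:intermediate], fused[intermediate:]
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(down_weight, hidden)


# Toy sizes (hypothetical): hidden_size=2, intermediate_size=2.
x = [1.0, 2.0]
gate_up_weight = [
    [0.0, 0.0],  # gate row 0: gate = 0, so silu(gate) = 0
    [1.0, 0.0],  # gate row 1: gate = 1
    [1.0, 1.0],  # up row 0: up = 3
    [0.0, 1.0],  # up row 1: up = 2
]
down_weight = [[1.0, 1.0], [1.0, -1.0]]
ffn_out = gated_ffn(x, gate_up_weight, down_weight)
```

Fusing the gate and up projections into one matrix means a single matrix multiply produces both halves, which are then split before gating.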

The module contains the following key classes:

OrionConfig -- A dataclass-based configuration that reads model hyperparameters such as hidden size, number of attention heads, number of hidden layers, and RoPE theta. It handles fallback logic for context window size and prefill chunk size.
OrionFFN -- A feed-forward network using a fused gate-up projection with SiLU activation, followed by a down projection. It supports tensor-parallel splitting of the intermediate dimension.
OrionAttention -- Multi-head attention with grouped-query attention (GQA) support via fused QKV projection and PagedKVCache-based attention computation.
OrionDecoderLayer -- A single transformer decoder layer combining LayerNorm, self-attention, and FFN with residual connections and optional all-reduce for tensor parallelism.
OrionModel -- The transformer backbone consisting of an embedding layer, a stack of decoder layers, and a final LayerNorm.
OrionForCasualLM -- The top-level causal language model that wraps OrionModel with a linear LM head and provides the standard inference interface (embed, prefill, decode, batch_prefill, batch_decode, batch_verify).
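
The fallback behavior attributed to OrionConfig above can be sketched as a standalone function. This is a hedged illustration rather than the module's actual logic: the function name resolve_sizes, the fallback key names, and the 8192 cap on the prefill chunk size are assumptions chosen for the example.

```python
from typing import Any, Dict, Tuple


def resolve_sizes(
    context_window_size: int,
    prefill_chunk_size: int,
    extra_config: Dict[str, Any],
    fallback_keys: Tuple[str, ...] = ("max_position_embeddings", "max_sequence_length"),
    chunk_cap: int = 8192,
) -> Tuple[int, int]:
    """Resolve zero-valued (unset) sizes to usable values.

    fallback_keys and chunk_cap are hypothetical defaults for this sketch.
    """
    if context_window_size == 0:
        # Look for a context length under alternative key names.
        for key in fallback_keys:
            if key in extra_config:
                context_window_size = extra_config[key]
                break
        else:
            raise ValueError("Unable to determine the maximum sequence length")
    # An unset or oversized chunk size is clamped to the context window.
    if prefill_chunk_size == 0 or prefill_chunk_size > context_window_size:
        prefill_chunk_size = min(context_window_size, chunk_cap)
    return context_window_size, prefill_chunk_size


from_extra = resolve_sizes(0, 0, {"max_position_embeddings": 4096})
clamped = resolve_sizes(4096, 8192, {})
explicit = resolve_sizes(4096, 2048, {})
```

Treating 0 as "unset" lets the same dataclass accept both fully specified configurations and ones that rely on values carried in the original model config.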

Usage

Use this module when compiling and deploying Orion-14B models through the MLC-LLM framework. It is automatically selected by the model loading pipeline when the model configuration matches the Orion architecture. The module supports both single-GPU and multi-GPU tensor-parallel inference with paged KV cache management.

Code Reference

Source Location

Repository: Mlc_ai_Mlc_llm
File: python/mlc_llm/model/orion/orion_model.py

Signature

@dataclasses.dataclass
class OrionConfig(ConfigBase):
    hidden_size: int
    intermediate_size: int
    num_attention_heads: int
    num_hidden_layers: int
    rms_norm_eps: float
    vocab_size: int
    position_embedding_base: int = 0
    context_window_size: int = 0
    prefill_chunk_size: int = 0
    num_key_value_heads: int = 0
    head_dim: int = 0
    tensor_parallel_shards: int = 1
    max_batch_size: int = 1
    ...

class OrionFFN(nn.Module):
    def __init__(self, config: OrionConfig): ...
    def forward(self, x: Tensor): ...

class OrionAttention(nn.Module):
    def __init__(self, config: OrionConfig): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class OrionDecoderLayer(nn.Module):
    def __init__(self, config: OrionConfig): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class OrionModel(nn.Module):
    def __init__(self, config: OrionConfig): ...
    def forward(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...

class OrionForCasualLM(nn.Module):
    def __init__(self, config: OrionConfig): ...
    def embed(self, input_ids: Tensor): ...
    def prefill(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def decode(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def batch_forward(self, input_embeds: Tensor, paged_kv_cache: PagedKVCache, logit_positions: Optional[Tensor] = None): ...
    def batch_prefill(self, input_embeds: Tensor, logit_positions: Tensor, paged_kv_cache: PagedKVCache): ...
    def batch_decode(self, input_embeds: Tensor, paged_kv_cache: PagedKVCache): ...
    def batch_verify(self, input_embeds: Tensor, paged_kv_cache: PagedKVCache): ...
    def create_paged_kv_cache(self, max_batch_size, max_total_seq_len, prefill_chunk_size, page_size, support_sliding_window) -> PagedKVCache: ...
    def get_default_spec(self): ...

Import

from mlc_llm.model.orion.orion_model import OrionConfig, OrionForCasualLM

I/O Contract

Method	Input	Output	Description
embed	input_ids: Tensor[seq_len] (int32)	Tensor[1, seq_len, hidden_size]	Converts token IDs to embeddings; broadcasts in tensor-parallel mode
prefill	input_embed: Tensor[1, seq_len, hidden_size], paged_kv_cache: PagedKVCache	(logits: Tensor[1, 1, vocab_size], paged_kv_cache)	Processes a full prompt and returns logits for the last token only
decode	input_embed: Tensor[1, 1, hidden_size], paged_kv_cache: PagedKVCache	(logits: Tensor[1, 1, vocab_size], paged_kv_cache)	Decodes one token at a time using cached KV state
batch_prefill	input_embeds: Tensor[1, seq_len, hidden_size], logit_positions: Tensor[batch_size], paged_kv_cache	(logits, paged_kv_cache)	Batched prefill with selective logit extraction
batch_decode	input_embeds: Tensor[batch_size, 1, hidden_size], paged_kv_cache	(logits, paged_kv_cache)	Batched single-token decoding
batch_verify	input_embeds: Tensor[1, seq_len, hidden_size], paged_kv_cache	(logits, paged_kv_cache)	Batched speculative verification

Configuration Field	Type	Description
hidden_size	int	Dimensionality of the hidden representations
intermediate_size	int	Dimensionality of the FFN intermediate layer
num_attention_heads	int	Number of query attention heads
num_hidden_layers	int	Number of transformer decoder layers
rms_norm_eps	float	Epsilon for the normalization layers (the field keeps the RMSNorm-style name, but the value is passed to LayerNorm)
vocab_size	int	Size of the token vocabulary
num_key_value_heads	int	Number of key/value heads for GQA (the declared default of 0 means unset and resolves to num_attention_heads)
position_embedding_base	int	Base frequency (theta) for RoPE (the declared default of 0 means unset and resolves to 10000)
tensor_parallel_shards	int	Number of GPUs for tensor parallelism
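
To show how these fields interact, the following sketch derives per-shard layer widths from a configuration. It is an illustration under stated assumptions, not the module's code: the function per_shard_shapes and its result keys are hypothetical, head_dim is assumed to default to hidden_size // num_attention_heads when left at 0, and every sharded dimension is assumed to divide evenly across shards.

```python
from typing import Dict


def per_shard_shapes(
    hidden_size: int,
    intermediate_size: int,
    num_attention_heads: int,
    num_key_value_heads: int = 0,
    head_dim: int = 0,
    tensor_parallel_shards: int = 1,
) -> Dict[str, int]:
    """Derive per-shard projection widths (hypothetical helper)."""
    if num_key_value_heads == 0:
        # Unset: plain multi-head attention, one KV head per query head.
        num_key_value_heads = num_attention_heads
    if head_dim == 0:
        # Assumption: head_dim is derived from the hidden size when unset.
        head_dim = hidden_size // num_attention_heads
    for name, value in (
        ("num_attention_heads", num_attention_heads),
        ("num_key_value_heads", num_key_value_heads),
        ("intermediate_size", intermediate_size),
    ):
        if value % tensor_parallel_shards != 0:
            raise ValueError(f"{name} must be divisible by tensor_parallel_shards")
    q_heads = num_attention_heads // tensor_parallel_shards
    kv_heads = num_key_value_heads // tensor_parallel_shards
    return {
        "head_dim": head_dim,
        "q_heads_per_shard": q_heads,
        "kv_heads_per_shard": kv_heads,
        # Fused QKV projection: query heads plus separate key and value heads.
        "qkv_out_features_per_shard": (q_heads + 2 * kv_heads) * head_dim,
        # Fused gate-up projection: two intermediate-sized halves.
        "gate_up_out_features_per_shard": 2 * (intermediate_size // tensor_parallel_shards),
    }


shapes = per_shard_shapes(
    hidden_size=5120,
    intermediate_size=15360,
    num_attention_heads=40,
    tensor_parallel_shards=2,
)
```

With fewer key/value heads than query heads, the fused QKV width shrinks while the query width stays the same, which is where GQA reduces KV cache size.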

Usage Examples

# Import the configuration and model classes
from mlc_llm.model.orion.orion_model import OrionConfig, OrionForCasualLM

# Instantiate the Orion model configuration
config = OrionConfig(
    hidden_size=5120,
    intermediate_size=15360,
    num_attention_heads=40,
    num_hidden_layers=40,
    rms_norm_eps=1e-5,
    vocab_size=84608,
    num_key_value_heads=40,
    context_window_size=4096,
)

# Create the causal LM model
model = OrionForCasualLM(config)

# Cast the model parameters to float16
model.to("float16")

# Get the default module specification for compilation
spec = model.get_default_spec()
