Implementation:Mlc ai Mlc llm Orion Model

From Leeroopedia


Knowledge Sources
Domains: Machine Learning, Large Language Models, Model Architecture
Last Updated: 2026-02-09 19:00 GMT

Overview

Implements the Orion-14B transformer-based large language model architecture for deployment through the MLC-LLM compilation pipeline using TVM Relax.

Description

This module defines the complete Orion-14B model architecture, including configuration management, attention mechanism with grouped-query attention (GQA), feed-forward network (FFN) with SiLU-gated linear units, and the full decoder-only causal language model. The implementation is built on top of TVM's Relax frontend neural network API and supports PagedKVCache for efficient inference, tensor parallelism for multi-GPU deployment, and rotary position embeddings (RoPE).
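
To make the grouped-query attention idea concrete, the sketch below shows how query heads can share key/value heads and how wide a fused QKV projection would be. It is an illustration only: the helper names `kv_head_for_query_head` and `fused_qkv_width` are hypothetical, and the contiguous grouping of query heads is the conventional GQA layout, assumed here rather than taken from the Orion source.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Return the key/value head index shared by query head `q_head`.

    Assumes contiguous grouping: the first `group_size` query heads use
    KV head 0, the next `group_size` use KV head 1, and so on.
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads  # query heads per KV head
    return q_head // group_size


def fused_qkv_width(num_q_heads: int, num_kv_heads: int, head_dim: int) -> int:
    """Output width of one projection that emits Q, K and V side by side."""
    return (num_q_heads + 2 * num_kv_heads) * head_dim


# Eight query heads sharing two KV heads.
mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]

# With as many KV heads as query heads, GQA reduces to ordinary multi-head attention.
print(fused_qkv_width(num_q_heads=40, num_kv_heads=40, head_dim=128))  # 15360
```

Fewer KV heads shrink both the fused projection and the per-token KV cache footprint, which is the usual motivation for GQA.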

The module contains the following key classes:

  • OrionConfig -- A dataclass-based configuration that reads model hyperparameters such as hidden size, number of attention heads, number of hidden layers, and RoPE theta. It handles fallback logic for context window size and prefill chunk size.
  • OrionFFN -- A feed-forward network using a fused gate-up projection with SiLU activation, followed by a down projection. It supports tensor-parallel splitting of the intermediate dimension.
  • OrionAttention -- Self-attention with grouped-query attention (GQA) support, using a fused QKV projection and PagedKVCache-based attention computation.
  • OrionDecoderLayer -- A single transformer decoder layer combining LayerNorm, self-attention, and FFN with residual connections and optional all-reduce for tensor parallelism.
  • OrionModel -- The transformer backbone consisting of an embedding layer, a stack of decoder layers, and a final LayerNorm.
  • OrionForCasualLM -- The top-level causal language model that wraps OrionModel with a linear LM head and provides the standard inference interface (embed, prefill, decode, batch_prefill, batch_decode, batch_verify).
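
The feed-forward structure described for OrionFFN can be sketched in plain Python. This is a minimal illustration, not the TVM Relax implementation: `gated_ffn`, `matvec`, and the weight layout (gate rows first, up rows second in the fused matrix) are assumptions made for the example.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major, one row per output feature


def silu(v: float) -> float:
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix `w` by vector `x`."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def gated_ffn(x: Vector, w_gate_up: Matrix, w_down: Matrix, intermediate_size: int) -> Vector:
    """SiLU-gated FFN with a fused gate-up projection.

    `w_gate_up` has 2 * intermediate_size rows: the gate rows followed by
    the up rows (an assumed layout for this sketch).
    """
    fused = matvec(w_gate_up, x)              # one matmul yields both halves
    gate = fused[:intermediate_size]
    up = fused[intermediate_size:]
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)             # project back to hidden size


# Tiny example: hidden size 2, intermediate size 1.
x = [1.0, 2.0]
w_gate_up = [[1.0, 0.0],   # gate row -> 1.0
             [1.0, 1.0]]   # up row   -> 3.0
w_down = [[2.0], [0.0]]
out = gated_ffn(x, w_gate_up, w_down, intermediate_size=1)
print(out)
```

Fusing the gate and up projections replaces two matrix multiplications with one larger one. Under tensor parallelism, each shard would hold only a slice of the intermediate dimension.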

Usage

Use this module when compiling and deploying Orion-14B models through the MLC-LLM framework. It is automatically selected by the model loading pipeline when the model configuration matches the Orion architecture. The module supports both single-GPU and multi-GPU tensor-parallel inference with paged KV cache management.

Code Reference

Source Location

Signature

@dataclasses.dataclass
class OrionConfig(ConfigBase):
    hidden_size: int
    intermediate_size: int
    num_attention_heads: int
    num_hidden_layers: int
    rms_norm_eps: float
    vocab_size: int
    position_embedding_base: int = 0
    context_window_size: int = 0
    prefill_chunk_size: int = 0
    num_key_value_heads: int = 0
    head_dim: int = 0
    tensor_parallel_shards: int = 1
    max_batch_size: int = 1
    ...

class OrionFFN(nn.Module):
    def __init__(self, config: OrionConfig): ...
    def forward(self, x: Tensor): ...

class OrionAttention(nn.Module):
    def __init__(self, config: OrionConfig): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class OrionDecoderLayer(nn.Module):
    def __init__(self, config: OrionConfig): ...
    def forward(self, hidden_states: Tensor, paged_kv_cache: PagedKVCache, layer_id: int): ...

class OrionModel(nn.Module):
    def __init__(self, config: OrionConfig): ...
    def forward(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...

class OrionForCasualLM(nn.Module):
    def __init__(self, config: OrionConfig): ...
    def embed(self, input_ids: Tensor): ...
    def prefill(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def decode(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def batch_forward(self, input_embeds: Tensor, paged_kv_cache: PagedKVCache, logit_positions: Optional[Tensor] = None): ...
    def batch_prefill(self, input_embeds: Tensor, logit_positions: Tensor, paged_kv_cache: PagedKVCache): ...
    def batch_decode(self, input_embeds: Tensor, paged_kv_cache: PagedKVCache): ...
    def batch_verify(self, input_embeds: Tensor, paged_kv_cache: PagedKVCache): ...
    def create_paged_kv_cache(self, max_batch_size, max_total_seq_len, prefill_chunk_size, page_size, support_sliding_window) -> PagedKVCache: ...
    def get_default_spec(self): ...

Import

from mlc_llm.model.orion.orion_model import OrionConfig, OrionForCasualLM

I/O Contract

Method | Input | Output | Description
embed | input_ids: Tensor[seq_len] (int32) | Tensor[1, seq_len, hidden_size] | Converts token IDs to embeddings; broadcasts in tensor-parallel mode
prefill | input_embed: Tensor[1, seq_len, hidden_size], paged_kv_cache: PagedKVCache | (logits: Tensor[1, 1, vocab_size], paged_kv_cache) | Processes a full prompt, returns logits for the last token
decode | input_embed: Tensor[1, 1, hidden_size], paged_kv_cache: PagedKVCache | (logits: Tensor[1, 1, vocab_size], paged_kv_cache) | Decodes one token at a time using cached KV state
batch_prefill | input_embeds: Tensor[1, seq_len, hidden_size], logit_positions: Tensor[batch_size], paged_kv_cache | (logits, paged_kv_cache) | Batched prefill with selective logit extraction
batch_decode | input_embeds: Tensor[batch_size, 1, hidden_size], paged_kv_cache | (logits, paged_kv_cache) | Batched single-token decoding
batch_verify | input_embeds: Tensor[1, seq_len, hidden_size], paged_kv_cache | (logits, paged_kv_cache) | Batched speculative verification
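
The "selective logit extraction" in batch_prefill can be pictured as follows. The sketch assumes that the prompts of a batch are concatenated along the seq_len axis and that logit_positions holds the index of each prompt's last token in that flattened axis. Both points are assumptions for illustration, and `last_token_positions` and `select_rows` are hypothetical helpers, not MLC-LLM functions.

```python
from itertools import accumulate
from typing import List, Sequence


def last_token_positions(seq_lens: Sequence[int]) -> List[int]:
    """Index of each sequence's final token after concatenating all sequences."""
    return [end - 1 for end in accumulate(seq_lens)]


def select_rows(hidden_rows: Sequence[Sequence[float]], positions: Sequence[int]) -> List[List[float]]:
    """Keep only the hidden-state rows that need logits."""
    return [list(hidden_rows[p]) for p in positions]


# Three prompts of lengths 3, 2 and 4 flattened into 9 token rows.
seq_lens = [3, 2, 4]
positions = last_token_positions(seq_lens)
print(positions)  # [2, 4, 8]

hidden = [[float(i)] for i in range(sum(seq_lens))]  # stand-in hidden states
print(select_rows(hidden, positions))  # [[2.0], [4.0], [8.0]]
```

Selecting rows before the LM head means the vocabulary-sized projection runs once per sequence rather than once per prompt token.
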

Configuration Field | Type | Description
hidden_size | int | Dimensionality of the hidden representations
intermediate_size | int | Dimensionality of the FFN intermediate layer
num_attention_heads | int | Number of query attention heads
num_hidden_layers | int | Number of transformer decoder layers
rms_norm_eps | float | Epsilon for the normalization layers (named for RMS norm, but applied through LayerNorm)
vocab_size | int | Size of the token vocabulary
num_key_value_heads | int | Number of key/value heads for GQA (the declared default of 0 resolves to num_attention_heads)
position_embedding_base | int | Base frequency for RoPE (the declared default of 0 resolves to 10000)
tensor_parallel_shards | int | Number of GPUs for tensor parallelism (defaults to 1)
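
The way zero-valued fields resolve to effective defaults can be sketched with a small stand-alone dataclass. `SketchConfig` is a hypothetical stand-in, not OrionConfig. The fallbacks for num_key_value_heads and position_embedding_base follow the table above, while the head_dim and prefill_chunk_size rules are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in that mimics zero-means-unset fallback handling."""

    hidden_size: int
    num_attention_heads: int
    context_window_size: int = 0
    prefill_chunk_size: int = 0
    num_key_value_heads: int = 0
    head_dim: int = 0
    position_embedding_base: int = 0

    def __post_init__(self) -> None:
        if self.num_key_value_heads == 0:
            # No GQA setting given: one KV head per query head.
            self.num_key_value_heads = self.num_attention_heads
        if self.position_embedding_base == 0:
            self.position_embedding_base = 10000
        if self.head_dim == 0:
            # Assumption: heads evenly partition the hidden dimension.
            self.head_dim = self.hidden_size // self.num_attention_heads
        if self.context_window_size == 0:
            raise ValueError("context_window_size must be provided in this sketch")
        if self.prefill_chunk_size == 0:
            # Assumption: prefill a whole context window at once by default.
            self.prefill_chunk_size = self.context_window_size


cfg = SketchConfig(hidden_size=5120, num_attention_heads=40, context_window_size=4096)
print(cfg.num_key_value_heads, cfg.head_dim, cfg.position_embedding_base, cfg.prefill_chunk_size)
# 40 128 10000 4096
```

Using 0 as an "unset" sentinel keeps every field a plain int while still letting the configuration derive values from other fields.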

Usage Examples

from mlc_llm.model.orion.orion_model import OrionConfig, OrionForCasualLM

# Instantiate the Orion model configuration
config = OrionConfig(
    hidden_size=5120,
    intermediate_size=15360,
    num_attention_heads=40,
    num_hidden_layers=40,
    rms_norm_eps=1e-5,
    vocab_size=84608,
    num_key_value_heads=40,
    context_window_size=4096,
)

# Create the causal LM model
model = OrionForCasualLM(config)

# Convert to a specific dtype
model.to("float16")

# Get the default module specification for compilation
spec = model.get_default_spec()
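
When tensor_parallel_shards is greater than 1, the head counts and the FFN intermediate dimension are divided across GPUs. The arithmetic below is a rough sketch using the values from the example configuration. `per_shard_sizes` is a hypothetical helper, and the even-divisibility requirement is an assumption of this sketch rather than a documented constraint.

```python
from typing import Dict


def per_shard_sizes(
    num_attention_heads: int,
    num_key_value_heads: int,
    intermediate_size: int,
    tensor_parallel_shards: int,
) -> Dict[str, int]:
    """Split head counts and the FFN width evenly across shards."""
    totals = {
        "q_heads": num_attention_heads,
        "kv_heads": num_key_value_heads,
        "ffn_intermediate": intermediate_size,
    }
    for name, total in totals.items():
        if total % tensor_parallel_shards != 0:
            raise ValueError(f"{name}={total} is not divisible by {tensor_parallel_shards} shards")
    return {name: total // tensor_parallel_shards for name, total in totals.items()}


sizes = per_shard_sizes(
    num_attention_heads=40,
    num_key_value_heads=40,
    intermediate_size=15360,
    tensor_parallel_shards=4,
)
print(sizes)  # {'q_heads': 10, 'kv_heads': 10, 'ffn_intermediate': 3840}
```

A shard count that does not divide the head count (for example 3 with 40 heads) raises an error in this sketch, so shard counts are best chosen from the divisors of the head counts.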
