Implementation:Mlc ai Mlc llm GPT BigCode Model

Knowledge Sources	Mlc_ai_Mlc_llm
Domains	Model_Architecture, LLM
Last Updated	2026-02-09 19:00 GMT

Overview

Implements the GPTBigCode architecture for causal language modeling within the MLC LLM framework, featuring multi-query attention and learned positional embeddings.

Description

This module provides the TVM Relax-based implementation of the GPTBigCode model architecture (used in models like StarCoder). Unlike most modern transformer architectures in MLC LLM, GPTBigCode has several distinctive characteristics:

Multi-query attention (MQA): Uses a single key-value head (num_kv_heads = 1) shared across all query heads, significantly reducing KV cache memory requirements.
Learned positional embeddings: Uses a separate nn.Embedding layer (wpe) for position embeddings instead of rotary position embeddings (RoPE). The KV cache is created with RopeMode.NONE.
LayerNorm: Uses standard LayerNorm instead of the RMSNorm common in newer architectures.
Biased projections: All linear layers (attention and MLP) include bias terms.
GeLU activation: The MLP uses standard GeLU activation without gating.
No tensor parallelism for attention: The attention module explicitly asserts that tensor_parallel_shards == 1, due to the single KV head design.
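The memory effect of multi-query attention can be illustrated with simple arithmetic. The sketch below is not part of the module: `kv_cache_bytes` is a hypothetical helper, the 2-byte element size assumes float16 storage, and paging overhead is ignored. The dimensions are borrowed from the example configuration on this page.

```python
def kv_cache_bytes(n_layer, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate bytes needed to cache keys and values for one sequence.

    Hypothetical helper for illustration; assumes float16 (2 bytes per element)
    and ignores any paging or alignment overhead.
    """
    # The leading 2 accounts for storing both K and V in every layer.
    return 2 * n_layer * num_kv_heads * head_dim * seq_len * bytes_per_elem


n_embd, n_head, n_layer, seq_len = 2048, 16, 24, 2048
head_dim = n_embd // n_head

# One KV head per query head (conventional multi-head attention).
mha_bytes = kv_cache_bytes(n_layer, n_head, head_dim, seq_len)
# A single KV head shared by all query heads (multi-query attention).
mqa_bytes = kv_cache_bytes(n_layer, 1, head_dim, seq_len)

print("multi-head KV cache (MiB):", mha_bytes // 2**20)
print("multi-query KV cache (MiB):", mqa_bytes // 2**20)
print("reduction factor:", mha_bytes // mqa_bytes)
```

Under these assumptions the reduction factor equals the number of query heads, since only the KV head count differs between the two cases.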

The model consists of GPTBigCodeModel (token and position embeddings, transformer blocks, and a final LayerNorm), wrapped by GPTBigCodeForCausalLM, which adds the language modeling head and the inference methods.
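The embedding stage of GPTBigCodeModel adds a learned position vector to each token vector. A minimal pure-Python sketch of that sum follows. It is an illustration only: `embed_with_positions`, the `start_pos` parameter, and the toy tables are assumptions, and the real module operates on TVM Relax tensors with positions supplied by the KV cache.

```python
def embed_with_positions(token_ids, wte, wpe, start_pos=0):
    """Sum token embeddings (wte) and learned position embeddings (wpe).

    Hypothetical helper: wte and wpe are plain lists of rows here, and
    start_pos is the absolute position of the first token in token_ids.
    """
    hidden = []
    for offset, token_id in enumerate(token_ids):
        pos = start_pos + offset
        if pos >= len(wpe):
            # Learned tables have a fixed number of rows (n_positions).
            raise IndexError(f"position {pos} exceeds the {len(wpe)} learned positions")
        hidden.append([t + p for t, p in zip(wte[token_id], wpe[pos])])
    return hidden


# Toy tables: vocab_size=3, n_positions=4, n_embd=2.
wte = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
wpe = [[0.0, 0.0], [0.25, 0.25], [0.5, 0.5], [0.75, 0.75]]

out = embed_with_positions([2, 1], wte, wpe, start_pos=1)
print(out)
```

Unlike rotary embeddings, a learned table cannot represent positions beyond its last row, which is why the sketch rejects out-of-range positions.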

Usage

Use this module when compiling GPTBigCode-family models (e.g., StarCoder, SantaCoder) for deployment with MLC LLM. The model is identified by the gpt_bigcode model type in configuration files.
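As a rough illustration of the model-type check, the sketch below inspects a configuration payload. `is_gpt_bigcode` is a hypothetical helper, and the `model_type` key is assumed to follow the Hugging Face `config.json` convention. MLC LLM performs its own model detection, which this does not reproduce.

```python
import json


def is_gpt_bigcode(config_text: str) -> bool:
    """Return True if a JSON config payload declares the gpt_bigcode model type.

    Hypothetical helper; assumes the type is stored under the "model_type" key.
    """
    config = json.loads(config_text)
    return config.get("model_type") == "gpt_bigcode"


example = '{"model_type": "gpt_bigcode", "n_embd": 2048, "n_head": 16}'
print(is_gpt_bigcode(example))
```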

Code Reference

Source Location

Repository: Mlc_ai_Mlc_llm
File: python/mlc_llm/model/gpt_bigcode/gpt_bigcode_model.py

Signature

@dataclasses.dataclass
class GPTBigCodeConfig(ConfigBase):
    n_embd: int
    n_inner: int
    n_head: int
    n_layer: int
    n_positions: int
    layer_norm_epsilon: float
    vocab_size: int
    context_window_size: int = 0
    prefill_chunk_size: int = 0
    tensor_parallel_shards: int = 1
    max_batch_size: int = 1
    ...

class GPTBigCodeForCausalLM(nn.Module):
    def __init__(self, config: GPTBigCodeConfig): ...
    def embed(self, input_ids: Tensor): ...
    def prefill(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def decode(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def batch_prefill(self, input_embeds, logit_positions, paged_kv_cache): ...
    def batch_decode(self, input_embeds, paged_kv_cache): ...
    def batch_verify(self, input_embeds, paged_kv_cache): ...
    def create_paged_kv_cache(self, ...): ...
    def get_default_spec(self): ...

Import

from mlc_llm.model.gpt_bigcode.gpt_bigcode_model import GPTBigCodeConfig, GPTBigCodeForCausalLM

I/O Contract

Primary Classes

Class	Role	Key Characteristics
GPTBigCodeConfig	Model configuration	Uses GPT-2 style naming (n_embd, n_inner, n_head, n_layer, n_positions)
GPTBigCodeMLP	Feed-forward network	Standard GeLU activation, c_fc/c_proj with bias
GPTBigCodeAttention	Multi-query attention	Single KV head, fused QKV projection (c_attn), biased projections
GPTBigCodeBlock	Transformer block	Pre-norm with LayerNorm (ln_1, ln_2)
GPTBigCodeModel	Core model	Token embeddings (wte) + position embeddings (wpe) + blocks + final LayerNorm
GPTBigCodeForCausalLM	Top-level model	Adds lm_head for causal LM
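The fused QKV projection listed for GPTBigCodeAttention can be pictured as one output row that is sliced into query, key, and value parts. The sketch below assumes the features are laid out as all query heads, then the key head, then the value head. `split_fused_qkv` is a hypothetical helper, not a function from the module.

```python
def split_fused_qkv(row, n_head, head_dim, num_kv_heads=1):
    """Split one fused projection output row into query, key, and value slices.

    Hypothetical helper; assumes the layout [Q heads | K heads | V heads].
    """
    q_width = n_head * head_dim
    kv_width = num_kv_heads * head_dim
    expected = q_width + 2 * kv_width
    if len(row) != expected:
        raise ValueError(f"expected {expected} features, got {len(row)}")
    q = row[:q_width]
    k = row[q_width:q_width + kv_width]
    v = row[q_width + kv_width:]
    return q, k, v


n_head, head_dim = 4, 2
# With one KV head, the fused width is (n_head + 2) * head_dim.
fused = list(range((n_head + 2) * head_dim))
q, k, v = split_fused_qkv(fused, n_head, head_dim)
print(len(q), len(k), len(v))
```

With a single KV head, the key and value slices are each only `head_dim` wide, so the fused projection is far narrower than the `3 * n_embd` width of a conventional multi-head layout.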

Forward Methods

Method	Input	Output
`embed`	Tensor[seq_len] (int32)	Tensor[1, seq_len, n_embd]
`prefill`	Tensor[1, seq_len, n_embd], PagedKVCache	(Tensor[1, 1, vocab_size], PagedKVCache)
`decode`	Tensor[1, 1, n_embd], PagedKVCache	(Tensor[1, 1, vocab_size], PagedKVCache)
`batch_prefill`	Tensor[1, seq_len, n_embd], Tensor[batch_size], PagedKVCache	(Tensor, PagedKVCache)
`batch_decode`	Tensor[batch_size, 1, n_embd], PagedKVCache	(Tensor, PagedKVCache)

Configuration Mapping

GPTBigCode Field	Standard Equivalent	Description
`n_embd`	hidden_size	Embedding dimension
`n_inner`	intermediate_size	MLP hidden dimension
`n_head`	num_attention_heads	Number of query heads
`n_layer`	num_hidden_layers	Number of transformer blocks
`n_positions`	max_position_embeddings	Maximum sequence length for learned position embeddings
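The mapping in the table above can be expressed as a small lookup. The sketch is illustrative only: `GPT2_STYLE_TO_STANDARD` and `to_standard_names` are hypothetical names, and GPTBigCodeConfig itself keeps the GPT-2 style fields.

```python
# Field-name mapping taken from the Configuration Mapping table.
GPT2_STYLE_TO_STANDARD = {
    "n_embd": "hidden_size",
    "n_inner": "intermediate_size",
    "n_head": "num_attention_heads",
    "n_layer": "num_hidden_layers",
    "n_positions": "max_position_embeddings",
}


def to_standard_names(config: dict) -> dict:
    """Rename GPT-2 style keys to their standard equivalents.

    Hypothetical helper; keys without a mapping are passed through unchanged.
    """
    return {GPT2_STYLE_TO_STANDARD.get(key, key): value for key, value in config.items()}


renamed = to_standard_names({"n_embd": 2048, "n_head": 16, "vocab_size": 49280})
print(renamed)
```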

Usage Examples

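from mlc_llm.model.gpt_bigcode.gpt_bigcode_model import GPTBigCodeConfig, GPTBigCodeForCausalLM

# Create a GPTBigCode config using the GPT-2 style field names
config = GPTBigCodeConfig(
    n_embd=2048,
    n_inner=8192,
    n_head=16,
    n_layer=24,
    n_positions=2048,
    layer_norm_epsilon=1e-5,
    vocab_size=49280,
)
# Build the top-level causal LM module from the config
model = GPTBigCodeForCausalLM(config)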

Related Pages

Implementation:Mlc_ai_Mlc_llm_Model_Preset
