Implementation:Mlc ai Mlc llm GPT BigCode Model
| Knowledge Sources | |
|---|---|
| Domains | Model_Architecture, LLM |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Implements the GPTBigCode architecture for causal language modeling within the MLC LLM framework, featuring multi-query attention and learned positional embeddings.
Description
This module provides the TVM Relax-based implementation of the GPTBigCode model architecture (used in models like StarCoder). Unlike most modern transformer architectures in MLC LLM, GPTBigCode has several distinctive characteristics:
- Multi-query attention (MQA): Uses a single key-value head (num_kv_heads = 1) shared across all query heads, which shrinks the KV cache by a factor of n_head relative to standard multi-head attention.
- Learned positional embeddings: Uses a separate nn.Embedding layer (wpe) for position embeddings instead of rotary position embeddings (RoPE). The KV cache is created with RopeMode.NONE.
- LayerNorm: Uses standard LayerNorm instead of the RMSNorm that is common in newer architectures.
- Biased projections: All linear layers (attention and MLP) include bias terms.
- GeLU activation: The MLP uses standard GeLU activation without gating.
- No tensor parallelism for attention: Explicitly asserts that tensor_parallel_shards == 1 for attention due to the single KV head design.
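To make the memory effect of the single KV head concrete, the following back-of-the-envelope sketch compares per-token KV cache size for multi-head and multi-query attention. The helper name `kv_cache_bytes_per_token` and the 2-byte (float16) element size are assumptions made for illustration; they are not part of the MLC LLM API, and the real paged KV cache adds page-level overhead that is ignored here.

```python
def kv_cache_bytes_per_token(n_layer: int, num_kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache one token: a K and a V vector per KV head per layer."""
    return 2 * n_layer * num_kv_heads * head_dim * bytes_per_elem


# Illustrative dimensions (same values as the config example on this page).
n_embd, n_head, n_layer = 2048, 16, 24
head_dim = n_embd // n_head

# Multi-head attention would keep one KV head per query head.
mha_bytes = kv_cache_bytes_per_token(n_layer, num_kv_heads=n_head, head_dim=head_dim)
# Multi-query attention keeps exactly one KV head.
mqa_bytes = kv_cache_bytes_per_token(n_layer, num_kv_heads=1, head_dim=head_dim)

print(mha_bytes, mqa_bytes, mha_bytes // mqa_bytes)  # 196608 12288 16
```

The reduction factor equals n_head. The same arithmetic hints at why attention is not sharded: with only one KV head, there is no head dimension to divide across tensor-parallel shards.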
The model consists of GPTBigCodeModel (embedding + transformer blocks + final LayerNorm), wrapped by GPTBigCodeForCausalLM which adds the language modeling head and inference methods.
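The embedding stage can be illustrated with a minimal pure-Python sketch: each token's hidden state is the sum of its wte row and the wpe row for its absolute position. The function `embed_with_positions` and the toy tables are hypothetical and only mirror the idea; in the actual module the lookups are nn.Embedding layers, and how positions are obtained from the KV cache is not shown here.

```python
from typing import List


def embed_with_positions(token_ids: List[int], start_pos: int,
                         wte: List[List[float]],
                         wpe: List[List[float]]) -> List[List[float]]:
    """Return wte[token] + wpe[position] for each token, starting at start_pos."""
    n_positions = len(wpe)
    hidden = []
    for offset, tok in enumerate(token_ids):
        pos = start_pos + offset
        if pos >= n_positions:
            # Learned position tables have a hard upper bound, unlike RoPE.
            raise IndexError(f"position {pos} exceeds n_positions={n_positions}")
        hidden.append([t + p for t, p in zip(wte[tok], wpe[pos])])
    return hidden


# Toy tables: vocab_size=3, n_embd=2, n_positions=4.
wte = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
wpe = [[0.0, 0.0], [0.25, 0.25], [0.5, 0.5], [0.75, 0.75]]

hidden = embed_with_positions([2, 1], start_pos=1, wte=wte, wpe=wpe)
print(hidden)  # [[2.25, 2.25], [1.5, 1.5]]
```

Because the position table has exactly n_positions rows, sequences cannot extend past that length without retraining or resizing the table.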
Usage
Use this module when compiling GPTBigCode-family models (e.g., StarCoder, SantaCoder) for deployment with MLC LLM. The model is identified by the gpt_bigcode model type in configuration files.
Code Reference
Source Location
- Repository: Mlc_ai_Mlc_llm
- File: python/mlc_llm/model/gpt_bigcode/gpt_bigcode_model.py
Signature
@dataclasses.dataclass
class GPTBigCodeConfig(ConfigBase):
    n_embd: int
    n_inner: int
    n_head: int
    n_layer: int
    n_positions: int
    layer_norm_epsilon: float
    vocab_size: int
    context_window_size: int = 0
    prefill_chunk_size: int = 0
    tensor_parallel_shards: int = 1
    max_batch_size: int = 1
    ...

class GPTBigCodeForCausalLM(nn.Module):
    def __init__(self, config: GPTBigCodeConfig): ...
    def embed(self, input_ids: Tensor): ...
    def prefill(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def decode(self, input_embed: Tensor, paged_kv_cache: PagedKVCache): ...
    def batch_prefill(self, input_embeds, logit_positions, paged_kv_cache): ...
    def batch_decode(self, input_embeds, paged_kv_cache): ...
    def batch_verify(self, input_embeds, paged_kv_cache): ...
    def create_paged_kv_cache(self, ...): ...
    def get_default_spec(self): ...
Import
from mlc_llm.model.gpt_bigcode.gpt_bigcode_model import GPTBigCodeConfig, GPTBigCodeForCausalLM
I/O Contract
Primary Classes
| Class | Role | Key Characteristics |
|---|---|---|
| GPTBigCodeConfig | Model configuration | Uses GPT-2 style naming (n_embd, n_inner, n_head, n_layer, n_positions) |
| GPTBigCodeMLP | Feed-forward network | Standard GeLU activation, c_fc/c_proj with bias |
| GPTBigCodeAttention | Multi-query attention | Single KV head, fused QKV projection (c_attn), biased projections |
| GPTBigCodeBlock | Transformer block | Pre-norm with LayerNorm (ln_1, ln_2) |
| GPTBigCodeModel | Core model | Token embeddings (wte) + position embeddings (wpe) + blocks + final LayerNorm |
| GPTBigCodeForCausalLM | Top-level model | Adds lm_head for causal LM |
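The fused QKV projection listed for GPTBigCodeAttention can be sketched as follows. With one KV head, the c_attn output for a single token would hold (n_head + 2) * head_dim values. The layout assumed here (all query heads first, then the shared key, then the shared value) and the helper `split_fused_qkv` are illustrative assumptions, not code taken from the module.

```python
from typing import List, Tuple


def split_fused_qkv(fused: List[float], n_head: int,
                    head_dim: int) -> Tuple[List[List[float]], List[float], List[float]]:
    """Split one token's fused projection into n_head queries plus one shared K and V."""
    expected = (n_head + 2) * head_dim
    if len(fused) != expected:
        raise ValueError(f"expected {expected} values, got {len(fused)}")
    q_end = n_head * head_dim
    q_heads = [fused[i * head_dim:(i + 1) * head_dim] for i in range(n_head)]
    k = fused[q_end:q_end + head_dim]   # single key head
    v = fused[q_end + head_dim:]        # single value head
    return q_heads, k, v


n_head, head_dim = 4, 2
fused = [float(i) for i in range((n_head + 2) * head_dim)]  # 0.0 .. 11.0
q_heads, k, v = split_fused_qkv(fused, n_head, head_dim)

# Every query head scores against the same key vector.
scores = [sum(qi * ki for qi, ki in zip(q, k)) for q in q_heads]
print(k, v, scores)  # [8.0, 9.0] [10.0, 11.0] [9.0, 43.0, 77.0, 111.0]
```

In standard multi-head attention the fused width would be 3 * n_head * head_dim; here it is only (n_head + 2) * head_dim, which is where the projection and cache savings come from.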
Forward Methods
| Method | Input | Output |
|---|---|---|
| embed | Tensor[seq_len] (int32) | Tensor[1, seq_len, n_embd] |
| prefill | Tensor[1, seq_len, n_embd], PagedKVCache | (Tensor[1, 1, vocab_size], PagedKVCache) |
| decode | Tensor[1, 1, n_embd], PagedKVCache | (Tensor[1, 1, vocab_size], PagedKVCache) |
| batch_prefill | Tensor[1, seq_len, n_embd], Tensor[batch_size], PagedKVCache | (Tensor, PagedKVCache) |
| batch_decode | Tensor[batch_size, 1, n_embd], PagedKVCache | (Tensor, PagedKVCache) |
Configuration Mapping
| GPTBigCode Field | Standard Equivalent | Description |
|---|---|---|
| n_embd | hidden_size | Embedding dimension |
| n_inner | intermediate_size | MLP hidden dimension |
| n_head | num_attention_heads | Number of query heads |
| n_layer | num_hidden_layers | Number of transformer blocks |
| n_positions | max_position_embeddings | Maximum sequence length for learned position embeddings |
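When comparing against configurations written with the standard names, the table above can be applied mechanically. The sketch below is a hypothetical helper (`to_gpt_bigcode_fields` is not part of MLC LLM) that renames standard keys to their GPT-2 style equivalents and leaves all other keys untouched.

```python
from typing import Any, Dict

# Standard name -> GPTBigCode field, taken from the mapping table above.
STANDARD_TO_GPT_BIGCODE = {
    "hidden_size": "n_embd",
    "intermediate_size": "n_inner",
    "num_attention_heads": "n_head",
    "num_hidden_layers": "n_layer",
    "max_position_embeddings": "n_positions",
}


def to_gpt_bigcode_fields(standard: Dict[str, Any]) -> Dict[str, Any]:
    """Rename standard config keys to GPTBigCode names; unknown keys pass through."""
    return {STANDARD_TO_GPT_BIGCODE.get(key, key): value
            for key, value in standard.items()}


renamed = to_gpt_bigcode_fields({
    "hidden_size": 2048,
    "num_attention_heads": 16,
    "vocab_size": 49280,
})
print(renamed)  # {'n_embd': 2048, 'n_head': 16, 'vocab_size': 49280}
```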
Usage Examples
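from mlc_llm.model.gpt_bigcode.gpt_bigcode_model import GPTBigCodeConfig, GPTBigCodeForCausalLM

# Creating a GPTBigCode config (all seven fields below have no default and must be set)
config = GPTBigCodeConfig(
    n_embd=2048,
    n_inner=8192,
    n_head=16,
    n_layer=24,
    n_positions=2048,
    layer_norm_epsilon=1e-5,
    vocab_size=49280,
)
# Build the top-level causal LM module from the config
model = GPTBigCodeForCausalLM(config)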