Implementation:Ollama Ollama Llama Model Granite Hybrid
| Knowledge Sources | |
|---|---|
| Domains | LLM Inference, Model Architecture |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Implements the ggml computation graph builder for the IBM Granite Hybrid architecture, which combines attention and Mamba2 state-space model layers.
Description
The llm_build_granite_hybrid class extends llm_graph_context_mamba, and its constructor builds a graph that combines transformer attention layers with Mamba2 recurrent layers. For each layer il, the constructor calls hparams.is_recurrent(il) to decide whether to apply SSM processing via build_mamba2_layer or self-attention (with optional RoPE). It also applies Granite-specific logit and embedding scaling. The attention sub-graph is built by the helper method build_attention_layer, and hybrid memory manages both the KV cache and the recurrent state.
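The per-layer choice described above can be illustrated with a small standalone sketch. It does not use ggml or llama.cpp: ToyHParams and plan_layers are hypothetical names introduced only for illustration, and the function returns a label per layer instead of building tensors. It assumes nothing about the real layer layout of any Granite Hybrid model.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the hyperparameters object; only the
// per-layer recurrent flag is modeled here.
struct ToyHParams {
    std::vector<bool> recurrent_layer; // true: Mamba2 (SSM) layer, false: attention layer

    bool is_recurrent(std::size_t il) const { return recurrent_layer.at(il); }
    std::size_t n_layer() const { return recurrent_layer.size(); }
};

// Walks the layers in order and records which kind of sub-graph would be
// built for each one. A real builder would emit tensors at this point;
// this sketch only returns labels.
std::vector<std::string> plan_layers(const ToyHParams & hp) {
    std::vector<std::string> plan;
    plan.reserve(hp.n_layer());
    for (std::size_t il = 0; il < hp.n_layer(); ++il) {
        if (hp.is_recurrent(il)) {
            plan.push_back("mamba2");    // recurrent branch (SSM state)
        } else {
            plan.push_back("attention"); // attention branch (KV cache)
        }
    }
    return plan;
}
```

For example, a ToyHParams whose flags are {true, true, false, true} yields the plan mamba2, mamba2, attention, mamba2. The two branches rely on different kinds of state, which is why the builder described above needs hybrid memory covering both the KV cache and the recurrent state.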
Usage
Enables Ollama to run IBM Granite Hybrid models that mix attention with state-space model layers for efficient long-context inference.
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/src/models/granite-hybrid.cpp
- Lines: 1-196
Signature
// Constructor (base class initialized from params):
llm_build_granite_hybrid::llm_build_granite_hybrid(
    const llama_model & model,
    const llm_graph_params & params) : llm_graph_context_mamba(params)
// Private helper:
ggml_tensor * build_attention_layer(
    ggml_tensor * cur, ggml_tensor * inp_pos,
    llm_graph_input_attn * inp_attn,
    const llama_model & model, int64_t n_embd_head, int il);
Import
#include "models.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | const llama_model & | Yes | Loaded model with Granite Hybrid weights |
| params | const llm_graph_params & | Yes | Graph construction parameters with hybrid memory context |
Outputs
| Name | Type | Description |
|---|---|---|
| ggml graph | ggml_cgraph | Hybrid computation graph mixing attention and SSM layers |
Usage Examples
auto builder = llm_build_granite_hybrid(model, params);