Implementation:Ollama Ollama Llama Model BailingMoE2
| Knowledge Sources | |
|---|---|
| Domains | LLM Inference, Model Architecture |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Implements the ggml computation graph builder for the BailingMoE2 model architecture, the second generation of BailingMoE.
Description
The llm_build_bailingmoe2 constructor builds the computation graph for the BailingMoE2 model. Attention uses a fused QKV projection (wqkv) whose output is split into Q, K, and V through tensor views, followed by RMS normalization of Q and K and RoPE-based positional encoding. The feed-forward path uses MoE layers. The builder also uses hparams.nextn_predict_layers to separate the regular transformer layers from the next-token prediction layers, which supports speculative decoding architectures.
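To make the fused QKV split concrete, here is a minimal sketch in plain C++ that does not use ggml. It assumes a fused row laid out as [Q | K | V], with Q spanning n_head heads and K/V spanning n_head_kv heads. The names QkvLayout, RowView, qkv_layout, view_q, view_k, and view_v are hypothetical and exist only for illustration; the real builder expresses the same idea with ggml tensor views.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type (not part of ggml or llama.cpp):
// element offsets of Q, K and V inside one fused QKV row.
struct QkvLayout {
    std::size_t q_offset;
    std::size_t k_offset;
    std::size_t v_offset;
    std::size_t row_size; // total elements per token
};

// Assumes the fused row is laid out as [Q | K | V].
inline QkvLayout qkv_layout(std::size_t head_dim,
                            std::size_t n_head,
                            std::size_t n_head_kv) {
    assert(head_dim > 0 && n_head > 0 && n_head_kv > 0);
    const std::size_t q_size  = head_dim * n_head;
    const std::size_t kv_size = head_dim * n_head_kv;
    return QkvLayout{0, q_size, q_size + kv_size, q_size + 2 * kv_size};
}

// Non-owning view: a pointer into the fused row plus a length, no copy.
struct RowView {
    const float * data;
    std::size_t   size;
};

inline RowView view_q(const std::vector<float> & row, const QkvLayout & l) {
    assert(row.size() == l.row_size);
    return RowView{row.data() + l.q_offset, l.k_offset - l.q_offset};
}

inline RowView view_k(const std::vector<float> & row, const QkvLayout & l) {
    assert(row.size() == l.row_size);
    return RowView{row.data() + l.k_offset, l.v_offset - l.k_offset};
}

inline RowView view_v(const std::vector<float> & row, const QkvLayout & l) {
    assert(row.size() == l.row_size);
    return RowView{row.data() + l.v_offset, l.row_size - l.v_offset};
}
```

Because each view only records an offset and a length, splitting the projection costs no extra memory traffic, which is the usual motivation for fusing the three projections into one matrix multiplication.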
Usage
Ollama uses this builder to run BailingMoE2 models through the llama.cpp inference engine, including the model's fused QKV attention projection and its speculative decoding architecture.
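The layer separation mentioned in the Description can be sketched as simple index arithmetic. This is an illustrative sketch, not the code from bailingmoe2.cpp: it assumes the next-token prediction layers are the trailing layers of the stack, and the function names n_transformer_layers and is_nextn_layer are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Number of regular transformer layers, assuming the last
// nextn_predict_layers entries of the stack are prediction layers.
inline std::uint32_t n_transformer_layers(std::uint32_t n_layer,
                                          std::uint32_t nextn_predict_layers) {
    assert(nextn_predict_layers <= n_layer);
    return n_layer - nextn_predict_layers;
}

// True when layer index il falls in the trailing prediction range.
inline bool is_nextn_layer(std::uint32_t il,
                           std::uint32_t n_layer,
                           std::uint32_t nextn_predict_layers) {
    assert(il < n_layer);
    return il >= n_transformer_layers(n_layer, nextn_predict_layers);
}
```

Under this assumption, a model with 21 layers and one prediction layer has 20 regular transformer layers (indices 0 to 19), and index 20 is the prediction layer.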
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/src/models/bailingmoe2.cpp
- Lines: 1-135
Signature
llm_build_bailingmoe2::llm_build_bailingmoe2(
    const llama_model & model,
    const llm_graph_params & params) : llm_graph_context(params)
Import
#include "models.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | const llama_model & | Yes | Loaded model with BailingMoE2 weights |
| params | const llm_graph_params & | Yes | Graph construction parameters |
Outputs
| Name | Type | Description |
|---|---|---|
| ggml graph | ggml_cgraph | BailingMoE2 computation graph, assembled during construction (the constructor itself has no return value), with the layer split driven by nextn_predict_layers |
Usage Examples
auto builder = llm_build_bailingmoe2(model, params);