Principle:Ggml org Llama cpp Multimodal Encoding And Generation

Aspect	Detail
Principle Name	Multimodal Encoding And Generation
Domain	Multimodal Inference
Scope	Joint multimodal encoding: tokenizing mixed text+media input, encoding media through projector, and generating text output
Related Workflow	Multimodal_Inference

Overview

Description

The final stage of the multimodal inference pipeline brings together text and media inputs into a unified sequence that the language model can process. This involves three sub-steps: tokenization (splitting mixed text+media prompts into chunks), encoding (running media through the projector to produce embeddings), and evaluation (feeding the combined sequence of text tokens and media embeddings through the language model for autoregressive generation).

Usage

After the multimodal context has been initialized and bitmaps have been prepared, the user:

Calls mtmd_tokenize() to split the prompt into an ordered list of text and media chunks
Calls mtmd_encode_chunk() on each media chunk to produce embeddings and decodes all chunks in order (or uses the helper mtmd_helper_eval_chunks(), which automates encoding and decoding for the whole chunk list)
Reads generated logits from the language model context for sampling

Theoretical Basis

Multimodal Tokenization:

Standard text tokenization converts a string into a sequence of token IDs. Multimodal tokenization extends this by identifying media marker positions in the input text (default marker: <__media__>) and replacing each marker with a placeholder for the corresponding media input. The result is an ordered sequence of chunks, where each chunk is either:

A text chunk: Contains a sequence of llama_token IDs from the text tokenizer
An image chunk: Contains mtmd_image_tokens, the placeholders that the projected vision features fill once the chunk is encoded
An audio chunk: Contains the audio input whose projected features are produced when the chunk is encoded

The number of media markers in the prompt must exactly match the number of bitmaps provided. The tokenizer handles the insertion of special tokens (e.g., <start_of_image>, <end_of_image>) as required by the specific model architecture.

For example, given the prompt:

"here is an image: <__media__>\ndescribe it in detail."

The tokenizer produces three chunks:

Text chunk: "here is an image: <start_of_image>" (tokenized)
Image chunk: (image token placeholders, count determined by the model architecture)
Text chunk: "<end_of_image>\ndescribe it in detail." (tokenized)
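The splitting step alone can be illustrated with a minimal, self-contained sketch. It is not the mtmd implementation: the Chunk type and the split_on_marker function are hypothetical names introduced here, the text pieces are left untokenized, and the model-specific special tokens such as <start_of_image> are deliberately not inserted. It only shows how markers partition the prompt and how a marker/media count mismatch can be rejected.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical chunk type for illustration; not the mtmd chunk structure.
enum class ChunkKind { Text, Media };

struct Chunk {
    ChunkKind   kind;
    std::string text;         // raw text for Text chunks (not tokenized here)
    std::size_t media_index;  // index into the media list for Media chunks
};

// Split `prompt` at every occurrence of `marker`, emitting text and media
// chunks in order. Empty text pieces are skipped. Throws if the number of
// markers found differs from `n_media`.
std::vector<Chunk> split_on_marker(const std::string & prompt,
                                   std::size_t n_media,
                                   const std::string & marker = "<__media__>") {
    if (marker.empty()) {
        throw std::invalid_argument("marker must not be empty");
    }
    std::vector<Chunk> chunks;
    std::size_t pos = 0;
    std::size_t n_found = 0;
    while (true) {
        const std::size_t hit = prompt.find(marker, pos);
        const std::size_t len =
            (hit == std::string::npos) ? std::string::npos : hit - pos;
        const std::string piece = prompt.substr(pos, len);
        if (!piece.empty()) {
            chunks.push_back({ChunkKind::Text, piece, 0});
        }
        if (hit == std::string::npos) {
            break;  // no more markers; the trailing text was handled above
        }
        chunks.push_back({ChunkKind::Media, "", n_found});
        ++n_found;
        pos = hit + marker.size();
    }
    if (n_found != n_media) {
        throw std::invalid_argument("marker count does not match media count");
    }
    return chunks;
}
```

Applied to the example prompt with one media input, this sketch yields a text piece, a media placeholder, and a second text piece, in that order.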

Media Encoding:

For each image or audio chunk, the encoding step runs the media data through the appropriate encoder (vision or audio) and the projection layers. This produces a tensor of floating-point embeddings with shape [n_tokens, n_embd], where n_embd matches the text model's embedding dimension. The encoding is performed by mtmd_encode_chunk(), which dispatches to the appropriate backend (vision CLIP model or audio model).
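To make the [n_tokens, n_embd] shape concrete, the following sketch models the encoder output as a flat float buffer. EmbdBuffer is a hypothetical type written for this page, and the contiguous row-major layout (one row of n_embd floats per media token) is an assumption of the sketch rather than a documented guarantee.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container for illustration: n_tokens rows of n_embd floats,
// stored contiguously in row-major order (an assumption of this sketch).
struct EmbdBuffer {
    std::size_t        n_tokens;
    std::size_t        n_embd;
    std::vector<float> data;

    EmbdBuffer(std::size_t n_tokens_, std::size_t n_embd_)
        : n_tokens(n_tokens_), n_embd(n_embd_), data(n_tokens_ * n_embd_, 0.0f) {}

    // Pointer to the first element of the embedding for media token `i`.
    float * row(std::size_t i) { return data.data() + i * n_embd; }

    // Total memory taken by the embeddings, in bytes.
    std::size_t size_bytes() const { return data.size() * sizeof(float); }
};
```

Under this layout, the embedding of media token i starts n_embd floats after that of token i - 1, so a consumer can hand any contiguous range of rows to the language model without copying.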

Evaluation (Decoding):

The evaluation step feeds the combined sequence through the language model:

Text chunks are processed via standard llama_decode() with token IDs
Media chunks are processed by first retrieving the encoded embeddings via mtmd_get_output_embd(), then feeding them as continuous embeddings into the model

The helper function mtmd_helper_eval_chunks() orchestrates this entire process, handling batching, position tracking, and the special attention mask requirements of certain models (e.g., Gemma-3 requires non-causal attention for image tokens).
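The batching part of that orchestration can be sketched independently of the library. The function below is a hypothetical illustration, not the helper's code: given the number of tokens (or embedding rows) in a chunk and a batch size, here called n_batch, it plans the slices that would be submitted to the model one decode call at a time.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One slice of a chunk: `count` consecutive tokens starting at `offset`.
struct BatchSlice {
    std::size_t offset;
    std::size_t count;
};

// Hypothetical planner: cut `n_tokens` items into slices of at most
// `n_batch` items each. Returns an empty plan when `n_batch` is zero.
std::vector<BatchSlice> plan_batches(std::size_t n_tokens, std::size_t n_batch) {
    std::vector<BatchSlice> slices;
    if (n_batch == 0) {
        return slices;
    }
    for (std::size_t off = 0; off < n_tokens; off += n_batch) {
        slices.push_back({off, std::min(n_batch, n_tokens - off)});
    }
    return slices;
}
```

Only the final slice can be shorter than n_batch, and the slice offsets double as row offsets into the embedding buffer of a media chunk.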

Position Tracking:

Each token in the combined sequence has a position index that is tracked cumulatively across chunks. For standard models, one token occupies one position, so a chunk advances the position counter by its token count. For models using M-RoPE (Multi-dimensional Rotary Position Embedding), the number of positions a media chunk occupies can differ from its number of tokens, which is why the API provides separate mtmd_helper_get_n_tokens() and mtmd_helper_get_n_pos() functions.
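A small sketch shows why the two counts are kept apart. ChunkCounts, PositionPlan, and track_positions are hypothetical names, and the numbers in the usage note are made up for illustration; in real code the per-chunk counts would come from the library rather than being written by hand.

```cpp
#include <cstddef>
#include <vector>

// Per-chunk counts (hypothetical type): how many tokens the chunk holds
// and how many positions it consumes in the sequence.
struct ChunkCounts {
    std::size_t n_tokens;
    std::size_t n_pos;
};

struct PositionPlan {
    std::vector<std::size_t> start_pos;  // starting position of each chunk
    std::size_t n_past;                  // position counter after the last chunk
    std::size_t total_tokens;            // tokens processed across all chunks
};

// Accumulate positions chunk by chunk, starting from `n_past`.
PositionPlan track_positions(const std::vector<ChunkCounts> & chunks,
                             std::size_t n_past = 0) {
    PositionPlan plan{{}, n_past, 0};
    for (const ChunkCounts & c : chunks) {
        plan.start_pos.push_back(plan.n_past);
        plan.n_past       += c.n_pos;     // advance by positions, not tokens
        plan.total_tokens += c.n_tokens;
    }
    return plan;
}
```

For example, with made-up counts of {5, 5} for a text chunk, {256, 16} for an image chunk under an M-RoPE-style scheme, and {7, 7} for a closing text chunk, the chunks start at positions 0, 5, and 21, and the counter ends at 28 even though 268 tokens were processed.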
