Implementation:Ggml org Llama cpp Mtmd Tokenize And Encode
| Aspect | Detail |
|---|---|
| Implementation Name | Mtmd Tokenize And Encode |
| Doc Type | API Doc |
| Domain | Multimodal Inference |
| Purpose | Tokenizing mixed text+media prompts, encoding media chunks, and evaluating the full sequence |
| Related Workflow | Multimodal_Inference |
Overview
Description
This implementation documents the three key functions that form the final stage of the multimodal inference pipeline:
- mtmd_tokenize(): Splits a mixed text+media prompt into an ordered list of text and media chunks
- mtmd_encode_chunk(): Encodes a single media chunk (image or audio) through the projector to produce embeddings
- mtmd_helper_eval_chunks(): Orchestrates the full evaluation of all chunks through the language model, handling text decoding, media encoding, and batching automatically
Usage
These functions are called in sequence after bitmaps have been prepared. The typical flow is:
- Create an empty mtmd_input_chunks container
- Call mtmd_tokenize() with the text prompt and bitmap array
- Either manually iterate over chunks calling mtmd_encode_chunk() and llama_decode(), or use mtmd_helper_eval_chunks() for the automated pipeline
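The splitting step of the flow above can be pictured with a stand-alone sketch. It is a simplified illustration and not the library's implementation: `sketch_chunk` and `split_on_markers` are hypothetical names, text pieces stay as strings instead of being tokenized, and no image preprocessing happens. It only shows how one marker per bitmap yields an ordered list of text and media chunks, and how a count mismatch maps to return code 1.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a chunk: either literal text or a reference to
// the i-th bitmap. The real library uses opaque mtmd_input_chunk objects.
struct sketch_chunk {
    bool        is_media;
    std::string text;         // meaningful when is_media == false
    size_t      bitmap_index; // meaningful when is_media == true
};

// Split `prompt` at every occurrence of `marker`, preserving order.
// Returns 0 on success and 1 when the number of markers differs from
// n_bitmaps (mirroring the documented return codes of mtmd_tokenize()).
// `out` is only written on success.
int split_on_markers(const std::string & prompt,
                     const std::string & marker,
                     size_t n_bitmaps,
                     std::vector<sketch_chunk> & out) {
    assert(!marker.empty());
    std::vector<sketch_chunk> chunks;
    size_t n_markers = 0;
    size_t pos = 0;
    while (true) {
        const size_t hit = prompt.find(marker, pos);
        const size_t len = (hit == std::string::npos) ? std::string::npos : hit - pos;
        const std::string piece = prompt.substr(pos, len);
        if (!piece.empty()) {
            chunks.push_back({false, piece, 0});
        }
        if (hit == std::string::npos) {
            break;
        }
        chunks.push_back({true, "", n_markers++});
        pos = hit + marker.size();
    }
    if (n_markers != n_bitmaps) {
        return 1; // bitmap/marker count mismatch
    }
    out = std::move(chunks);
    return 0;
}
```

Calling it with the prompt "Describe this image: <__media__>\nBe specific." and one bitmap yields three chunks in order: text, media, text.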
Code Reference
| Aspect | Detail |
|---|---|
| Header (tokenize/encode) | tools/mtmd/mtmd.h:158-228 |
| Source (tokenize) | tools/mtmd/mtmd.cpp:802-809 |
| Source (encode_chunk) | tools/mtmd/mtmd.cpp:811-838 |
| Header (helper eval) | tools/mtmd/mtmd-helper.h:56-63 |
| Source (helper eval) | tools/mtmd/mtmd-helper.cpp:379-406 |
| Import | #include "mtmd.h" and #include "mtmd-helper.h" |
mtmd_tokenize() signature (from mtmd.h):
// Tokenize an input text prompt and a list of bitmaps (images/audio).
// The prompt must have the input media marker (default: "<__media__>") in it.
// Number of bitmaps must equal the number of markers in the prompt.
// This function is thread-safe (shared ctx).
// Return values:
// 0 on success
// 1 on number of bitmaps not matching the number of markers
// 2 on image preprocessing error
MTMD_API int32_t mtmd_tokenize(mtmd_context * ctx,
                               mtmd_input_chunks * output,
                               const mtmd_input_text * text,
                               const mtmd_bitmap ** bitmaps,
                               size_t n_bitmaps);
mtmd_tokenize() source (mtmd.cpp:802-809):
int32_t mtmd_tokenize(mtmd_context * ctx,
                      mtmd_input_chunks * output,
                      const mtmd_input_text * text,
                      const mtmd_bitmap ** bitmaps,
                      size_t n_bitmaps) {
    mtmd_tokenizer tokenizer(ctx, text, bitmaps, n_bitmaps);
    return tokenizer.tokenize(output);
}
mtmd_encode_chunk() signature (from mtmd.h; source at mtmd.cpp:811-838):
// Returns 0 on success
MTMD_API int32_t mtmd_encode_chunk(mtmd_context * ctx,
                                   const mtmd_input_chunk * chunk);
The function dispatches based on chunk type:
- Text chunks: No-op (returns 0 with a warning)
- Image chunks: Encodes via the vision CLIP model
- Audio chunks: Encodes via the audio CLIP model using clip_image_batch_encode()
mtmd_helper_eval_chunks() signature (from mtmd-helper.h):
// Automatically:
// 1. Run llama_decode() on text chunks
// 2. Run mtmd_encode_chunk() on media chunks, then mtmd_get_output_embd()
// and then llama_decode()
// Returns 0 on success. NOT thread-safe.
MTMD_API int32_t mtmd_helper_eval_chunks(mtmd_context * ctx,
                                         struct llama_context * lctx,
                                         const mtmd_input_chunks * chunks,
                                         llama_pos n_past,
                                         llama_seq_id seq_id,
                                         int32_t n_batch,
                                         bool logits_last,
                                         llama_pos * new_n_past);
mtmd_helper_eval_chunks() source (mtmd-helper.cpp:379-406):
int32_t mtmd_helper_eval_chunks(mtmd_context * ctx,
                                struct llama_context * lctx,
                                const mtmd_input_chunks * chunks,
                                llama_pos n_past,
                                llama_seq_id seq_id,
                                int32_t n_batch,
                                bool logits_last,
                                llama_pos * new_n_past) {
    size_t n_chunks = mtmd_input_chunks_size(chunks);
    if (n_chunks == 0) {
        LOG_WRN("no chunks to eval\n");
        return 0;
    }
    for (size_t i = 0; i < n_chunks; i++) {
        bool chunk_logits_last = (i == n_chunks - 1) && logits_last;
        auto chunk = mtmd_input_chunks_get(chunks, i);
        int32_t res = mtmd_helper_eval_chunk_single(
            ctx, lctx, chunk, n_past, seq_id, n_batch, chunk_logits_last, &n_past);
        if (res != 0) {
            LOG_ERR("failed to eval chunk %zu\n", i);
            return res;
        }
        *new_n_past = n_past;
    }
    return 0;
}
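The loop quoted above threads n_past through every chunk and requests logits only for the final one. The following stand-alone sketch reproduces just that bookkeeping with plain integers, which may help when reasoning about KV cache positions. `eval_step`, `simulate_eval_chunks`, and the per-chunk position counts are hypothetical stand-ins; no model is evaluated, and the sketch assumes each chunk advances the position by a known count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record of one simulated chunk evaluation.
struct eval_step {
    int32_t start_pos;    // value of n_past when the chunk starts
    bool    wants_logits; // whether logits would be requested for this chunk
};

// Mimics the position and logits bookkeeping of the quoted loop.
// ASSUMPTION: chunk i advances the position by chunk_n_pos[i]; in the real
// helper the per-chunk evaluation call reports the new position itself.
int32_t simulate_eval_chunks(const std::vector<int32_t> & chunk_n_pos,
                             int32_t n_past,
                             bool logits_last,
                             std::vector<eval_step> & steps,
                             int32_t * new_n_past) {
    assert(new_n_past != nullptr);
    const size_t n_chunks = chunk_n_pos.size();
    for (size_t i = 0; i < n_chunks; i++) {
        // Only the final chunk may request logits, and only if the caller asked.
        const bool chunk_logits_last = (i == n_chunks - 1) && logits_last;
        steps.push_back({n_past, chunk_logits_last});
        n_past += chunk_n_pos[i];
        *new_n_past = n_past;
    }
    return 0;
}
```

With chunk sizes 5, 256, and 4 starting at position 0, the chunks begin at positions 0, 5, and 261, and the reported new position is 265. As in the quoted source, an empty chunk list leaves `*new_n_past` unwritten, so initializing that variable before the call is prudent.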
I/O Contract
mtmd_tokenize():
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | ctx | mtmd_context * | Multimodal context |
| Input | output | mtmd_input_chunks * | Pre-allocated empty chunks container (from mtmd_input_chunks_init()) |
| Input | text | const mtmd_input_text * | Input text with media markers and tokenization flags |
| Input | bitmaps | const mtmd_bitmap ** | Array of bitmap pointers (one per media marker) |
| Input | n_bitmaps | size_t | Number of bitmaps (must match marker count) |
| Output | (return) | int32_t | 0 = success, 1 = bitmap/marker count mismatch, 2 = preprocessing error |
| Output | output (mutated) | mtmd_input_chunks * | Populated with ordered text and media chunks |
mtmd_encode_chunk():
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | ctx | mtmd_context * | Multimodal context (contains encoder weights) |
| Input | chunk | const mtmd_input_chunk * | A single image or audio chunk from tokenization |
| Output | (return) | int32_t | 0 = success, 1 = encoding failure |
| Output | (side effect) | internal | Encoded embeddings stored in context, retrievable via mtmd_get_output_embd() |
mtmd_helper_eval_chunks():
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | ctx | mtmd_context * | Multimodal context |
| Input | lctx | struct llama_context * | Language model context for llama_decode() |
| Input | chunks | const mtmd_input_chunks * | Tokenized chunks from mtmd_tokenize() |
| Input | n_past | llama_pos | Starting position in KV cache |
| Input | seq_id | llama_seq_id | Sequence ID for KV cache |
| Input | n_batch | int32_t | Batch size for decoding |
| Input | logits_last | bool | If true, logits are requested for the last token of the final chunk; earlier chunks never request logits |
| Output | (return) | int32_t | 0 = success, non-zero = error from encode or decode |
| Output | new_n_past | llama_pos * | Updated position after all chunks have been evaluated |
Usage Examples
Example 1: Complete multimodal inference pipeline
#include <cstdio>
#include "mtmd.h"
#include "mtmd-helper.h"

// Assume mtmd_ctx, the llama model, and the llama context (lctx) are already initialized
// Assume bitmap_ptr points to a bitmap that has been loaded from file

// Prepare input text with media marker
mtmd_input_text input_text;
input_text.text          = "Describe this image: <__media__>\nBe specific.";
input_text.add_special   = true;
input_text.parse_special = true;

// Prepare bitmap array (one entry per marker in the prompt)
const mtmd_bitmap * bitmaps[] = { bitmap_ptr };

// Tokenize
mtmd_input_chunks * chunks = mtmd_input_chunks_init();
int32_t res = mtmd_tokenize(mtmd_ctx, chunks, &input_text, bitmaps, 1);
if (res != 0) {
    fprintf(stderr, "Tokenization failed: %d\n", res);
    mtmd_input_chunks_free(chunks);
    return 1;
}

// Evaluate all chunks through the model
// Arguments after chunks: n_past, seq_id = 0, n_batch = 512, logits_last = true
llama_pos n_past = 0;
res = mtmd_helper_eval_chunks(mtmd_ctx, lctx, chunks, n_past, 0, 512, true, &n_past);
if (res != 0) {
    fprintf(stderr, "Evaluation failed: %d\n", res);
    mtmd_input_chunks_free(chunks);
    return 1;
}

// Logits for the last evaluated token (position n_past - 1) are now available
// ... standard llama.cpp sampling loop ...

mtmd_input_chunks_free(chunks);
Example 2: Manual chunk-by-chunk processing
// After tokenization, manually process each chunk
size_t n_chunks = mtmd_input_chunks_size(chunks);
for (size_t i = 0; i < n_chunks; i++) {
    const mtmd_input_chunk * chunk = mtmd_input_chunks_get(chunks, i);
    enum mtmd_input_chunk_type type = mtmd_input_chunk_get_type(chunk);

    if (type == MTMD_INPUT_CHUNK_TYPE_TEXT) {
        size_t n_tokens;
        const llama_token * tokens = mtmd_input_chunk_get_tokens_text(chunk, &n_tokens);
        // Feed tokens to llama_decode()...
    } else {
        // Encode the media chunk
        int32_t res = mtmd_encode_chunk(mtmd_ctx, chunk);
        if (res != 0) {
            fprintf(stderr, "Encoding failed\n");
            break;
        }
        // Retrieve embeddings
        float * embd = mtmd_get_output_embd(mtmd_ctx);
        size_t n_tokens = mtmd_input_chunk_get_n_tokens(chunk);
        // Feed embeddings to llama_decode()...
    }
}
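When embeddings are fed to llama_decode() manually, a media chunk with more tokens than the batch size has to be split into several decode calls. The sketch below computes only the slice boundaries for such a split. It is an illustration under stated assumptions: `embd_slice` and `plan_embd_batches` are hypothetical names, the embedding buffer is assumed to be a flat row-major array of n_tokens rows with n_embd floats each, and n_embd is assumed to be known from the model. Any model-specific handling the helper may apply is not covered here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one decode call over part of the embeddings.
struct embd_slice {
    size_t token_offset; // index of the first token in this slice
    size_t n_tokens;     // number of tokens in this slice (<= n_batch)
    size_t float_offset; // offset into the flat float buffer for this slice
};

// Split n_tokens embedding rows into consecutive slices of at most n_batch
// rows. ASSUMPTION: rows are stored contiguously, n_embd floats per token.
std::vector<embd_slice> plan_embd_batches(size_t n_tokens, size_t n_embd, size_t n_batch) {
    assert(n_batch > 0);
    std::vector<embd_slice> slices;
    for (size_t off = 0; off < n_tokens; off += n_batch) {
        const size_t n = std::min(n_batch, n_tokens - off);
        slices.push_back({off, n, off * n_embd});
    }
    return slices;
}
```

For a slice `s`, the pointer handed to the decode step would be `embd + s.float_offset`, covering `s.n_tokens` rows. For example, 600 tokens with a batch size of 512 produce two slices of 512 and 88 tokens.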