Implementation:Ggml org Llama cpp Mtmd Tokenize And Encode

From Leeroopedia
Aspect | Detail
Implementation Name | Mtmd Tokenize And Encode
Doc Type | API Doc
Domain | Multimodal Inference
Purpose | Tokenizing mixed text+media prompts, encoding media chunks, and evaluating the full sequence
Related Workflow | Multimodal_Inference

Overview

Description

This page documents the three functions that form the final stage of the multimodal inference pipeline:

  • mtmd_tokenize(): Splits a mixed text+media prompt into an ordered list of text and media chunks
  • mtmd_encode_chunk(): Encodes a single media chunk (image or audio) through the projector to produce embeddings
  • mtmd_helper_eval_chunks(): Orchestrates the full evaluation of all chunks through the language model, handling text decoding, media encoding, and batching automatically

Usage

These functions are called in sequence after bitmaps have been prepared. The typical flow is:

  1. Create an empty mtmd_input_chunks container
  2. Call mtmd_tokenize() with the text prompt and bitmap array
  3. Either manually iterate over chunks calling mtmd_encode_chunk() and llama_decode(), or use mtmd_helper_eval_chunks() for the automated pipeline

Code Reference

Aspect | Detail
Header (tokenize/encode) | tools/mtmd/mtmd.h:158-228
Source (tokenize) | tools/mtmd/mtmd.cpp:802-809
Source (encode_chunk) | tools/mtmd/mtmd.cpp:811-838
Header (helper eval) | tools/mtmd/mtmd-helper.h:56-63
Source (helper eval) | tools/mtmd/mtmd-helper.cpp:379-406
Import | #include "mtmd.h" and #include "mtmd-helper.h"

mtmd_tokenize() signature (from mtmd.h):

// Tokenize an input text prompt and a list of bitmaps (images/audio).
// The prompt must have the input media marker (default: "<__media__>") in it.
// Number of bitmaps must equal the number of markers in the prompt.
// This function is thread-safe (shared ctx).
// Return values:
//   0 on success
//   1 on number of bitmaps not matching the number of markers
//   2 on image preprocessing error
MTMD_API int32_t mtmd_tokenize(mtmd_context * ctx,
                               mtmd_input_chunks * output,
                               const mtmd_input_text * text,
                               const mtmd_bitmap ** bitmaps,
                               size_t n_bitmaps);

mtmd_tokenize() source (mtmd.cpp:802-809):

int32_t mtmd_tokenize(mtmd_context * ctx,
            mtmd_input_chunks * output,
            const mtmd_input_text * text,
            const mtmd_bitmap ** bitmaps,
            size_t n_bitmaps) {
    mtmd_tokenizer tokenizer(ctx, text, bitmaps, n_bitmaps);
    return tokenizer.tokenize(output);
}
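
The marker contract described in the header comment can be illustrated without the library. The following is a minimal sketch, not the actual mtmd_tokenizer implementation: sketch_chunk and sketch_split_prompt are hypothetical names, text is kept as raw strings rather than token IDs, and bitmap preprocessing (return code 2) is not modeled. It only shows how a prompt breaks into ordered text and media slots, and why a marker/bitmap count mismatch yields 1.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a chunk: either a run of text or a media slot.
struct sketch_chunk {
    bool        is_media;     // true for a media slot, false for text
    std::string text;         // text content (empty for media slots)
    std::size_t bitmap_index; // which bitmap fills the slot (media only)
};

// Split `prompt` at every occurrence of `marker`, in order.
// Returns 0 on success and 1 if the marker count differs from n_bitmaps,
// mirroring the documented return codes. Preprocessing is not modeled.
int sketch_split_prompt(const std::string & prompt,
                        const std::string & marker,
                        std::size_t n_bitmaps,
                        std::vector<sketch_chunk> & out) {
    out.clear();
    if (marker.empty()) {
        return 1; // sketch-only guard: an empty marker would loop forever
    }
    std::size_t pos = 0;
    std::size_t n_markers = 0;
    while (true) {
        const std::size_t hit = prompt.find(marker, pos);
        const std::size_t len =
            (hit == std::string::npos) ? std::string::npos : hit - pos;
        const std::string piece = prompt.substr(pos, len);
        if (!piece.empty()) {
            out.push_back({false, piece, 0});
        }
        if (hit == std::string::npos) {
            break;
        }
        out.push_back({true, "", n_markers++});
        pos = hit + marker.size();
    }
    if (n_markers != n_bitmaps) {
        out.clear();
        return 1;
    }
    return 0;
}
```

For the prompt "Describe this image: <__media__>\nBe specific." and one bitmap, the sketch produces three chunks in order: a text chunk, a media slot bound to bitmap 0, and a trailing text chunk. The real tokenizer may also insert model-specific tokens around media chunks, which this sketch ignores.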

mtmd_encode_chunk() signature (implemented at mtmd.cpp:811-838):

// Returns 0 on success
MTMD_API int32_t mtmd_encode_chunk(mtmd_context * ctx,
                                   const mtmd_input_chunk * chunk);

The function dispatches based on chunk type:

  • Text chunks: No-op (returns 0 with a warning)
  • Image chunks: Encodes via the vision CLIP model
  • Audio chunks: Encodes via the audio CLIP model using clip_image_batch_encode()

mtmd_helper_eval_chunks() signature (from mtmd-helper.h):

// Automatically:
// 1. Run llama_decode() on text chunks
// 2. Run mtmd_encode_chunk() on media chunks, then mtmd_get_output_embd()
//    and then llama_decode()
// Returns 0 on success. NOT thread-safe.
MTMD_API int32_t mtmd_helper_eval_chunks(mtmd_context * ctx,
                                         struct llama_context * lctx,
                                         const mtmd_input_chunks * chunks,
                                         llama_pos n_past,
                                         llama_seq_id seq_id,
                                         int32_t n_batch,
                                         bool logits_last,
                                         llama_pos * new_n_past);

mtmd_helper_eval_chunks() source (mtmd-helper.cpp:379-406):

int32_t mtmd_helper_eval_chunks(mtmd_context * ctx,
                                struct llama_context * lctx,
                                const mtmd_input_chunks * chunks,
                                llama_pos n_past,
                                llama_seq_id seq_id,
                                int32_t n_batch,
                                bool logits_last,
                                llama_pos * new_n_past) {
    size_t n_chunks = mtmd_input_chunks_size(chunks);
    if (n_chunks == 0) {
        LOG_WRN("no chunks to eval\n");
        return 0;
    }
    for (size_t i = 0; i < n_chunks; i++) {
        bool chunk_logits_last = (i == n_chunks - 1) && logits_last;
        auto chunk = mtmd_input_chunks_get(chunks, i);
        int32_t res = mtmd_helper_eval_chunk_single(
            ctx, lctx, chunk, n_past, seq_id, n_batch, chunk_logits_last, &n_past);
        if (res != 0) {
            LOG_ERR("failed to eval chunk %zu\n", i);
            return res;
        }
        *new_n_past = n_past;
    }
    return 0;
}
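
Two details of the loop above are easy to miss: n_past is threaded from one chunk to the next, and logits are requested only for the final chunk. The following is a minimal sketch of that bookkeeping under simplifying assumptions. The types sketch_eval_chunk and sketch_eval_step and the function sketch_eval_chunks are hypothetical, and the per-chunk call is replaced by simply advancing the position by the chunk's position count.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical chunk summary: only the number of positions it occupies.
struct sketch_eval_chunk {
    int32_t n_pos;
};

// Record of what the loop requested for each chunk.
struct sketch_eval_step {
    int32_t start_pos;   // n_past handed to the per-chunk call
    bool    want_logits; // logits requested for this chunk's last token
};

// Mirrors the control flow of the loop above: n_past is carried across
// chunks, and logits are only requested on the final chunk.
int32_t sketch_eval_chunks(const std::vector<sketch_eval_chunk> & chunks,
                           int32_t n_past,
                           bool logits_last,
                           int32_t * new_n_past,
                           std::vector<sketch_eval_step> & steps) {
    steps.clear();
    const std::size_t n_chunks = chunks.size();
    for (std::size_t i = 0; i < n_chunks; i++) {
        const bool chunk_logits_last = (i == n_chunks - 1) && logits_last;
        steps.push_back({n_past, chunk_logits_last});
        n_past += chunks[i].n_pos; // stand-in for mtmd_helper_eval_chunk_single()
        *new_n_past = n_past;
    }
    return 0;
}
```

As in the source shown above, new_n_past is written only inside the loop, so it is left untouched when the chunk list is empty. Initializing the variable before the call, as Example 1 does with n_past = 0, avoids reading an indeterminate value.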

I/O Contract

mtmd_tokenize():

Direction | Name | Type | Description
Input | ctx | mtmd_context * | Multimodal context
Input | output | mtmd_input_chunks * | Pre-allocated empty chunks container (from mtmd_input_chunks_init())
Input | text | const mtmd_input_text * | Input text with media markers and tokenization flags
Input | bitmaps | const mtmd_bitmap ** | Array of bitmap pointers (one per media marker)
Input | n_bitmaps | size_t | Number of bitmaps (must match marker count)
Output | (return) | int32_t | 0 = success, 1 = bitmap/marker count mismatch, 2 = preprocessing error
Output | output (mutated) | mtmd_input_chunks * | Populated with ordered text and media chunks

mtmd_encode_chunk():

Direction | Name | Type | Description
Input | ctx | mtmd_context * | Multimodal context (contains encoder weights)
Input | chunk | const mtmd_input_chunk * | A single image or audio chunk from tokenization
Output | (return) | int32_t | 0 = success, 1 = encoding failure
Output | (side effect) | internal | Encoded embeddings stored in context, retrievable via mtmd_get_output_embd()

mtmd_helper_eval_chunks():

Direction | Name | Type | Description
Input | ctx | mtmd_context * | Multimodal context
Input | lctx | struct llama_context * | Language model context for llama_decode()
Input | chunks | const mtmd_input_chunks * | Tokenized chunks from mtmd_tokenize()
Input | n_past | llama_pos | Starting position in KV cache
Input | seq_id | llama_seq_id | Sequence ID for KV cache
Input | n_batch | int32_t | Maximum batch size for each decode call
Input | logits_last | bool | Whether to request logits for the last token of the final chunk (earlier chunks never request logits)
Output | (return) | int32_t | 0 = success, non-zero = error from encode or decode
Output | new_n_past | llama_pos * | Updated position after all chunks have been evaluated (not written if there are no chunks)

Usage Examples

Example 1: Complete multimodal inference pipeline

#include <stdio.h>   // fprintf

#include "mtmd.h"
#include "mtmd-helper.h"

// Assume mtmd_ctx (mtmd_context *) and lctx (llama_context *) are already initialized
// Assume bitmap_ptr (mtmd_bitmap *) has been loaded from file

// Prepare input text with media marker
mtmd_input_text input_text;
input_text.text = "Describe this image: <__media__>\nBe specific.";
input_text.add_special = true;
input_text.parse_special = true;

// Prepare bitmap array
const mtmd_bitmap * bitmaps[] = { bitmap_ptr };

// Tokenize
mtmd_input_chunks * chunks = mtmd_input_chunks_init();
int32_t res = mtmd_tokenize(mtmd_ctx, chunks, &input_text, bitmaps, 1);
if (res != 0) {
    fprintf(stderr, "Tokenization failed: %d\n", res);
    mtmd_input_chunks_free(chunks);
    return 1;
}

// Evaluate all chunks through the model
llama_pos n_past = 0;
res = mtmd_helper_eval_chunks(mtmd_ctx, lctx, chunks, 0, 0, 512, true, &n_past);
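if (res != 0) {
    fprintf(stderr, "Evaluation failed: %d\n", res);
    // Do not fall through to sampling after a failed evaluation
    mtmd_input_chunks_free(chunks);
    return 1;
}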

// Now sample from logits at position n_past-1
// ... standard llama.cpp sampling loop ...

mtmd_input_chunks_free(chunks);

Example 2: Manual chunk-by-chunk processing

// After tokenization, manually process each chunk
size_t n_chunks = mtmd_input_chunks_size(chunks);
for (size_t i = 0; i < n_chunks; i++) {
    const mtmd_input_chunk * chunk = mtmd_input_chunks_get(chunks, i);
    enum mtmd_input_chunk_type type = mtmd_input_chunk_get_type(chunk);

    if (type == MTMD_INPUT_CHUNK_TYPE_TEXT) {
        size_t n_tokens;
        const llama_token * tokens = mtmd_input_chunk_get_tokens_text(chunk, &n_tokens);
        // Feed tokens to llama_decode()...
    } else {
        // Encode the media chunk
        int32_t res = mtmd_encode_chunk(mtmd_ctx, chunk);
        if (res != 0) {
            fprintf(stderr, "Encoding failed\n");
            break;
        }
        // Retrieve embeddings
        float * embd = mtmd_get_output_embd(mtmd_ctx);
        size_t n_tokens = mtmd_input_chunk_get_n_tokens(chunk);
        // Feed embeddings to llama_decode()...
    }
}
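
When feeding embeddings manually, a media chunk may hold more tokens than fit in one batch, so the buffer has to be sliced. The following is a minimal sketch of that slicing arithmetic only, with no llama.cpp calls. It assumes the buffer returned by mtmd_get_output_embd() is laid out as n_tokens consecutive rows of n_embd floats, where n_embd is the text model's embedding width. The names sketch_embd_slice and sketch_slice_embd are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One slice of the embedding buffer to hand to a single decode call.
struct sketch_embd_slice {
    std::size_t float_offset; // offset into the float buffer
    std::size_t n_tokens;     // number of embedding rows in this slice
};

// Split n_tokens rows of n_embd floats each into slices of at most
// n_batch rows. The last slice may be shorter than n_batch.
std::vector<sketch_embd_slice> sketch_slice_embd(std::size_t n_tokens,
                                                 std::size_t n_embd,
                                                 std::size_t n_batch) {
    std::vector<sketch_embd_slice> slices;
    if (n_batch == 0) {
        return slices; // sketch-only guard: a zero batch size cannot make progress
    }
    for (std::size_t first = 0; first < n_tokens; first += n_batch) {
        const std::size_t count = std::min(n_batch, n_tokens - first);
        slices.push_back({first * n_embd, count});
    }
    return slices;
}
```

For each slice, the pointer passed on would be embd + float_offset, covering n_tokens rows. Some models may need a whole media chunk decoded in a single batch, in which case slicing like this would not apply. mtmd_helper_eval_chunks() handles such details automatically.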
