
Implementation: ggml-org/llama.cpp Llama Get Embeddings

From Leeroopedia
Implementation Name: Llama Get Embeddings
Doc Type: API Doc
Domain: Embedding Extraction, C API
Description: llama_get_embeddings_ith(ctx, i) for per-token embeddings and llama_get_embeddings_seq(ctx, seq_id) for pooled sequence embeddings
Related Workflow: Embedding_Extraction (CORE)

Overview

Description

The Llama Get Embeddings implementation provides the core C API functions for extracting embedding vectors after a llama_decode() call. Two primary functions serve different embedding granularities: llama_get_embeddings_ith() retrieves per-token embeddings by position index, and llama_get_embeddings_seq() retrieves pooled embeddings by sequence ID. A third function, llama_get_embeddings(), returns the raw contiguous embedding buffer for all output positions.

Usage

#include "llama.h"

// After llama_decode(ctx, batch):

// Per-token embeddings (when pooling_type == LLAMA_POOLING_TYPE_NONE)
float * emb_token = llama_get_embeddings_ith(ctx, 0);  // first token
float * emb_last  = llama_get_embeddings_ith(ctx, -1); // last token

// Pooled sequence embeddings (when pooling_type != LLAMA_POOLING_TYPE_NONE)
float * emb_seq0 = llama_get_embeddings_seq(ctx, 0);  // sequence 0
float * emb_seq1 = llama_get_embeddings_seq(ctx, 1);  // sequence 1

Code Reference

Source Location (header): include/llama.h:988-999
Source Location (implementation): src/llama-context.cpp:3138-3154
Signature (get_embeddings): LLAMA_API float * llama_get_embeddings(struct llama_context * ctx)
Signature (get_embeddings_ith): LLAMA_API float * llama_get_embeddings_ith(struct llama_context * ctx, int32_t i)
Signature (get_embeddings_seq): LLAMA_API float * llama_get_embeddings_seq(struct llama_context * ctx, llama_seq_id seq_id)
Import: #include "llama.h"

API declarations (include/llama.h):

// when pooling_type == LLAMA_POOLING_TYPE_NONE or when using a generative model,
// the embeddings for which llama_batch.logits[i] != 0 are stored contiguously
// in the order they have appeared in the batch.
// shape: [n_outputs*n_embd]
// Otherwise, returns NULL.
LLAMA_API float * llama_get_embeddings(struct llama_context * ctx);

// Get the embeddings for the ith token. For positive indices, this is equivalent to:
// llama_get_embeddings(ctx) + ctx->output_ids[i]*n_embd
// Negative indices can be used to access embeddings in reverse order, -1 is the last embedding.
// shape: [n_embd] (1-dimensional)
// returns NULL for invalid ids.
LLAMA_API float * llama_get_embeddings_ith(struct llama_context * ctx, int32_t i);

// Get the embeddings for a sequence id
// Returns NULL if pooling_type is LLAMA_POOLING_TYPE_NONE
// when pooling_type == LLAMA_POOLING_TYPE_RANK, returns float[n_cls_out] with the rank(s) of the sequence
// otherwise: float[n_embd] (1-dimensional)
LLAMA_API float * llama_get_embeddings_seq(struct llama_context * ctx, llama_seq_id seq_id);

Implementation (src/llama-context.cpp):

float * llama_get_embeddings(llama_context * ctx) {
    ctx->synchronize();
    return ctx->get_embeddings();
}

float * llama_get_embeddings_ith(llama_context * ctx, int32_t i) {
    ctx->synchronize();
    return ctx->get_embeddings_ith(i);
}

float * llama_get_embeddings_seq(llama_context * ctx, llama_seq_id seq_id) {
    ctx->synchronize();
    return ctx->get_embeddings_seq(seq_id);
}

Usage pattern in embedding example (batch_decode function):

static void batch_decode(llama_context * ctx, llama_batch & batch, float * output, int n_seq, int n_embd_out, int embd_norm) {
    const enum llama_pooling_type pooling_type = llama_pooling_type(ctx);

    // clear previous kv_cache values (irrelevant for embeddings)
    llama_memory_clear(llama_get_memory(ctx), true);

    // run model
    if (llama_decode(ctx, batch) < 0) {
        LOG_ERR("%s : failed to process\n", __func__);
    }

    for (int i = 0; i < batch.n_tokens; i++) {
        if (!batch.logits[i]) {
            continue;
        }

        const float * embd = nullptr;
        int embd_pos = 0;

        if (pooling_type == LLAMA_POOLING_TYPE_NONE) {
            // try to get token embeddings
            embd = llama_get_embeddings_ith(ctx, i);
            embd_pos = i;
            GGML_ASSERT(embd != NULL && "failed to get token embeddings");
        } else {
            // try to get sequence embeddings
            embd = llama_get_embeddings_seq(ctx, batch.seq_id[i][0]);
            embd_pos = batch.seq_id[i][0];
            GGML_ASSERT(embd != NULL && "failed to get sequence embeddings");
        }

        float * out = output + embd_pos * n_embd_out;
        common_embd_normalize(embd, out, n_embd_out, embd_norm);
    }
}

I/O Contract

llama_get_embeddings()
  Input: llama_context *
  Output: float * to contiguous buffer, shape [n_outputs * n_embd]
  Returns NULL when: pooling type is not NONE and the model is not generative

llama_get_embeddings_ith()
  Input: llama_context *, int32_t i (supports negative indexing)
  Output: float * to a single embedding, shape [n_embd]
  Returns NULL when: the index i is invalid

llama_get_embeddings_seq()
  Input: llama_context *, llama_seq_id seq_id
  Output: float * to pooled embedding [n_embd], or rank scores [n_cls_out]
  Returns NULL when: pooling type is NONE, or seq_id is invalid

Synchronization behavior: All three functions call ctx->synchronize() before accessing embeddings, ensuring any pending asynchronous computation is complete. The returned pointers reference internal context memory and remain valid until the next llama_decode() call.

Embedding dimensions:

Per-token (no pooling): n_embd per token, via llama_get_embeddings_ith(ctx, i)
Pooled (mean/cls/last): n_embd per sequence, via llama_get_embeddings_seq(ctx, seq_id)
Rank (classification): n_cls_out per sequence, via llama_get_embeddings_seq(ctx, seq_id)

Usage Examples

Per-token embedding extraction:

// Context with LLAMA_POOLING_TYPE_NONE; n_embd obtained via llama_model_n_embd(model)
for (int i = 0; i < batch.n_tokens; i++) {
    float * embd = llama_get_embeddings_ith(ctx, i);
    if (embd) {
        // Process n_embd-dimensional vector for token i
        for (int d = 0; d < n_embd; d++) {
            printf("%f ", embd[d]);
        }
        printf("\n");
    }
}

Pooled sequence embedding extraction:

// Context with LLAMA_POOLING_TYPE_MEAN (or CLS, LAST)
for (int seq = 0; seq < n_sequences; seq++) {
    float * embd = llama_get_embeddings_seq(ctx, seq);
    if (embd) {
        // Single n_embd-dimensional vector for the entire sequence
        std::vector<float> normalized(n_embd);
        common_embd_normalize(embd, normalized.data(), n_embd, 2); // L2 norm
    }
}

Reverse indexing for last token:

// Get the last token's embedding without knowing the exact count
float * last_embd = llama_get_embeddings_ith(ctx, -1);
