
Implementation: ggml-org/llama.cpp Llama Get Embeddings

From Leeroopedia
Implementation Name: Llama Get Embeddings
Doc Type: API Doc
Domain: Embedding Extraction, C API
Description: llama_get_embeddings_ith(ctx, i) for per-token embeddings and llama_get_embeddings_seq(ctx, seq_id) for pooled sequence embeddings
Related Workflow: Embedding_Extraction (CORE)

Overview

Description

The Llama Get Embeddings implementation provides the core C API functions for extracting embedding vectors after a llama_decode() call. Two primary functions serve different embedding granularities: llama_get_embeddings_ith() retrieves per-token embeddings by position index, and llama_get_embeddings_seq() retrieves pooled embeddings by sequence ID. A third function, llama_get_embeddings(), returns the raw contiguous embedding buffer for all output positions.

Usage

#include "llama.h"

// After llama_decode(ctx, batch):

// Per-token embeddings (when pooling_type == LLAMA_POOLING_TYPE_NONE)
float * emb_token = llama_get_embeddings_ith(ctx, 0);  // first token
float * emb_last  = llama_get_embeddings_ith(ctx, -1); // last token

// Pooled sequence embeddings (when pooling_type != LLAMA_POOLING_TYPE_NONE)
float * emb_seq0 = llama_get_embeddings_seq(ctx, 0);  // sequence 0
float * emb_seq1 = llama_get_embeddings_seq(ctx, 1);  // sequence 1

Code Reference

Source Location (header): include/llama.h:988-999
Source Location (implementation): src/llama-context.cpp:3138-3154
Signature (get_embeddings): LLAMA_API float * llama_get_embeddings(struct llama_context * ctx)
Signature (get_embeddings_ith): LLAMA_API float * llama_get_embeddings_ith(struct llama_context * ctx, int32_t i)
Signature (get_embeddings_seq): LLAMA_API float * llama_get_embeddings_seq(struct llama_context * ctx, llama_seq_id seq_id)
Import: #include "llama.h"

API declarations (include/llama.h):

// when pooling_type == LLAMA_POOLING_TYPE_NONE or when using a generative model,
// the embeddings for which llama_batch.logits[i] != 0 are stored contiguously
// in the order they have appeared in the batch.
// shape: [n_outputs*n_embd]
// Otherwise, returns NULL.
LLAMA_API float * llama_get_embeddings(struct llama_context * ctx);

// Get the embeddings for the ith token. For positive indices, this is equivalent to:
// llama_get_embeddings(ctx) + ctx->output_ids[i]*n_embd
// Negative indices can be used to access embeddings in reverse order, -1 is the last embedding.
// shape: [n_embd] (1-dimensional)
// returns NULL for invalid ids.
LLAMA_API float * llama_get_embeddings_ith(struct llama_context * ctx, int32_t i);

// Get the embeddings for a sequence id
// Returns NULL if pooling_type is LLAMA_POOLING_TYPE_NONE
// when pooling_type == LLAMA_POOLING_TYPE_RANK, returns float[n_cls_out] with the rank(s) of the sequence
// otherwise: float[n_embd] (1-dimensional)
LLAMA_API float * llama_get_embeddings_seq(struct llama_context * ctx, llama_seq_id seq_id);

Implementation (src/llama-context.cpp):

float * llama_get_embeddings(llama_context * ctx) {
    ctx->synchronize();
    return ctx->get_embeddings();
}

float * llama_get_embeddings_ith(llama_context * ctx, int32_t i) {
    ctx->synchronize();
    return ctx->get_embeddings_ith(i);
}

float * llama_get_embeddings_seq(llama_context * ctx, llama_seq_id seq_id) {
    ctx->synchronize();
    return ctx->get_embeddings_seq(seq_id);
}

Usage pattern in embedding example (batch_decode function):

static void batch_decode(llama_context * ctx, llama_batch & batch, float * output, int n_seq, int n_embd_out, int embd_norm) {
    const enum llama_pooling_type pooling_type = llama_pooling_type(ctx);

    // clear previous kv_cache values (irrelevant for embeddings)
    llama_memory_clear(llama_get_memory(ctx), true);

    // run model
    if (llama_decode(ctx, batch) < 0) {
        LOG_ERR("%s : failed to process\n", __func__);
    }

    for (int i = 0; i < batch.n_tokens; i++) {
        if (!batch.logits[i]) {
            continue;
        }

        const float * embd = nullptr;
        int embd_pos = 0;

        if (pooling_type == LLAMA_POOLING_TYPE_NONE) {
            // try to get token embeddings
            embd = llama_get_embeddings_ith(ctx, i);
            embd_pos = i;
            GGML_ASSERT(embd != NULL && "failed to get token embeddings");
        } else {
            // try to get sequence embeddings
            embd = llama_get_embeddings_seq(ctx, batch.seq_id[i][0]);
            embd_pos = batch.seq_id[i][0];
            GGML_ASSERT(embd != NULL && "failed to get sequence embeddings");
        }

        float * out = output + embd_pos * n_embd_out;
        common_embd_normalize(embd, out, n_embd_out, embd_norm);
    }
}

I/O Contract

llama_get_embeddings()
  Input: llama_context *
  Output: float * to contiguous buffer, shape [n_outputs * n_embd]
  Returns NULL when: pooling type is not NONE and the model is not generative

llama_get_embeddings_ith()
  Input: llama_context *, int32_t i (supports negative indexing)
  Output: float * to a single embedding, shape [n_embd]
  Returns NULL when: the index i is invalid

llama_get_embeddings_seq()
  Input: llama_context *, llama_seq_id seq_id
  Output: float * to pooled embedding [n_embd], or rank scores [n_cls_out]
  Returns NULL when: pooling type is NONE, or seq_id is invalid

Synchronization behavior: All three functions call ctx->synchronize() before accessing embeddings, ensuring any pending asynchronous computation is complete. The returned pointers reference internal context memory and remain valid until the next llama_decode() call.

Embedding dimensions:

Per-token (no pooling): n_embd per token, via llama_get_embeddings_ith(ctx, i)
Pooled (mean/cls/last): n_embd per sequence, via llama_get_embeddings_seq(ctx, seq_id)
Rank (classification): n_cls_out per sequence, via llama_get_embeddings_seq(ctx, seq_id)

Usage Examples

Per-token embedding extraction:

// Context with LLAMA_POOLING_TYPE_NONE; n_embd obtained via llama_model_n_embd(model)
for (int i = 0; i < batch.n_tokens; i++) {
    float * embd = llama_get_embeddings_ith(ctx, i);
    if (embd) {
        // Process n_embd-dimensional vector for token i
        for (int d = 0; d < n_embd; d++) {
            printf("%f ", embd[d]);
        }
        printf("\n");
    }
}

Pooled sequence embedding extraction:

// Context with LLAMA_POOLING_TYPE_MEAN (or CLS, LAST)
for (int seq = 0; seq < n_sequences; seq++) {
    float * embd = llama_get_embeddings_seq(ctx, seq);
    if (embd) {
        // Single n_embd-dimensional vector for the entire sequence
        std::vector<float> normalized(n_embd);
        common_embd_normalize(embd, normalized.data(), n_embd, 2); // L2 norm
    }
}

Reverse indexing for last token:

// Get the last token's embedding without knowing the exact count
float * last_embd = llama_get_embeddings_ith(ctx, -1);
