Implementation: Llama Get Embeddings (ggml-org/llama.cpp)
| Field | Value |
|---|---|
| Implementation Name | Llama Get Embeddings |
| Doc Type | API Doc |
| Domain | Embedding Extraction, C API |
| Description | llama_get_embeddings_ith(ctx, i) for per-token embeddings and llama_get_embeddings_seq(ctx, seq_id) for pooled sequence embeddings |
| Related Workflow | Embedding_Extraction (CORE) |
Overview
Description
The Llama Get Embeddings implementation provides the core C API functions for extracting embedding vectors after a llama_decode() call. Two primary functions serve different embedding granularities: llama_get_embeddings_ith() retrieves per-token embeddings by position index, and llama_get_embeddings_seq() retrieves pooled embeddings by sequence ID. A third function, llama_get_embeddings(), returns the raw contiguous embedding buffer for all output positions.
Usage
```c
#include "llama.h"

// After llama_decode(ctx, batch):

// Per-token embeddings (when pooling_type == LLAMA_POOLING_TYPE_NONE)
float * emb_token = llama_get_embeddings_ith(ctx, 0);  // first output
float * emb_last  = llama_get_embeddings_ith(ctx, -1); // last output

// Pooled sequence embeddings (when pooling_type != LLAMA_POOLING_TYPE_NONE)
float * emb_seq0 = llama_get_embeddings_seq(ctx, 0); // sequence 0
float * emb_seq1 = llama_get_embeddings_seq(ctx, 1); // sequence 1
```
Code Reference
| Field | Value |
|---|---|
| Source Location (header) | include/llama.h:988-999 |
| Source Location (implementation) | src/llama-context.cpp:3138-3154 |
| Signature (get_embeddings) | LLAMA_API float * llama_get_embeddings(struct llama_context * ctx) |
| Signature (get_embeddings_ith) | LLAMA_API float * llama_get_embeddings_ith(struct llama_context * ctx, int32_t i) |
| Signature (get_embeddings_seq) | LLAMA_API float * llama_get_embeddings_seq(struct llama_context * ctx, llama_seq_id seq_id) |
| Import | #include "llama.h" |
API declarations (include/llama.h):
```c
// when pooling_type == LLAMA_POOLING_TYPE_NONE or when using a generative model,
// the embeddings for which llama_batch.logits[i] != 0 are stored contiguously
// in the order they have appeared in the batch.
// shape: [n_outputs*n_embd]
// Otherwise, returns NULL.
LLAMA_API float * llama_get_embeddings(struct llama_context * ctx);

// Get the embeddings for the ith token. For positive indices, equivalent to:
// llama_get_embeddings(ctx) + ctx->output_ids[i]*n_embd
// Negative indices can be used to access embeddings in reverse order, -1 is the last embedding.
// shape: [n_embd] (1-dimensional)
// returns NULL for invalid ids.
LLAMA_API float * llama_get_embeddings_ith(struct llama_context * ctx, int32_t i);

// Get the embeddings for a sequence id
// Returns NULL if pooling_type is LLAMA_POOLING_TYPE_NONE
// when pooling_type == LLAMA_POOLING_TYPE_RANK, returns float[n_cls_out] with the rank(s) of the sequence
// otherwise: float[n_embd] (1-dimensional)
LLAMA_API float * llama_get_embeddings_seq(struct llama_context * ctx, llama_seq_id seq_id);
```
Implementation (src/llama-context.cpp):
```c
float * llama_get_embeddings(llama_context * ctx) {
    ctx->synchronize();

    return ctx->get_embeddings();
}

float * llama_get_embeddings_ith(llama_context * ctx, int32_t i) {
    ctx->synchronize();

    return ctx->get_embeddings_ith(i);
}

float * llama_get_embeddings_seq(llama_context * ctx, llama_seq_id seq_id) {
    ctx->synchronize();

    return ctx->get_embeddings_seq(seq_id);
}
```
Usage pattern in embedding example (batch_decode function):
```cpp
static void batch_decode(llama_context * ctx, llama_batch & batch, float * output, int n_seq, int n_embd_out, int embd_norm) {
    const enum llama_pooling_type pooling_type = llama_pooling_type(ctx);

    // clear previous kv_cache values (irrelevant for embeddings)
    llama_memory_clear(llama_get_memory(ctx), true);

    // run model
    if (llama_decode(ctx, batch) < 0) {
        LOG_ERR("%s : failed to process\n", __func__);
    }

    for (int i = 0; i < batch.n_tokens; i++) {
        if (!batch.logits[i]) {
            continue;
        }

        const float * embd = nullptr;
        int embd_pos = 0;

        if (pooling_type == LLAMA_POOLING_TYPE_NONE) {
            // try to get token embeddings
            embd = llama_get_embeddings_ith(ctx, i);
            embd_pos = i;
            GGML_ASSERT(embd != NULL && "failed to get token embeddings");
        } else {
            // try to get sequence embeddings
            embd = llama_get_embeddings_seq(ctx, batch.seq_id[i][0]);
            embd_pos = batch.seq_id[i][0];
            GGML_ASSERT(embd != NULL && "failed to get sequence embeddings");
        }

        float * out = output + embd_pos * n_embd_out;
        common_embd_normalize(embd, out, n_embd_out, embd_norm);
    }
}
```
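The final call above, common_embd_normalize(embd, out, n_embd_out, embd_norm), writes a normalized copy of each embedding; with embd_norm == 2 this amounts to L2 normalization. A minimal standalone sketch of that operation (illustrative only, not the library's implementation):

```cpp
#include <cmath>
#include <cstddef>

// Illustrative L2 normalization: scale the vector to unit Euclidean
// length, roughly what common_embd_normalize(in, out, n, 2) computes.
void l2_normalize(const float * in, float * out, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += (double) in[i] * in[i];
    }
    const double norm = sum > 0.0 ? std::sqrt(sum) : 1.0; // avoid div by zero
    for (size_t i = 0; i < n; i++) {
        out[i] = (float)(in[i] / norm);
    }
}
```

L2-normalized embeddings are convenient because cosine similarity between them reduces to a plain dot product.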
I/O Contract
| Function | Input | Output | Returns NULL When |
|---|---|---|---|
| llama_get_embeddings() | llama_context * | float * to contiguous buffer [n_outputs * n_embd] | Pooling type is not NONE and model is not generative |
| llama_get_embeddings_ith() | llama_context *, int32_t i (supports negative indexing) | float * to single embedding [n_embd] | Invalid index i |
| llama_get_embeddings_seq() | llama_context *, llama_seq_id seq_id | float * to pooled embedding [n_embd] or rank scores [n_cls_out] | Pooling type is NONE; invalid seq_id |
Synchronization behavior: All three functions call ctx->synchronize() before accessing embeddings, ensuring any pending asynchronous computation is complete. The returned pointers reference internal context memory and remain valid until the next llama_decode() call.
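Because the next llama_decode() invalidates these pointers, callers that run multiple decodes typically copy each embedding into owned storage first. A minimal sketch (here src stands in for a pointer returned by llama_get_embeddings_ith()/llama_get_embeddings_seq(), and n_embd for the model's embedding width):

```cpp
#include <cstddef>
#include <vector>

// Copy one embedding out of the context-owned buffer so it remains
// valid after the next llama_decode() call overwrites that buffer.
std::vector<float> copy_embedding(const float * src, size_t n_embd) {
    return std::vector<float>(src, src + n_embd);
}
```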
Embedding dimensions:
| Scenario | Dimension | Retrieved Via |
|---|---|---|
| Per-token (no pooling) | n_embd per token | llama_get_embeddings_ith(ctx, i) |
| Pooled (mean/cls/last) | n_embd per sequence | llama_get_embeddings_seq(ctx, seq_id) |
| Rank (classification) | n_cls_out per sequence | llama_get_embeddings_seq(ctx, seq_id) |
Usage Examples
Per-token embedding extraction:
```cpp
// Context with LLAMA_POOLING_TYPE_NONE
for (int i = 0; i < batch.n_tokens; i++) {
    float * embd = llama_get_embeddings_ith(ctx, i);
    if (embd) {
        // Process the n_embd-dimensional vector for token i
        for (int d = 0; d < n_embd; d++) {
            printf("%f ", embd[d]);
        }
        printf("\n");
    }
}
```
Pooled sequence embedding extraction:
```cpp
// Context with LLAMA_POOLING_TYPE_MEAN (or CLS, LAST)
for (int seq = 0; seq < n_sequences; seq++) {
    float * embd = llama_get_embeddings_seq(ctx, seq);
    if (embd) {
        // Single n_embd-dimensional vector for the entire sequence
        std::vector<float> normalized(n_embd);
        common_embd_normalize(embd, normalized.data(), n_embd, 2); // L2 norm
    }
}
```
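Once pooled sequence embeddings are extracted, a typical downstream step is comparing sequences by cosine similarity. A self-contained sketch of that comparison (not part of the llama.cpp API):

```cpp
#include <cmath>
#include <cstddef>

// Cosine similarity between two embedding vectors of length n.
// For L2-normalized embeddings this reduces to a plain dot product.
float cosine_similarity(const float * a, const float * b, size_t n) {
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (size_t i = 0; i < n; i++) {
        dot += (double) a[i] * b[i];
        na  += (double) a[i] * a[i];
        nb  += (double) b[i] * b[i];
    }
    if (na == 0.0 || nb == 0.0) {
        return 0.0f; // degenerate zero vector
    }
    return (float)(dot / (std::sqrt(na) * std::sqrt(nb)));
}
```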
Reverse indexing for last token:
```c
// Get the last output token's embedding without knowing the exact count
float * last_embd = llama_get_embeddings_ith(ctx, -1);
```
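Per the header comment, a negative index is interpreted relative to the end of the output list, with -1 denoting the last embedding. The index resolution a caller might perform to validate an index up front can be sketched as follows (illustrative; the library does the equivalent internally and simply returns NULL for invalid indices):

```cpp
#include <cstdint>

// Resolve a possibly-negative embedding index against n_outputs
// (the number of stored output embeddings). Returns -1 for an
// out-of-range index, mirroring the API's NULL-on-invalid behavior.
int32_t resolve_embedding_index(int32_t i, int32_t n_outputs) {
    if (i < 0) {
        i += n_outputs; // -1 -> n_outputs - 1, -2 -> n_outputs - 2, ...
    }
    if (i < 0 || i >= n_outputs) {
        return -1; // invalid index
    }
    return i;
}
```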