
Principle: ggml-org/llama.cpp Embedding Batch Tokenization

From Leeroopedia
Principle Name: Embedding Batch Tokenization
Domain: Tokenization, Batch Processing
Description: Theory of batch tokenization with sequence ID assignment for multi-input embedding extraction
Related Workflow: Embedding_Extraction

Overview

Description

The Embedding Batch Tokenization principle defines the theory behind constructing token batches for efficient multi-input embedding extraction. When computing embeddings for multiple texts, the tokenized sequences must be packed into a shared batch structure with correct position encodings and sequence ID assignments. This enables the model to process multiple inputs in a single forward pass while maintaining per-sequence separation.

The batch tokenization process addresses:

  • Token packing: Adding individual tokens to the batch with their token IDs, positional indices, and sequence membership.
  • Sequence ID assignment: Each input text is assigned a unique sequence ID so that pooling operations can distinguish which tokens belong to which input.
  • Logits flag management: Controlling which tokens produce output embeddings. For pooled embeddings, typically only the last token per sequence (or all tokens for NONE pooling) needs the logits flag set.
  • Batch capacity management: Detecting when the batch is full and triggering a decode operation before adding more tokens, enabling processing of more inputs than can fit in a single batch.

Usage

Batch tokenization is used whenever multiple inputs need to be processed in a single inference pass. It is the mechanism that enables efficient throughput by amortizing the overhead of model computation across multiple inputs. The pattern is used in both the standalone embedding example and the server's embedding endpoint.

Theoretical Basis

Sequence-aware batching is the key abstraction that enables multi-input embedding computation. The llama_batch structure supports multiple sequences within a single batch through the seq_id field. Each token is tagged with one or more sequence IDs, allowing the attention mechanism to correctly scope attention masks: tokens attend only to other tokens in the same sequence (for non-causal/embedding models).

Position encoding independence means each sequence maintains its own position counter starting from zero. Token positions are assigned sequentially within each sequence (0, 1, 2, ..., n-1) regardless of where the sequence appears in the batch. This ensures positional embeddings are computed correctly for each input text independently.
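Position independence amounts to keeping one counter per sequence. The helper below is a hypothetical illustration of that bookkeeping, deriving per-sequence positions from token-to-sequence assignments in whatever order tokens were appended:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Assign each token the next position within its own sequence,
// so every sequence's positions run 0, 1, 2, ... independently of
// where its tokens sit in the batch.
std::vector<int32_t> assign_positions(const std::vector<int32_t> &seq_id) {
    std::map<int32_t, int32_t> next_pos; // per-sequence position counter
    std::vector<int32_t> pos;
    pos.reserve(seq_id.size());
    for (int32_t s : seq_id) {
        pos.push_back(next_pos[s]++);
    }
    return pos;
}
```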

The logits flag controls which token positions produce output embeddings. For efficiency:

  • With LLAMA_POOLING_TYPE_NONE: All tokens need logits=true since each produces an independent embedding.
  • With pooled types (MEAN, CLS, LAST): Setting logits=true on all tokens in a sequence still works, though only the pooled result is typically retrieved via llama_get_embeddings_seq().
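The two cases above can be summarized as a small decision function. The `Pooling` enum here mirrors llama.cpp's pooling-type names but is a local sketch; it follows the minimal-flags choice the document describes (last token only for pooled types), even though flagging every token also works:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Local stand-in for llama.cpp's pooling types.
enum class Pooling { NONE, MEAN, CLS, LAST };

// Minimal logits flags for an n-token sequence:
// - NONE: every token yields its own embedding, so all flags are set.
// - pooled types: one output per sequence suffices, so only the last
//   token is flagged (flagging all tokens would also work).
std::vector<bool> logits_flags(size_t n, Pooling p) {
    std::vector<bool> flags(n, false);
    if (p == Pooling::NONE) {
        flags.assign(n, true);
    } else {
        flags[n - 1] = true;
    }
    return flags;
}
```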

Batch overflow handling implements a streaming pattern where tokens are accumulated until the batch reaches capacity (either token count or sequence count limit), then a decode is triggered, results are extracted, and the batch is cleared for the next group of inputs. This allows processing an arbitrary number of inputs regardless of batch size, trading latency for completeness.
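The streaming pattern can be sketched as a flush-counting loop. The decode step is represented by a counter rather than a real forward pass, and the function and capacity check are illustrative, not llama.cpp code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct StreamStats {
    size_t decodes;          // how many batch flushes (forward passes) occurred
    size_t tokens_processed; // total tokens across all inputs
};

// Accumulate tokenized inputs into a batch of fixed token capacity.
// When the next input would overflow the batch, "decode" (flush), clear,
// and continue, so any number of inputs can be processed.
StreamStats process_all(const std::vector<std::vector<int32_t>> &inputs,
                        size_t batch_capacity) {
    size_t in_batch = 0, decodes = 0, total = 0;
    for (const auto &toks : inputs) {
        if (in_batch > 0 && in_batch + toks.size() > batch_capacity) {
            ++decodes;     // run one forward pass, extract embeddings
            in_batch = 0;  // clear the batch for the next group
        }
        in_batch += toks.size();
        total    += toks.size();
    }
    if (in_batch > 0) {
        ++decodes; // flush the final partial batch
    }
    return {decodes, total};
}
```

With three 3-token inputs and a 4-token capacity, each input forces its own flush; with a 6-token capacity the first two share one pass, which is the latency/throughput trade mentioned above.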
