Principle: ggml-org/llama.cpp Embedding Batch Tokenization
| Field | Value |
|---|---|
| Principle Name | Embedding Batch Tokenization |
| Domain | Tokenization, Batch Processing |
| Description | Theory of batch tokenization with sequence ID assignment for multi-input embedding extraction |
| Related Workflow | Embedding_Extraction |
Overview
Description
The Embedding Batch Tokenization principle defines the theory behind constructing token batches for efficient multi-input embedding extraction. When computing embeddings for multiple texts, the tokenized sequences must be packed into a shared batch structure with correct position encodings and sequence ID assignments. This enables the model to process multiple inputs in a single forward pass while maintaining per-sequence separation.
The batch tokenization process addresses:
- Token packing: Adding individual tokens to the batch with their token IDs, positional indices, and sequence membership.
- Sequence ID assignment: Each input text is assigned a unique sequence ID so that pooling operations can distinguish which tokens belong to which input.
- Logits flag management: Controlling which tokens produce output embeddings. For pooled embeddings, typically only the last token per sequence (or all tokens for NONE pooling) needs the logits flag set.
- Batch capacity management: Detecting when the batch is full and triggering a decode operation before adding more tokens, enabling processing of more inputs than can fit in a single batch.
Usage
Batch tokenization is used whenever multiple inputs need to be processed in a single inference pass. It is the mechanism that enables efficient throughput by amortizing the overhead of model computation across multiple inputs. The pattern is used in both the standalone embedding example and the server's embedding endpoint.
Theoretical Basis
Sequence-aware batching is the key abstraction that enables multi-input embedding computation. The llama_batch structure supports multiple sequences within a single batch through the seq_id field. Each token is tagged with one or more sequence IDs, allowing the attention mechanism to correctly scope attention masks -- tokens attend only to other tokens in the same sequence (for non-causal/embedding models).
Position encoding independence means each sequence maintains its own position counter starting from zero. Token positions are assigned sequentially within each sequence (0, 1, 2, ..., n-1) regardless of where the sequence appears in the batch. This ensures positional embeddings are computed correctly for each input text independently.
The logits flag controls which token positions produce output embeddings. For efficiency:
- With LLAMA_POOLING_TYPE_NONE: All tokens need logits=true since each produces an independent embedding.
- With pooled types (MEAN, CLS, LAST): Setting logits=true on all tokens in a sequence still works, though only the pooled result is typically retrieved via llama_get_embeddings_seq().
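The flag policy above can be expressed as a small helper. This is a hypothetical sketch (output_flags is not a llama.cpp function) implementing the minimal policy described: every token for NONE pooling, and only the last token per sequence for pooled types, with all-true also being valid for pooled types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal logits-flag policy for a sequence of n_tokens tokens:
// - pooled == false (LLAMA_POOLING_TYPE_NONE): every token outputs
//   its own embedding, so every flag is set.
// - pooled == true (MEAN/CLS/LAST): only the last token strictly
//   needs the flag here; setting all flags would also work.
std::vector<bool> output_flags(bool pooled, size_t n_tokens) {
    std::vector<bool> flags(n_tokens, !pooled); // NONE: all true
    if (pooled && n_tokens > 0) {
        flags[n_tokens - 1] = true;             // pooled: last token only
    }
    return flags;
}
```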
Batch overflow handling implements a streaming pattern where tokens are accumulated until the batch reaches capacity (either token count or sequence count limit), then a decode is triggered, results are extracted, and the batch is cleared for the next group of inputs. This allows processing an arbitrary number of inputs regardless of batch size, trading latency for completeness.
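The streaming pattern can be sketched as an accumulate/flush loop. Names here are illustrative: PendingBatch is a stand-in for llama_batch, and decode_and_extract stands in for the llama_decode call plus per-sequence embedding retrieval; capacity is checked against both a token limit and a sequence limit, mirroring the two limits described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Stand-in for an accumulating llama_batch.
struct PendingBatch {
    std::vector<int32_t> token, pos, seq_id;
    size_t n_seq = 0;
};

// Pack an arbitrary number of tokenized inputs through a
// capacity-limited batch. When the next input would overflow either
// the token or the sequence limit, flush the batch via
// decode_and_extract (a stand-in for llama_decode + result
// extraction), clear it, and keep going. Returns the decode count.
size_t process_all(const std::vector<std::vector<int32_t>> & inputs,
                   size_t n_batch_tokens, size_t n_batch_seqs,
                   const std::function<void(const PendingBatch &)> & decode_and_extract) {
    PendingBatch batch;
    size_t n_decodes = 0;
    for (const auto & tokens : inputs) {
        const bool overflow = batch.token.size() + tokens.size() > n_batch_tokens ||
                              batch.n_seq + 1 > n_batch_seqs;
        if (overflow && !batch.token.empty()) {
            decode_and_extract(batch);      // flush accumulated sequences
            batch = PendingBatch{};
            ++n_decodes;
        }
        const int32_t seq = (int32_t) batch.n_seq++;
        for (size_t i = 0; i < tokens.size(); ++i) {
            batch.token.push_back(tokens[i]);
            batch.pos.push_back((int32_t) i); // positions restart per sequence
            batch.seq_id.push_back(seq);
        }
    }
    if (!batch.token.empty()) {             // flush the final partial batch
        decode_and_extract(batch);
        ++n_decodes;
    }
    return n_decodes;
}
```

With a token limit of 6, three 3-token inputs take two decodes: the first two inputs share one batch, and the third overflows into a second.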