Principle:Ggml org Llama cpp Input Text Preparation
| Field | Value |
|---|---|
| Principle Name | Input Text Preparation |
| Domain | Text Preprocessing, Batch Computation |
| Description | Theory of preparing and splitting input texts for batch embedding computation |
| Related Workflow | Embedding_Extraction |
Overview
Description
The Input Text Preparation principle covers the theoretical basis for transforming raw input text into a format suitable for batch embedding extraction. Before the model can process the text, the input must be split into individual prompts, tokenized, validated against the batch size limit, and optionally combined with separator tokens for classification or reranking tasks.
The preparation pipeline addresses:
- Multi-prompt splitting: Dividing a single input string into multiple prompts using a configurable separator (default: newline).
- Tokenization and validation: Converting text prompts to token sequences and verifying that each sequence fits within the batch size limit.
- Special token handling: Ensuring EOS/SEP tokens are correctly appended as required by the embedding model's training protocol.
- Reranking pair formatting: For classification/reranking tasks, splitting query-document pairs and inserting the appropriate separator tokens or applying rerank prompt templates.
Usage
Input text preparation is applied whenever embedding extraction processes multiple texts. It is the bridge between user-provided strings and the tokenized batches the model consumes. The preparation step is critical for:
- Batch processing of multiple sentences for semantic similarity comparison
- Document collections being indexed for retrieval
- Query-document pairs in reranking workflows
Theoretical Basis
Line-based splitting provides a simple, universal protocol for specifying multiple embedding inputs. By convention, each line of the input prompt represents a separate text to embed. The separator is configurable (embd_sep parameter) to support alternative formats. This design allows piping multi-line input directly from files or other tools.
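The splitting convention described above can be sketched in a few lines of Python. This is an illustrative sketch, not the llama.cpp code: the function name `split_prompts` is hypothetical, and its `sep` argument only mirrors the role of the `embd_sep` parameter.

```python
def split_prompts(text: str, sep: str = "\n") -> list[str]:
    """Split one input string into individual prompts on a separator.

    Hypothetical helper for illustration; `sep` plays the role that the
    embd_sep parameter plays in the text above.
    """
    if not sep:
        # An empty separator has no sensible meaning here, so reject it early.
        raise ValueError("separator must be non-empty")
    # Every occurrence of the separator starts a new prompt.
    return text.split(sep)


prompts = split_prompts("first sentence\nsecond sentence")
custom = split_prompts("alpha<#sep#>beta", sep="<#sep#>")
```

In this sketch a trailing separator produces an empty final prompt, so a caller piping a file that ends in a newline may want to strip it first.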
Token-level validation prevents silent failures by checking each tokenized prompt against the batch size before processing begins. If any prompt exceeds the batch limit, the program terminates with an informative error rather than producing truncated or incorrect embeddings. This fail-fast approach is essential because a truncated prompt would still yield a vector, but that vector would represent an incomplete text and nothing downstream would reveal the error.
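A minimal sketch of the fail-fast check follows. It assumes a stand-in whitespace tokenizer (`toy_tokenize`) in place of a real model vocabulary, and the names `validate_prompts` and `n_batch` are chosen for illustration only.

```python
def toy_tokenize(text: str) -> list[str]:
    # Stand-in tokenizer (assumption): whitespace split, not a real vocabulary.
    return text.split()


def validate_prompts(prompts, n_batch, tokenize=toy_tokenize):
    """Tokenize every prompt and fail fast if any exceeds the batch size."""
    tokenized = []
    for i, prompt in enumerate(prompts):
        tokens = tokenize(prompt)
        if len(tokens) > n_batch:
            # Abort before any embedding is computed instead of truncating.
            raise ValueError(
                f"prompt {i} has {len(tokens)} tokens, "
                f"which exceeds the batch size of {n_batch}"
            )
        tokenized.append(tokens)
    return tokenized


batches = validate_prompts(["a short prompt", "another one"], n_batch=8)
```

Raising before any prompt is processed means that a single oversized input stops the whole run, which matches the all-or-nothing behavior described above.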
SEP/EOS token verification addresses a subtle correctness requirement. Many embedding models are trained with a specific end-of-sequence token that signals the boundary of the text to embed. If the tokenizer does not automatically append this token (controlled by tokenizer.ggml.add_eos_token in the GGUF metadata), the resulting embeddings may differ from what the model was trained to produce. The preparation step warns when this condition is detected.
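One way to picture this check is to inspect the last token of each tokenized prompt. The sketch below is an assumption about how such a check could look, not the actual implementation: the helper name `find_missing_end_token` and the token id `2` are hypothetical.

```python
import warnings


def find_missing_end_token(tokenized, end_token_id):
    """Return the indices of prompts that do not end with the expected token."""
    return [
        i
        for i, tokens in enumerate(tokenized)
        if not tokens or tokens[-1] != end_token_id
    ]


END_TOKEN_ID = 2  # hypothetical id; the real value comes from the model vocabulary

missing = find_missing_end_token([[5, 6, 2], [7, 8]], END_TOKEN_ID)
if missing:
    # Warn rather than abort: the embeddings are still computed, but they may
    # differ from what the model was trained to produce.
    warnings.warn(f"prompts {missing} do not end with the expected EOS/SEP token")
```

The sketch only warns, in line with the text above, because a missing end token degrades the result without making the computation impossible.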
Reranking pair construction handles the special case where two texts (query and document) must be processed together with specific separator tokens between them. This follows protocols like those used by cross-encoder models, where the relevance score depends on the joint representation of both texts. The implementation supports both template-based formatting (using {query} and {document} placeholders) and token-based formatting (inserting EOS/SEP tokens between segments).
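The two formatting modes can be sketched as follows. Both helpers are hypothetical and written for illustration: the template string is an example rather than a real model's rerank template, and which tokens go between the segments is model-specific, so the sketch takes them as a parameter instead of assuming a layout.

```python
import re


def format_pair_with_template(template: str, query: str, document: str) -> str:
    """Fill the {query} and {document} placeholders in a single pass."""
    values = {"query": query, "document": document}
    # A single regex pass means text inserted for one placeholder is never
    # rescanned for the other placeholder.
    return re.sub(r"\{(query|document)\}", lambda m: values[m.group(1)], template)


def join_pair_with_tokens(query_tokens, doc_tokens, between):
    """Concatenate two token sequences with separator tokens in between.

    `between` is the list of EOS/SEP token ids to insert (model-specific).
    """
    return list(query_tokens) + list(between) + list(doc_tokens)


text = format_pair_with_template(
    "Query: {query} Document: {document}", "what is a cat", "A cat is a mammal."
)
ids = join_pair_with_tokens([11, 12], [21, 22, 23], between=[2, 3])
```

In both modes the query and document end up in one sequence, which is what lets a cross-encoder score them jointly.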