Principle:Ggml org Llama cpp Input Text Preparation

Field	Value
Principle Name	Input Text Preparation
Domain	Text Preprocessing, Batch Computation
Description	Theory of preparing and splitting input texts for batch embedding computation
Related Workflow	Embedding_Extraction

Overview

Description

The Input Text Preparation principle covers the theoretical basis for transforming raw input text into a format suitable for batch embedding extraction. Before text can be tokenized and processed by the model, it must be split into individual prompts, validated against batch size constraints, and optionally paired with separator tokens for classification or reranking tasks.

The preparation pipeline addresses:

Multi-prompt splitting: Dividing a single input string into multiple prompts using a configurable separator (default: newline).
Tokenization and validation: Converting text prompts to token sequences and verifying that each sequence fits within the batch size limit.
Special token handling: Ensuring EOS/SEP tokens are correctly appended as required by the embedding model's training protocol.
Reranking pair formatting: For classification/reranking tasks, splitting query-document pairs and inserting the appropriate separator tokens or applying rerank prompt templates.

Usage

Input text preparation is applied whenever embedding extraction processes multiple texts. It is the bridge between user-provided strings and the tokenized batches the model consumes. The preparation step is critical for:

Batch processing of multiple sentences for semantic similarity comparison
Document collections being indexed for retrieval
Query-document pairs in reranking workflows

Theoretical Basis

Line-based splitting provides a simple, universal protocol for specifying multiple embedding inputs. By convention, each line of the input prompt represents a separate text to embed. The separator is configurable (embd_sep parameter) to support alternative formats. This design allows piping multi-line input directly from files or other tools.
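
The splitting convention can be illustrated with a minimal Python sketch. The function name `split_prompts` is hypothetical, and keeping empty pieces (for example from a trailing newline) is an assumption of this sketch rather than something the text above specifies.

```python
def split_prompts(text: str, separator: str = "\n") -> list[str]:
    """Split one input string into individual prompts.

    `separator` plays the role of the configurable embd_sep parameter;
    the default mirrors the newline convention described above.
    """
    if not separator:
        raise ValueError("separator must be a non-empty string")
    # str.split keeps empty pieces; whether to drop them is a policy choice
    # left to the caller in this sketch.
    return text.split(separator)


prompts = split_prompts("first sentence\nsecond sentence\nthird sentence")
custom = split_prompts("alpha<#sep#>beta", separator="<#sep#>")
```

With the default separator, each line becomes one prompt; a custom separator allows a single prompt to contain newlines.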

Token-level validation prevents silent failures by checking each tokenized prompt against the batch size before processing begins. If any prompt exceeds the batch limit, the program terminates with an informative error rather than producing truncated or incorrect embeddings. This fail-fast approach matters because a truncated prompt would still yield a well-formed vector, but one that represents only part of the text, so the error would go unnoticed downstream.
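
A fail-fast length check of this kind might look like the following sketch. It assumes prompts have already been tokenized into lists of integer IDs; `validate_prompt_lengths` and `n_batch` are illustrative names, and raising an exception stands in for terminating the program.

```python
def validate_prompt_lengths(tokenized: list[list[int]], n_batch: int) -> None:
    """Raise before any computation if a prompt cannot fit in one batch."""
    for index, tokens in enumerate(tokenized):
        if len(tokens) > n_batch:
            # Report which prompt failed and by how much, instead of truncating.
            raise ValueError(
                f"prompt {index} has {len(tokens)} tokens, "
                f"which exceeds the batch size of {n_batch}; "
                "increase the batch size and re-run"
            )


# Both prompts fit within a batch of 4 tokens, so this returns silently.
validate_prompt_lengths([[1, 2, 3], [4, 5]], n_batch=4)
```

Running the check over all prompts before the first batch is built means no partial results are produced when one input is too long.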

SEP/EOS token verification addresses a subtle correctness requirement. Many embedding models are trained with a specific end-of-sequence token that signals the boundary of the text to embed. If the tokenizer does not automatically append this token (controlled by tokenizer.ggml.add_eos_token in the GGUF metadata), the resulting embeddings may differ from what the model was trained to produce. The preparation step warns when this condition is detected.
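
The check itself is simple: inspect the last token of each sequence. The sketch below is an illustration under assumptions, not the project's code; `check_final_token` and `expected_id` are hypothetical names, and the warning text is paraphrased.

```python
import warnings


def check_final_token(tokens: list[int], expected_id: int) -> bool:
    """Return True if the sequence ends with expected_id; warn otherwise."""
    if tokens and tokens[-1] == expected_id:
        return True
    # Warn rather than fail: the embedding can still be computed, but it may
    # differ from what the model was trained to produce.
    warnings.warn(
        "last token in the prompt is not the expected EOS/SEP token; "
        "check tokenizer.ggml.add_eos_token in the GGUF metadata",
        stacklevel=2,
    )
    return False
```

A warning rather than an error fits the situation described above, since some models legitimately do not use a trailing token.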

Reranking pair construction handles the special case where two texts (query and document) must be processed together with specific separator tokens between them. This follows protocols like those used by cross-encoder models, where the relevance score depends on the joint representation of both texts. The implementation supports both template-based formatting (using {query} and {document} placeholders) and token-based formatting (inserting EOS/SEP tokens between segments).
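
Both formatting styles can be sketched in a few lines of Python. All function names here are hypothetical, the tab default for `pair_sep` is an assumption of this sketch, and the exact boundary tokens placed between segments depend on the model, so they are passed in by the caller.

```python
import re


def split_pair(prompt: str, pair_sep: str = "\t") -> tuple[str, str]:
    """Split one input into (query, document) at the first pair separator."""
    query, found, document = prompt.partition(pair_sep)
    if not found:
        raise ValueError("expected a query and a document separated by pair_sep")
    return query, document


def format_with_template(template: str, query: str, document: str) -> str:
    """Fill {query} and {document} placeholders in a single pass.

    A single regex pass avoids re-substituting placeholder-like text that
    happens to appear inside the query or the document.
    """
    values = {"{query}": query, "{document}": document}
    return re.sub(r"\{query\}|\{document\}", lambda m: values[m.group(0)], template)


def join_token_segments(
    query_tokens: list[int],
    document_tokens: list[int],
    boundary_tokens: list[int],
) -> list[int]:
    """Token-based formatting: place boundary tokens (e.g. EOS/SEP IDs) between segments."""
    return [*query_tokens, *boundary_tokens, *document_tokens]


query, document = split_pair("what is ggml?\tggml is a tensor library")
text = format_with_template("Query: {query} Document: {document}", query, document)
```

The template path produces a single string that is tokenized afterwards, while the token path tokenizes each segment separately and joins the resulting ID lists.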

Related Pages

Implementation:Ggml_org_Llama_cpp_Embedding_Input_Splitting
