Implementation:Ggml org Llama cpp Embedding Input Splitting
| Field | Value |
|---|---|
| Implementation Name | Embedding Input Splitting |
| Doc Type | Pattern Doc |
| Domain | Text Preprocessing, Input Parsing |
| Description | Input splitting and preprocessing pattern for embedding extraction: multi-line splitting, tokenization, and reranking pair formatting |
| Related Workflow | Embedding_Extraction |
Overview
Description
This page documents how examples/embedding/embedding.cpp prepares input text for batch embedding extraction. A single input string is split into multiple prompts using a configurable separator. Each prompt is then tokenized, its token count is checked against the batch size, and its last token is checked for EOS/SEP. When rank pooling is active, prompts that contain the classification separator are treated as query/document pairs and formatted before tokenization.
Usage
# Multiple prompts separated by newlines (default separator)
./llama-embedding -m model.gguf -p "First sentence
Second sentence
Third sentence"
# Custom separator
./llama-embedding -m model.gguf --embd-separator "|||" -p "First sentence|||Second sentence"
# Reranking: query and document separated by cls_sep (default: tab)
./llama-embedding -m reranker.gguf --reranking -p "query text\tdocument text"
Code Reference
| Field | Value |
|---|---|
| Source Location (split_lines) | examples/embedding/embedding.cpp:13-27 |
| Source Location (tokenization loop) | examples/embedding/embedding.cpp:168-207 |
| Signature | static std::vector<std::string> split_lines(const std::string & s, const std::string & separator = "\n") |
| Import | Local static function in examples/embedding/embedding.cpp |
split_lines function:
static std::vector<std::string> split_lines(const std::string & s, const std::string & separator = "\n") {
    std::vector<std::string> lines;
    size_t start = 0;
    size_t end = s.find(separator);
    while (end != std::string::npos) {
        lines.push_back(s.substr(start, end - start));
        start = end + separator.length();
        end = s.find(separator, start);
    }
    lines.push_back(s.substr(start)); // Add the last part
    return lines;
}
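The edge cases of this splitting logic are easy to miss, so a small check may help. The sketch below copies the function so that it compiles on its own and adds a hypothetical `check_split_edge_cases` helper (an illustration, not part of embedding.cpp) that records what the loop does with empty input, a trailing separator, and a multi-character separator.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Copy of the split_lines logic from this page, repeated so the block is standalone.
static std::vector<std::string> split_lines(const std::string & s, const std::string & separator = "\n") {
    std::vector<std::string> lines;
    size_t start = 0;
    size_t end = s.find(separator);
    while (end != std::string::npos) {
        lines.push_back(s.substr(start, end - start));
        start = end + separator.length();
        end = s.find(separator, start);
    }
    lines.push_back(s.substr(start)); // the remainder after the last separator
    return lines;
}

// Hypothetical helper for illustration only (not in embedding.cpp).
static void check_split_edge_cases() {
    // An empty input still yields one (empty) prompt.
    assert(split_lines("").size() == 1);

    // A trailing separator yields a trailing empty prompt.
    const std::vector<std::string> trailing = split_lines("a\n");
    assert(trailing.size() == 2 && trailing[1].empty());

    // Multi-character separators are matched as a whole.
    const std::vector<std::string> custom = split_lines("x|||y|||z", "|||");
    assert(custom.size() == 3 && custom[1] == "y");

    // Note: an empty separator is not exercised here. With this logic,
    // find("") keeps matching at the same position, so the loop would not end.
}
```

Calling `check_split_edge_cases()` from a test harness exercises all three cases. As written, the logic passes empty prompts through unchanged, so an input that ends with the separator produces one extra empty prompt for the later steps.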
Prompt splitting and tokenization:
// split the prompt into lines
std::vector<std::string> prompts = split_lines(params.prompt, params.embd_sep);

// get added sep and eos token, if any
const std::string added_sep_token = llama_vocab_get_add_sep(vocab)
    ? llama_vocab_get_text(vocab, llama_vocab_sep(vocab)) : "";
const std::string added_eos_token = llama_vocab_get_add_eos(vocab)
    ? llama_vocab_get_text(vocab, llama_vocab_eos(vocab)) : "";
const char * rerank_prompt = llama_model_chat_template(model, "rerank");

// tokenize the prompts and trim
std::vector<std::vector<int32_t>> inputs;
for (const auto & prompt : prompts) {
    std::vector<llama_token> inp;

    // split classification pairs and insert expected separator tokens
    if (pooling_type == LLAMA_POOLING_TYPE_RANK && prompt.find(params.cls_sep) != std::string::npos) {
        std::vector<std::string> pairs = split_lines(prompt, params.cls_sep);
        if (rerank_prompt != nullptr) {
            const std::string query = pairs[0];
            const std::string doc = pairs[1];
            std::string final_prompt = rerank_prompt;
            string_replace_all(final_prompt, "{query}" , query);
            string_replace_all(final_prompt, "{document}", doc );
            inp = common_tokenize(vocab, final_prompt, true, true);
        } else {
            std::string final_prompt;
            for (size_t i = 0; i < pairs.size(); i++) {
                final_prompt += pairs[i];
                if (i != pairs.size() - 1) {
                    if (!added_eos_token.empty()) final_prompt += added_eos_token;
                    if (!added_sep_token.empty()) final_prompt += added_sep_token;
                }
            }
            inp = common_tokenize(ctx, final_prompt, true, true);
        }
    } else {
        inp = common_tokenize(ctx, prompt, true, true);
    }
    if (inp.size() > n_batch) {
        LOG_ERR("%s: number of tokens in input line (%lld) exceeds batch size (%lld)\n",
                __func__, (long long int) inp.size(), (long long int) n_batch);
        return 1;
    }
    inputs.push_back(inp);
}
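The two reranking branches are easier to follow in isolation from the tokenizer. The sketch below is a minimal illustration under stated assumptions: `replace_all` is a hypothetical stand-in for `string_replace_all`, `format_rerank_pair` is a hypothetical name, and the template and token strings in the usage note are made-up values rather than anything read from a real model.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for string_replace_all from llama.cpp's common library.
static void replace_all(std::string & s, const std::string & from, const std::string & to) {
    if (from.empty()) {
        return;
    }
    size_t pos = 0;
    while ((pos = s.find(from, pos)) != std::string::npos) {
        s.replace(pos, from.length(), to);
        pos += to.length(); // continue after the inserted text
    }
}

// Builds the text that would be tokenized for one query/document pair.
// tmpl     : rerank template text, or empty when the model provides none
// eos_text : text of the EOS token, or empty when the vocab does not add it
// sep_text : text of the SEP token, or empty when the vocab does not add it
static std::string format_rerank_pair(const std::vector<std::string> & pairs,
                                      const std::string & tmpl,
                                      const std::string & eos_text,
                                      const std::string & sep_text) {
    if (!tmpl.empty() && pairs.size() >= 2) {
        // Template branch: substitute the two placeholders.
        std::string out = tmpl;
        replace_all(out, "{query}", pairs[0]);
        replace_all(out, "{document}", pairs[1]);
        return out;
    }
    // Fallback branch: join the parts, inserting EOS and SEP text between them.
    std::string out;
    for (size_t i = 0; i < pairs.size(); i++) {
        out += pairs[i];
        if (i + 1 != pairs.size()) {
            out += eos_text;
            out += sep_text;
        }
    }
    return out;
}
```

With an empty template, `format_rerank_pair({"q", "d"}, "", "</s>", "</s>")` returns `q</s></s>d`. With a template such as `"Q: {query} D: {document}"` it returns `Q: q D: d`. Tokens are only inserted between parts, never after the last one; any leading or trailing special tokens are left to the tokenizer call that follows.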
EOS/SEP token verification:
// check if the last token is SEP/EOS
for (auto & inp : inputs) {
    if (inp.empty() || (inp.back() != llama_vocab_sep(vocab) && inp.back() != llama_vocab_eos(vocab))) {
        LOG_WRN("%s: last token in the prompt is not SEP or EOS\n", __func__);
        LOG_WRN("%s: 'tokenizer.ggml.add_eos_token' should be set to 'true' in the GGUF header\n", __func__);
    }
}
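The condition in this loop can be read as a single predicate. The sketch below restates it with plain integer ids so it can be tested without a model. The function name `ends_with_sep_or_eos` is hypothetical, and the ids in the usage note are placeholders rather than values from any real vocabulary.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns true when the sequence is non-empty and its last token is SEP or EOS.
// The warning in the loop above fires exactly when this returns false.
static bool ends_with_sep_or_eos(const std::vector<int32_t> & inp, int32_t sep_id, int32_t eos_id) {
    return !inp.empty() && (inp.back() == sep_id || inp.back() == eos_id);
}
```

For example, with placeholder ids `sep_id = 102` and `eos_id = 2`, the sequence `{101, 7, 102}` passes, while `{101, 7}` and an empty sequence both fail. The check only warns; the prompt is still kept and processed.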
I/O Contract
| Direction | Description |
|---|---|
| Input | Single string (params.prompt) containing one or more texts separated by params.embd_sep (default: "\n") |
| Output | std::vector<std::vector<int32_t>> -- a vector of tokenized prompts ready for batch processing |
| Preconditions | Model and vocabulary must be loaded; n_batch must be set |
| Error Handling | Returns error code 1 if any tokenized prompt exceeds n_batch; logs a warning if a tokenized prompt is empty or does not end with SEP or EOS |
Processing pipeline:
| Step | Operation | Output |
|---|---|---|
| 1 | Split input string by separator | std::vector<std::string> of individual prompts |
| 2 | Detect reranking pairs (if RANK pooling) | Query/document pairs identified by cls_sep |
| 3 | Apply rerank template or insert separator tokens | Formatted prompt string with special tokens |
| 4 | Tokenize each prompt | std::vector<int32_t> token sequences |
| 5 | Validate token count vs. batch size | Error if any sequence exceeds limit |
| 6 | Verify EOS/SEP token presence | Warning if missing |
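Steps 1, 4, and 5 of the pipeline can be sketched end to end without a model. This is an illustration under loud assumptions: `toy_tokenize` is a made-up whitespace tokenizer standing in for `common_tokenize`, `TOY_EOS` is a placeholder id, and `split_by` and `prepare_inputs` are hypothetical names. Only the control flow mirrors the loop documented on this page.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Same splitting logic as split_lines, repeated so the block is standalone.
static std::vector<std::string> split_by(const std::string & s, const std::string & sep) {
    std::vector<std::string> parts;
    size_t start = 0;
    size_t end = s.find(sep);
    while (end != std::string::npos) {
        parts.push_back(s.substr(start, end - start));
        start = end + sep.length();
        end = s.find(sep, start);
    }
    parts.push_back(s.substr(start));
    return parts;
}

// Assumption: placeholder EOS id, not taken from any real vocabulary.
static const int32_t TOY_EOS = 2;

// Toy tokenizer (assumption): one token per whitespace-separated word,
// followed by a trailing EOS id. Real tokenization is done by common_tokenize.
static std::vector<int32_t> toy_tokenize(const std::string & text) {
    std::vector<int32_t> out;
    std::istringstream iss(text);
    std::string word;
    while (iss >> word) {
        out.push_back(100 + (int32_t) word.size()); // placeholder id
    }
    out.push_back(TOY_EOS);
    return out;
}

// Splits, tokenizes, and validates. Returns false as soon as one prompt
// exceeds n_batch, mirroring the early "return 1" in the documented loop.
static bool prepare_inputs(const std::string & prompt, const std::string & sep, size_t n_batch,
                           std::vector<std::vector<int32_t>> & inputs) {
    inputs.clear();
    for (const auto & line : split_by(prompt, sep)) {
        std::vector<int32_t> inp = toy_tokenize(line);
        if (inp.size() > n_batch) {
            return false;
        }
        inputs.push_back(inp);
    }
    return true;
}
```

With `n_batch = 8`, the prompt `"a b\nc"` yields two sequences of 3 and 2 tokens. With `n_batch = 3`, the prompt `"a b c d"` is rejected because its 5 tokens exceed the limit. The limit applies to each prompt separately, not to the total across prompts.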
Usage Examples
Simple multi-line embedding:
./llama-embedding -m bge-small.gguf -p "The cat sat on the mat
The dog ran in the park
Machine learning is fascinating"
This produces three separate embedding vectors, one per line.
Reranking with query-document pairs:
./llama-embedding -m reranker.gguf --reranking --cls-separator "\t" \
-p "what is machine learning\tMachine learning is a subset of AI
what is machine learning\tThe weather today is sunny"
Each line contains one query-document pair separated by a tab. The reranker outputs a relevance score for each pair.
JSON output format:
./llama-embedding -m model.gguf --embd-output-format json -p "Hello world\nGoodbye world"
Produces OpenAI-style JSON containing one embedding array per prompt.