Implementation:Ggml org Llama cpp Embedding Input Splitting

Field	Value
Implementation Name	Embedding Input Splitting
Doc Type	Pattern Doc
Domain	Text Preprocessing, Input Parsing
Description	Input splitting and preprocessing pattern for embedding extraction: multi-line splitting, tokenization, and reranking pair formatting
Related Workflow	Embedding_Extraction

Overview

Description

This page documents how examples/embedding/embedding.cpp prepares input text for batch embedding extraction. A single input string is split into multiple prompts on a configurable separator, and each prompt is tokenized and its token count checked against the batch size. The last token of every sequence is then checked for EOS/SEP. Under rank pooling, a prompt that contains the classification separator is first split into a query/document pair and reassembled into a single reranking prompt before tokenization.

Usage

# Multiple prompts separated by newlines (default separator)
./llama-embedding -m model.gguf -p "First sentence
Second sentence
Third sentence"

# Custom separator
./llama-embedding -m model.gguf --embd-separator "|||" -p "First sentence|||Second sentence"

# Reranking pairs separated by cls_sep
./llama-embedding -m reranker.gguf --pooling rank -p "query text\tdocument text"

Code Reference

Field	Value
Source Location (split_lines)	`examples/embedding/embedding.cpp:13-27`
Source Location (tokenization loop)	`examples/embedding/embedding.cpp:168-207`
Signature	`static std::vector<std::string> split_lines(const std::string & s, const std::string & separator = "\n")`
Import	Local static function in `examples/embedding/embedding.cpp`

split_lines function:

static std::vector<std::string> split_lines(const std::string & s, const std::string & separator = "\n") {
    std::vector<std::string> lines;
    size_t start = 0;
    size_t end = s.find(separator);

    while (end != std::string::npos) {
        lines.push_back(s.substr(start, end - start));
        start = end + separator.length();
        end = s.find(separator, start);
    }

    lines.push_back(s.substr(start)); // Add the last part

    return lines;
}
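The edge cases of this function are easier to see in isolation. The sketch below repeats the same logic under the hypothetical name `split_lines_checked` and adds one thing that is not in embedding.cpp: a guard against an empty separator. `std::string::find("")` matches at the current position, so without the guard `start` would never advance and the loop would not terminate.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Same splitting logic as split_lines above. The empty-separator assert is an
// addition for this sketch and is NOT part of embedding.cpp.
static std::vector<std::string> split_lines_checked(const std::string & s,
                                                    const std::string & separator = "\n") {
    assert(!separator.empty() && "an empty separator would never advance the search");

    std::vector<std::string> lines;
    size_t start = 0;
    size_t end = s.find(separator);

    while (end != std::string::npos) {
        lines.push_back(s.substr(start, end - start));
        start = end + separator.length();
        end = s.find(separator, start);
    }

    // Whatever follows the last separator is always added, even if it is empty.
    lines.push_back(s.substr(start));
    return lines;
}
```

Because the tail is always appended, an input with a trailing separator such as "a\nb\n" yields three elements, the last one empty, and an empty input yields a single empty string. An empty prompt later tokenizes to few or no tokens, which is one way to reach the `inp.empty()` branch of the EOS/SEP check.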

Prompt splitting and tokenization:

// split the prompt into lines
std::vector<std::string> prompts = split_lines(params.prompt, params.embd_sep);

// get added sep and eos token, if any
const std::string added_sep_token = llama_vocab_get_add_sep(vocab)
    ? llama_vocab_get_text(vocab, llama_vocab_sep(vocab)) : "";
const std::string added_eos_token = llama_vocab_get_add_eos(vocab)
    ? llama_vocab_get_text(vocab, llama_vocab_eos(vocab)) : "";
const char * rerank_prompt = llama_model_chat_template(model, "rerank");

// tokenize the prompts and trim
std::vector<std::vector<int32_t>> inputs;
for (const auto & prompt : prompts) {
    std::vector<llama_token> inp;

    // split classification pairs and insert expected separator tokens
    if (pooling_type == LLAMA_POOLING_TYPE_RANK && prompt.find(params.cls_sep) != std::string::npos) {
        std::vector<std::string> pairs = split_lines(prompt, params.cls_sep);
        if (rerank_prompt != nullptr) {
            const std::string query = pairs[0];
            const std::string doc = pairs[1];
            std::string final_prompt = rerank_prompt;
            string_replace_all(final_prompt, "{query}"   , query);
            string_replace_all(final_prompt, "{document}", doc  );
            inp = common_tokenize(vocab, final_prompt, true, true);
        } else {
            std::string final_prompt;
            for (size_t i = 0; i < pairs.size(); i++) {
                final_prompt += pairs[i];
                if (i != pairs.size() - 1) {
                    if (!added_eos_token.empty()) final_prompt += added_eos_token;
                    if (!added_sep_token.empty()) final_prompt += added_sep_token;
                }
            }
            inp = common_tokenize(ctx, final_prompt, true, true);
        }
    } else {
        inp = common_tokenize(ctx, prompt, true, true);
    }

    if (inp.size() > n_batch) {
        LOG_ERR("%s: number of tokens in input line (%lld) exceeds batch size (%lld)\n",
                __func__, (long long int) inp.size(), (long long int) n_batch);
        return 1;
    }
    inputs.push_back(inp);
}
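The two reranking branches differ only in how the text handed to the tokenizer is assembled. The following is a minimal sketch of that string assembly with no llama.cpp dependency. `replace_all` and `format_rerank_text` are hypothetical names introduced here (`replace_all` stands in for `string_replace_all`), and an empty `tmpl` stands in for `rerank_prompt == nullptr`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for string_replace_all: replaces every occurrence in place.
static void replace_all(std::string & s, const std::string & search, const std::string & replace) {
    if (search.empty()) {
        return;
    }
    size_t pos = 0;
    while ((pos = s.find(search, pos)) != std::string::npos) {
        s.replace(pos, search.length(), replace);
        pos += replace.length(); // continue after the inserted text
    }
}

// Builds the text that would be tokenized for one reranking prompt.
// tmpl: rerank template, empty if the model has none (assumption of this sketch).
// eos, sep: text of the EOS/SEP tokens the vocabulary adds, empty if not added.
static std::string format_rerank_text(const std::vector<std::string> & pairs,
                                      const std::string & tmpl,
                                      const std::string & eos,
                                      const std::string & sep) {
    if (!tmpl.empty()) {
        // This path is only reached after cls_sep was found, so two parts exist.
        assert(pairs.size() >= 2);
        std::string out = tmpl;
        replace_all(out, "{query}", pairs[0]);
        replace_all(out, "{document}", pairs[1]);
        return out;
    }

    // No template: join the parts with EOS then SEP between them (not after the last).
    std::string out;
    for (size_t i = 0; i < pairs.size(); i++) {
        out += pairs[i];
        if (i + 1 != pairs.size()) {
            out += eos;
            out += sep;
        }
    }
    return out;
}
```

With a vocabulary whose EOS and SEP both render as the made-up string "</s>", the pair {"q", "d"} becomes "q</s></s>d" on the fallback path. Note that the template path uses only the first two parts, while the fallback path joins every part produced by the split.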

EOS/SEP token verification:

// check if the last token is SEP/EOS
for (auto & inp : inputs) {
    if (inp.empty() || (inp.back() != llama_vocab_sep(vocab) && inp.back() != llama_vocab_eos(vocab))) {
        LOG_WRN("%s: last token in the prompt is not SEP or EOS\n", __func__);
        LOG_WRN("%s: 'tokenizer.ggml.add_eos_token' should be set to 'true' in the GGUF header\n", __func__);
    }
}
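The warning condition reads more easily when inverted into a predicate. The helper below is a sketch only: `ends_with_sep_or_eos` is a hypothetical name, and the token ids are passed in as plain integers rather than looked up through `llama_vocab_sep` and `llama_vocab_eos`.

```cpp
#include <cstdint>
#include <vector>

// True when the sequence is non-empty and its last token is SEP or EOS.
// This is the logical negation of the condition that triggers the warning above.
static bool ends_with_sep_or_eos(const std::vector<int32_t> & tokens,
                                 int32_t sep_id, int32_t eos_id) {
    return !tokens.empty() && (tokens.back() == sep_id || tokens.back() == eos_id);
}
```

The check only warns and does not stop processing, so a model whose tokenizer does not append either token still produces embeddings.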

I/O Contract

Direction	Description
Input	Single string (`params.prompt`) containing one or more texts separated by `params.embd_sep` (default: `"\n"`)
Output	`std::vector<std::vector<int32_t>>` -- a vector of tokenized prompts ready for batch processing
Preconditions	Model and vocabulary must be loaded; `n_batch` must be set
Error Handling	Logs an error and returns 1 as soon as any tokenized prompt exceeds `n_batch`; logs a warning, without aborting, if a tokenized prompt is empty or does not end in SEP or EOS

Processing pipeline:

Step	Operation	Output
1	Split input string by separator	`std::vector<std::string>` of individual prompts
2	Detect reranking pairs (if RANK pooling)	Query/document pairs identified by `cls_sep`
3	Apply rerank template or insert separator tokens	Formatted prompt string with special tokens
4	Tokenize each prompt	`std::vector<int32_t>` token sequences
5	Validate token count vs. batch size	Error if any sequence exceeds limit
6	Verify EOS/SEP token presence	Warning if missing
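Steps 1, 4 and 5 of the table can be sketched end to end without loading a model. Everything named here is hypothetical: `toy_tokenize` is a whitespace tokenizer standing in for `common_tokenize`, and `prepare_inputs` returns `false` where the real code returns 1. The non-empty separator requirement is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the real tokenizer: one id per whitespace-separated word.
static std::vector<int32_t> toy_tokenize(const std::string & text) {
    std::vector<int32_t> ids;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        ids.push_back(static_cast<int32_t>(word.size())); // toy id: the word length
    }
    return ids;
}

// Split on sep, tokenize each piece, and stop at the first piece longer than n_batch.
// Pieces accepted before the failure remain in `inputs`.
static bool prepare_inputs(const std::string & prompt, const std::string & sep, size_t n_batch,
                           std::vector<std::vector<int32_t>> & inputs) {
    assert(!sep.empty()); // assumption of this sketch: the separator is never empty

    size_t start = 0;
    while (true) {
        const size_t end = prompt.find(sep, start);
        const size_t len = (end == std::string::npos) ? std::string::npos : end - start;

        std::vector<int32_t> inp = toy_tokenize(prompt.substr(start, len));
        if (inp.size() > n_batch) {
            return false; // mirrors the early `return 1` in the real loop
        }
        inputs.push_back(inp);

        if (end == std::string::npos) {
            break;
        }
        start = end + sep.length();
    }
    return true;
}
```

The limit applies to each prompt separately, not to the sum: with the input "a b\nc d e" and a limit of 3 both prompts are accepted, while a limit of 2 rejects the second one after the first has already been stored.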

Usage Examples

Simple multi-line embedding:

./llama-embedding -m bge-small.gguf -p "The cat sat on the mat
The dog ran in the park
Machine learning is fascinating"

This produces three separate embedding vectors, one per line.

Reranking with query-document pairs:

./llama-embedding -m reranker.gguf --pooling rank --cls-separator "\t" \
  -p "what is machine learning\tMachine learning is a subset of AI
what is machine learning\tThe weather today is sunny"

Each line contains one query-document pair separated by a tab. The reranker assigns a relevance score to each pair.

JSON output format:

./llama-embedding -m model.gguf --embd-output-format json -p "Hello world\nGoodbye world"

Produces OpenAI-compatible JSON with embedding arrays.

Related Pages

Principle:Ggml_org_Llama_cpp_Input_Text_Preparation
