Implementation:Ggml org Llama cpp Embedding Input Splitting

From Leeroopedia
Revision as of 12:39, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Ggml_org_Llama_cpp_Embedding_Input_Splitting.md)
Implementation Name: Embedding Input Splitting
Doc Type: Pattern Doc
Domain: Text Preprocessing, Input Parsing
Description: Input splitting and preprocessing pattern for embedding extraction: multi-line splitting, tokenization, and reranking pair formatting
Related Workflow: Embedding_Extraction

Overview

Description

The Embedding Input Splitting implementation documents how examples/embedding/embedding.cpp prepares input text for batch embedding extraction. A single input string is split into multiple prompts on a configurable separator (params.embd_sep). Under RANK pooling, any prompt that contains the classification separator (params.cls_sep) is first rebuilt as a query/document reranking pair. Each prompt is then tokenized and its token count is checked against the batch size (n_batch). Finally, the last token of every sequence is checked for SEP or EOS, and a warning is logged if neither is present.

Usage

# Multiple prompts separated by newlines (default separator)
./llama-embedding -m model.gguf -p "First sentence
Second sentence
Third sentence"

# Custom separator
./llama-embedding -m model.gguf --embd-separator "|||" -p "First sentence|||Second sentence"

# Reranking pairs separated by cls_sep (default: tab)
./llama-embedding -m reranker.gguf --pooling rank -p "query text\tdocument text"

Code Reference

Source Location (split_lines): examples/embedding/embedding.cpp:13-27
Source Location (tokenization loop): examples/embedding/embedding.cpp:168-207
Signature: static std::vector<std::string> split_lines(const std::string & s, const std::string & separator = "\n")
Import: Local static function in examples/embedding/embedding.cpp

split_lines function:

static std::vector<std::string> split_lines(const std::string & s, const std::string & separator = "\n") {
    std::vector<std::string> lines;
    size_t start = 0;
    size_t end = s.find(separator);

    while (end != std::string::npos) {
        lines.push_back(s.substr(start, end - start));
        start = end + separator.length();
        end = s.find(separator, start);
    }

    lines.push_back(s.substr(start)); // Add the last part

    return lines;
}
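
The function keeps empty segments, so a blank line or a trailing separator yields an empty prompt. The sketch below repeats split_lines so that it compiles on its own and adds split_nonempty, a hypothetical helper that is not part of llama.cpp, to show one way of filtering such segments. It also guards against an empty separator, for which the loop in split_lines would not terminate.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Same logic as split_lines above, repeated so this sketch is self-contained.
static std::vector<std::string> split_lines(const std::string & s, const std::string & separator = "\n") {
    std::vector<std::string> lines;
    size_t start = 0;
    size_t end = s.find(separator);
    while (end != std::string::npos) {
        lines.push_back(s.substr(start, end - start));
        start = end + separator.length();
        end = s.find(separator, start);
    }
    lines.push_back(s.substr(start)); // last part, possibly empty
    return lines;
}

// Hypothetical helper (an assumption, not in llama.cpp): split and drop empty segments.
static std::vector<std::string> split_nonempty(const std::string & s, const std::string & separator = "\n") {
    std::vector<std::string> out;
    if (separator.empty()) {
        // find("") always matches at `start`, so split_lines would loop forever here.
        if (!s.empty()) {
            out.push_back(s);
        }
        return out;
    }
    for (const auto & line : split_lines(s, separator)) {
        if (!line.empty()) {
            out.push_back(line);
        }
    }
    return out;
}
```

For example, split_lines("a\nb\n") returns three elements, the last of them empty, while split_nonempty("a\n\nb\n") returns only "a" and "b".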

Prompt splitting and tokenization:

// split the prompt into lines
std::vector<std::string> prompts = split_lines(params.prompt, params.embd_sep);

// get added sep and eos token, if any
const std::string added_sep_token = llama_vocab_get_add_sep(vocab)
    ? llama_vocab_get_text(vocab, llama_vocab_sep(vocab)) : "";
const std::string added_eos_token = llama_vocab_get_add_eos(vocab)
    ? llama_vocab_get_text(vocab, llama_vocab_eos(vocab)) : "";
const char * rerank_prompt = llama_model_chat_template(model, "rerank");

// tokenize the prompts and trim
std::vector<std::vector<int32_t>> inputs;
for (const auto & prompt : prompts) {
    std::vector<llama_token> inp;

    // split classification pairs and insert expected separator tokens
    if (pooling_type == LLAMA_POOLING_TYPE_RANK && prompt.find(params.cls_sep) != std::string::npos) {
        std::vector<std::string> pairs = split_lines(prompt, params.cls_sep);
        if (rerank_prompt != nullptr) {
            const std::string query = pairs[0];
            const std::string doc = pairs[1];
            std::string final_prompt = rerank_prompt;
            string_replace_all(final_prompt, "{query}"   , query);
            string_replace_all(final_prompt, "{document}", doc  );
            inp = common_tokenize(vocab, final_prompt, true, true);
        } else {
            std::string final_prompt;
            for (size_t i = 0; i < pairs.size(); i++) {
                final_prompt += pairs[i];
                if (i != pairs.size() - 1) {
                    if (!added_eos_token.empty()) final_prompt += added_eos_token;
                    if (!added_sep_token.empty()) final_prompt += added_sep_token;
                }
            }
            inp = common_tokenize(ctx, final_prompt, true, true);
        }
    } else {
        inp = common_tokenize(ctx, prompt, true, true);
    }

    if (inp.size() > n_batch) {
        LOG_ERR("%s: number of tokens in input line (%lld) exceeds batch size (%lld)\n",
                __func__, (long long int) inp.size(), (long long int) n_batch);
        return 1;
    }
    inputs.push_back(inp);
}
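
The pair-formatting branch can be exercised without a model. The following is a minimal sketch under stated assumptions: format_rerank_pair and replace_all are hypothetical stand-ins (the latter for string_replace_all), the template and token texts are passed in as plain strings rather than read from the model and vocabulary, and only two-part pairs are handled, whereas the fallback loop above joins any number of parts.

```cpp
#include <cstddef>
#include <string>

// Stand-in for string_replace_all: replace every occurrence of `from` with `to`.
static void replace_all(std::string & s, const std::string & from, const std::string & to) {
    if (from.empty()) {
        return;
    }
    size_t pos = 0;
    while ((pos = s.find(from, pos)) != std::string::npos) {
        s.replace(pos, from.length(), to);
        pos += to.length(); // continue after the inserted text
    }
}

// Hypothetical helper: build the text handed to the tokenizer for one pair.
// `tmpl` plays the role of the model's "rerank" template; empty means "no template".
// `eos_text` / `sep_text` are the token texts the vocab would add, or "" if it adds none.
static std::string format_rerank_pair(const std::string & query,
                                      const std::string & doc,
                                      const std::string & tmpl,
                                      const std::string & eos_text,
                                      const std::string & sep_text) {
    if (!tmpl.empty()) {
        std::string out = tmpl;
        replace_all(out, "{query}", query);
        replace_all(out, "{document}", doc);
        return out;
    }
    // Fallback: query, then EOS and SEP texts (if any), then document.
    return query + eos_text + sep_text + doc;
}
```

With an empty template and "</s>" for both token texts, the pair ("q", "d") becomes "q</s></s>d". If the vocabulary adds neither token, the two parts are concatenated with nothing between them, which matches the fallback loop above.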

EOS/SEP token verification:

// check if the last token is SEP/EOS
for (auto & inp : inputs) {
    if (inp.empty() || (inp.back() != llama_vocab_sep(vocab) && inp.back() != llama_vocab_eos(vocab))) {
        LOG_WRN("%s: last token in the prompt is not SEP or EOS\n", __func__);
        LOG_WRN("%s: 'tokenizer.ggml.add_eos_token' should be set to 'true' in the GGUF header\n", __func__);
    }
}
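
The warning condition can be isolated as a small predicate. This is a sketch only: ends_with_sep_or_eos and count_missing_terminator are hypothetical names, and the token ids are passed in as plain integers instead of being queried from the vocabulary.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: true if the sequence is non-empty and its last token is SEP or EOS.
static bool ends_with_sep_or_eos(const std::vector<int32_t> & tokens, int32_t sep_id, int32_t eos_id) {
    return !tokens.empty() && (tokens.back() == sep_id || tokens.back() == eos_id);
}

// Count the inputs that would trigger the warning in the loop above.
static size_t count_missing_terminator(const std::vector<std::vector<int32_t>> & inputs,
                                       int32_t sep_id, int32_t eos_id) {
    size_t n = 0;
    for (const auto & inp : inputs) {
        if (!ends_with_sep_or_eos(inp, sep_id, eos_id)) {
            n++;
        }
    }
    return n;
}
```

Note that the check only warns; the inputs are still processed.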

I/O Contract

Input: Single string (params.prompt) containing one or more texts separated by params.embd_sep (default: "\n")
Output: std::vector<std::vector<int32_t>>, one tokenized prompt per element, ready for batch processing
Preconditions: Model and vocabulary must be loaded; n_batch must be set
Error Handling: Returns error code 1 if any tokenized prompt exceeds n_batch; logs a warning (without aborting) if a tokenized prompt is empty or its last token is neither SEP nor EOS

Processing pipeline:

Step 1. Split input string by separator. Output: std::vector<std::string> of individual prompts
Step 2. Detect reranking pairs (if RANK pooling). Output: query/document pairs identified by cls_sep
Step 3. Apply rerank template or insert separator tokens. Output: formatted prompt string with special tokens
Step 4. Tokenize each prompt. Output: std::vector<int32_t> token sequences
Step 5. Validate token count vs. batch size. Output: error if any sequence exceeds limit
Step 6. Verify EOS/SEP token presence. Output: warning if missing
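
Steps 4 and 5 can be sketched without llama.cpp by swapping in a toy tokenizer. Everything here is an assumption for illustration: toy_tokenize (one token per whitespace-separated word, with the word length as id) stands in for common_tokenize, and tokenize_all is a hypothetical wrapper that returns false where the original loop returns 1.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using token_seq = std::vector<int32_t>;

// Toy tokenizer (assumption): one token per whitespace-separated word, id = word length.
static token_seq toy_tokenize(const std::string & text) {
    token_seq out;
    std::istringstream iss(text);
    std::string word;
    while (iss >> word) {
        out.push_back(static_cast<int32_t>(word.size()));
    }
    return out;
}

// Tokenize every prompt into `inputs`. Stops and returns false as soon as one
// sequence is longer than n_batch, mirroring the early `return 1` in the source.
static bool tokenize_all(const std::vector<std::string> & prompts,
                         size_t n_batch,
                         const std::function<token_seq(const std::string &)> & tokenize,
                         std::vector<token_seq> & inputs) {
    for (const auto & prompt : prompts) {
        token_seq inp = tokenize(prompt);
        if (inp.size() > n_batch) {
            return false;
        }
        inputs.push_back(std::move(inp));
    }
    return true;
}
```

Because the limit is applied per prompt rather than to the whole input, splitting a long text into more lines is one way to stay under n_batch; the other is to raise the batch size.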

Usage Examples

Simple multi-line embedding:

./llama-embedding -m bge-small.gguf -p "The cat sat on the mat
The dog ran in the park
Machine learning is fascinating"

This produces three separate embedding vectors, one per line.

Reranking with query-document pairs:

./llama-embedding -m reranker.gguf --pooling rank --cls-separator "\t" \
  -p "what is machine learning\tMachine learning is a subset of AI
what is machine learning\tThe weather today is sunny"

Each line contains one query-document pair separated by a tab. The reranker assigns a relevance score to each pair.

JSON output format:

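./llama-embedding -m model.gguf --embd-output-format json -p "Hello world\nGoodbye world"

Produces OpenAI-style JSON with one embedding array per prompt. The shell passes the \n inside the quotes through literally; it becomes a newline, and therefore a prompt boundary, only because llama.cpp processes escape sequences in the prompt by default.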
