Implementation:Ggml org Llama cpp Embedding Input Splitting

From Leeroopedia
Implementation Name: Embedding Input Splitting
Doc Type: Pattern Doc
Domain: Text Preprocessing, Input Parsing
Description: Input splitting and preprocessing pattern for embedding extraction: multi-line splitting, tokenization, and reranking pair formatting
Related Workflow: Embedding_Extraction

Overview

Description

The Embedding Input Splitting implementation documents how examples/embedding/embedding.cpp prepares input text for batch embedding extraction. A single input string is split into multiple prompts using a configurable separator. Each prompt is tokenized and its token count is checked against the batch size (n_batch). Each token sequence is also checked to end with an EOS or SEP token. When the model uses rank pooling, prompts that contain the classification separator are formatted as query/document pairs for reranking.

Usage

# Multiple prompts separated by newlines (default separator)
./llama-embedding -m model.gguf -p "First sentence
Second sentence
Third sentence"

# Custom separator
./llama-embedding -m model.gguf --embd-separator "|||" -p "First sentence|||Second sentence"

# Reranking pairs separated by the cls separator (default: tab; llama.cpp expands the \t escape)
./llama-embedding -m reranker.gguf --pooling rank -p "query text\tdocument text"

Code Reference

Source Location (split_lines): examples/embedding/embedding.cpp:13-27
Source Location (tokenization loop): examples/embedding/embedding.cpp:168-207
Signature: static std::vector<std::string> split_lines(const std::string & s, const std::string & separator = "\n")
Import: Local static function in examples/embedding/embedding.cpp

split_lines function:

static std::vector<std::string> split_lines(const std::string & s, const std::string & separator = "\n") {
    std::vector<std::string> lines;
    size_t start = 0;
    size_t end = s.find(separator);

    while (end != std::string::npos) {
        lines.push_back(s.substr(start, end - start));
        start = end + separator.length();
        end = s.find(separator, start);
    }

    lines.push_back(s.substr(start)); // Add the last part

    return lines;
}
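split_lines keeps empty segments. For example, "a\n\nb" yields three prompts, and a trailing separator yields an empty final prompt. It also never terminates when the separator is empty, because s.find("", start) always returns start. The sketch below shows a guarded variant that keeps the same splitting semantics. The name split_lines_checked and the choice to throw std::invalid_argument are assumptions for illustration; neither is part of llama.cpp.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical guarded variant of split_lines (not part of llama.cpp).
// Splitting behaviour is unchanged: empty segments are kept, so a trailing
// separator produces a trailing empty string.
static std::vector<std::string> split_lines_checked(const std::string & s,
                                                    const std::string & separator = "\n") {
    if (separator.empty()) {
        // With an empty separator the original loop never advances past position 0.
        throw std::invalid_argument("separator must not be empty");
    }
    std::vector<std::string> lines;
    size_t start = 0;
    size_t end   = s.find(separator);
    while (end != std::string::npos) {
        lines.push_back(s.substr(start, end - start));
        start = end + separator.length();
        end   = s.find(separator, start);
    }
    lines.push_back(s.substr(start)); // remainder after the last separator
    return lines;
}
```

For example, split_lines_checked("First sentence|||Second sentence", "|||") returns two prompts, matching the custom-separator usage shown earlier.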

Prompt splitting and tokenization:

// split the prompt into lines
std::vector<std::string> prompts = split_lines(params.prompt, params.embd_sep);

// get added sep and eos token, if any
const std::string added_sep_token = llama_vocab_get_add_sep(vocab)
    ? llama_vocab_get_text(vocab, llama_vocab_sep(vocab)) : "";
const std::string added_eos_token = llama_vocab_get_add_eos(vocab)
    ? llama_vocab_get_text(vocab, llama_vocab_eos(vocab)) : "";
const char * rerank_prompt = llama_model_chat_template(model, "rerank");

// tokenize the prompts and trim
std::vector<std::vector<int32_t>> inputs;
for (const auto & prompt : prompts) {
    std::vector<llama_token> inp;

    // split classification pairs and insert expected separator tokens
    if (pooling_type == LLAMA_POOLING_TYPE_RANK && prompt.find(params.cls_sep) != std::string::npos) {
        std::vector<std::string> pairs = split_lines(prompt, params.cls_sep);
        if (rerank_prompt != nullptr) {
            const std::string query = pairs[0];
            const std::string doc = pairs[1];
            std::string final_prompt = rerank_prompt;
            string_replace_all(final_prompt, "{query}"   , query);
            string_replace_all(final_prompt, "{document}", doc  );
            inp = common_tokenize(vocab, final_prompt, true, true);
        } else {
            std::string final_prompt;
            for (size_t i = 0; i < pairs.size(); i++) {
                final_prompt += pairs[i];
                if (i != pairs.size() - 1) {
                    if (!added_eos_token.empty()) final_prompt += added_eos_token;
                    if (!added_sep_token.empty()) final_prompt += added_sep_token;
                }
            }
            inp = common_tokenize(ctx, final_prompt, true, true);
        }
    } else {
        inp = common_tokenize(ctx, prompt, true, true);
    }

    if (inp.size() > n_batch) {
        LOG_ERR("%s: number of tokens in input line (%lld) exceeds batch size (%lld)\n",
                __func__, (long long int) inp.size(), (long long int) n_batch);
        return 1;
    }
    inputs.push_back(inp);
}
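Before tokenization, the two reranking branches in the loop above are plain string manipulation. The template branch uses only the first two segments (query and document). The fallback branch joins all segments, inserting the EOS and/or SEP token text between consecutive segments. The sketch below reproduces that logic outside llama.cpp under stated assumptions:

- replace_all is a stand-in for common's string_replace_all.
- The template string and special-token texts in the test values are hypothetical placeholders. Real values come from llama_model_chat_template(model, "rerank") and llama_vocab_get_text.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for common's string_replace_all: replaces every occurrence of `search`.
static void replace_all(std::string & s, const std::string & search, const std::string & replace) {
    if (search.empty()) {
        return; // nothing sensible to replace
    }
    size_t pos = 0;
    while ((pos = s.find(search, pos)) != std::string::npos) {
        s.replace(pos, search.length(), replace);
        pos += replace.length(); // continue after the inserted text
    }
}

// Branch 1: the model provides a rerank template with {query} and {document} placeholders.
static std::string format_with_template(std::string tmpl, const std::string & query,
                                        const std::string & doc) {
    replace_all(tmpl, "{query}", query);
    replace_all(tmpl, "{document}", doc);
    return tmpl;
}

// Branch 2: no template. Join all segments, inserting the EOS and SEP token text
// between consecutive segments; an empty string means the vocab does not add that token.
static std::string format_with_separators(const std::vector<std::string> & pairs,
                                          const std::string & added_eos_token,
                                          const std::string & added_sep_token) {
    std::string out;
    for (size_t i = 0; i < pairs.size(); i++) {
        out += pairs[i];
        if (i != pairs.size() - 1) {
            out += added_eos_token; // appending "" is a no-op
            out += added_sep_token;
        }
    }
    return out;
}
```

In both branches the resulting string is then tokenized with parse_special enabled, so the inserted token text is recognized as special tokens rather than plain characters.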

EOS/SEP token verification:

// check if the last token is SEP/EOS
for (auto & inp : inputs) {
    if (inp.empty() || (inp.back() != llama_vocab_sep(vocab) && inp.back() != llama_vocab_eos(vocab))) {
        LOG_WRN("%s: last token in the prompt is not SEP or EOS\n", __func__);
        LOG_WRN("%s: 'tokenizer.ggml.add_eos_token' should be set to 'true' in the GGUF header\n", __func__);
    }
}

I/O Contract

Input: A single string (params.prompt) containing one or more texts separated by params.embd_sep (default: "\n")
Output: std::vector<std::vector<int32_t>>, one tokenized prompt per text, ready for batch processing
Preconditions: Model, context, and vocabulary must be loaded; n_batch must be set
Error Handling: Returns error code 1 if any tokenized prompt has more than n_batch tokens; logs a warning for each prompt whose last token is neither SEP nor EOS (including an empty token sequence)

Processing pipeline:

1. Split the input string by the separator. Output: std::vector<std::string> of individual prompts.
2. Detect reranking pairs (RANK pooling only). Output: query/document pairs identified by cls_sep.
3. Apply the rerank template, or insert EOS/SEP token text between pair segments. Output: a formatted prompt string containing special tokens.
4. Tokenize each prompt. Output: std::vector<int32_t> token sequences.
5. Validate each token count against n_batch. Output: an error (return code 1) if any sequence exceeds the limit.
6. Verify that each sequence ends with an EOS or SEP token. Output: a warning if it does not.
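The following minimal sketch traces steps 1, 4 and 5 of the pipeline end to end; the reranking steps 2 and 3 are omitted. It relies on stated assumptions: toy_tokenize is a hypothetical whitespace tokenizer standing in for common_tokenize, and the IDs BOS=1 and EOS=2 are invented. Real vocabularies and token IDs come from the GGUF file.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical special-token IDs for the toy tokenizer.
constexpr int32_t kBos = 1;
constexpr int32_t kEos = 2;

// Toy whitespace tokenizer: assigns IDs 3, 4, ... to words in order of first appearance.
static std::vector<int32_t> toy_tokenize(const std::string & text,
                                         std::unordered_map<std::string, int32_t> & vocab) {
    std::vector<int32_t> ids = {kBos};
    size_t pos = 0;
    while (pos < text.size()) {
        size_t end = text.find(' ', pos);
        if (end == std::string::npos) {
            end = text.size();
        }
        if (end > pos) {
            const std::string word = text.substr(pos, end - pos);
            auto it = vocab.find(word);
            if (it == vocab.end()) {
                it = vocab.emplace(word, static_cast<int32_t>(vocab.size()) + 3).first;
            }
            ids.push_back(it->second);
        }
        pos = end + 1;
    }
    ids.push_back(kEos); // simulates tokenizer.ggml.add_eos_token = true
    return ids;
}

// Split, tokenize, and check each prompt against n_batch.
// Returns false (like embedding.cpp's `return 1`) if any prompt is too long.
static bool build_inputs(const std::string & prompt, const std::string & sep, size_t n_batch,
                         std::vector<std::vector<int32_t>> & inputs) {
    if (sep.empty()) {
        return false; // an empty separator would never advance
    }
    std::unordered_map<std::string, int32_t> vocab;
    size_t start = 0;
    while (true) {
        const size_t end = prompt.find(sep, start);
        const std::string line = prompt.substr(start, end == std::string::npos ? std::string::npos : end - start);
        std::vector<int32_t> ids = toy_tokenize(line, vocab);
        if (ids.size() > n_batch) {
            return false;
        }
        inputs.push_back(std::move(ids));
        if (end == std::string::npos) {
            break;
        }
        start = end + sep.size();
    }
    return true;
}
```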

Usage Examples

Simple multi-line embedding:

./llama-embedding -m bge-small.gguf -p "The cat sat on the mat
The dog ran in the park
Machine learning is fascinating"

This produces three separate embedding vectors, one per line.

Reranking with query-document pairs:

./llama-embedding -m reranker.gguf --pooling rank --cls-separator "\t" \
  -p "what is machine learning\tMachine learning is a subset of AI
what is machine learning\tThe weather today is sunny"

Each line contains a query/document pair separated by a tab. The tab is the default cls separator, so the separator option could be omitted here. The reranker outputs one relevance score per pair.
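When the pairs come from a program rather than being typed by hand, the prompt can be assembled as one pair per segment. The cls separator goes between query and document, and the embedding separator goes between pairs. This is a minimal sketch: build_rerank_prompt is a hypothetical helper, and its defaults mirror llama.cpp's default separators (tab and newline). It rejects texts that contain either separator, because split_lines would otherwise cut them apart.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: builds the prompt string from one query and several documents.
static std::string build_rerank_prompt(const std::string & query,
                                       const std::vector<std::string> & docs,
                                       const std::string & cls_sep  = "\t",
                                       const std::string & embd_sep = "\n") {
    // A text containing a separator would be re-split; an empty separator is also
    // rejected, since std::string::find("") always reports a match at position 0.
    auto contains_sep = [&](const std::string & text) {
        return text.find(cls_sep) != std::string::npos || text.find(embd_sep) != std::string::npos;
    };
    if (contains_sep(query)) {
        throw std::invalid_argument("query contains a separator");
    }
    std::string out;
    for (size_t i = 0; i < docs.size(); i++) {
        if (contains_sep(docs[i])) {
            throw std::invalid_argument("document contains a separator");
        }
        if (i > 0) {
            out += embd_sep; // one query/document pair per segment
        }
        out += query + cls_sep + docs[i];
    }
    return out;
}
```

The returned string can be passed directly as the prompt argument. Real tab and newline characters need no escape processing.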

JSON output format:

./llama-embedding -m model.gguf --embd-output-format json -p "Hello world\nGoodbye world"

Produces OpenAI-style JSON containing one embedding array per input line.
