
Implementation: ggml-org/llama.cpp Batch Header

From Leeroopedia
Knowledge Sources
Domains: Batch_Processing
Last Updated: 2026-02-15 00:00 GMT

Overview

Declares the internal `llama_ubatch` struct (micro-batch) and the `llama_batch_allocr` class for batch management during inference.

Description

The `llama_ubatch` struct holds pointers to token IDs, embeddings, positions, sequence IDs, and output flags for a subset of a batch, supporting multi-positional embeddings (M-RoPE). It optionally owns its data via a shared `data_t` struct. The `llama_batch_allocr` class provides the interface for initializing, validating, and splitting batches into ubatches with different strategies (simple, equal, per-sequence), tracking per-sequence position ranges and coupling information.

Usage

This is a core header used throughout the inference pipeline. The `llama_batch_allocr` class is used by memory implementations to split input batches into appropriately-sized micro-batches for graph computation.

Code Reference

Source Location

Signature

struct llama_ubatch {
    bool equal_seqs() const;
    bool is_pos_2d() const;

    uint32_t n_tokens;
    uint32_t n_seq_tokens;
    uint32_t n_seqs;
    uint32_t n_seqs_unq;
    uint32_t n_pos;

    llama_token  * token;
    float        * embd;
    llama_pos    * pos;
    int32_t      * n_seq_id;
    llama_seq_id ** seq_id;
    llama_seq_id * seq_id_unq;
    int32_t      * seq_idx;
    int8_t       * output;
};

class llama_batch_allocr {
public:
    llama_batch_allocr(uint32_t n_pos_per_embd);

    bool init(const llama_batch & batch_inp, const llama_vocab & vocab,
              const llama_memory_i * memory, uint32_t n_embd,
              uint32_t n_seq_max, bool output_all);

    void split_reset();
    llama_ubatch split_simple(uint32_t n_ubatch);
    llama_ubatch split_equal(uint32_t n_ubatch, bool sequential);
    llama_ubatch split_seq(uint32_t n_ubatch);
    llama_ubatch ubatch_reserve(uint32_t n_seq_tokens, uint32_t n_seqs);
};

Import

#include "llama-batch.h"
// Dependencies:
#include "llama.h"
#include "llama-cparams.h"
#include <array>
#include <vector>
#include <set>
#include <bitset>
#include <memory>
#include <unordered_map>

I/O Contract

Inputs

Name Type Required Description
batch_inp const llama_batch & Yes Input batch to sanitize and split
vocab const llama_vocab & Yes Vocabulary for validation
memory const llama_memory_i * No Memory instance for sequence continuity checks and position determination
n_embd uint32_t Yes Embedding dimension
n_seq_max uint32_t Yes Maximum number of sequences
n_ubatch uint32_t Yes Maximum tokens per micro-batch for split operations
output_all bool Yes Whether all tokens should be marked as output

Outputs

Name Type Description
llama_ubatch struct Micro-batch containing token IDs, positions, sequence info, and output flags
out_ids std::vector<int32_t> Output indices in the order encountered during splitting
n_outputs uint32_t Number of output tokens across all ubatches

Usage Examples

#include "llama-batch.h"

// batch, vocab, memory, n_embd, n_seq_max, and n_ubatch are assumed to be
// provided by the surrounding inference code.
llama_batch_allocr balloc(1); // 1 position per embedding (no M-RoPE)

if (!balloc.init(batch, vocab, memory, n_embd, n_seq_max, false)) {
    // invalid batch (e.g. bad token or sequence IDs)
    return;
}

balloc.split_reset();
while (true) {
    llama_ubatch ubatch = balloc.split_simple(n_ubatch);
    if (ubatch.n_tokens == 0) {
        break; // batch fully consumed
    }
    // process ubatch...
}
