Implementation:Ggml org Llama cpp Batch
| Knowledge Sources | |
|---|---|
| Domains | Batch_Processing |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Implements batch allocation, validation, and splitting logic for processing token batches during inference in llama.cpp.
Description
The `llama_batch_allocr` class has three responsibilities. First, it validates the input batch by checking token IDs against the vocabulary size and validating sequence IDs. Second, it auto-generates any missing fields (positions, sequence IDs, output flags) and tracks the set of positions seen for each sequence. Third, it provides three splitting strategies: `split_simple` (arbitrary token groups), `split_equal` (equal-length sequence sets for efficient batched processing), and `split_seq` (one sequence set per ubatch). Each strategy builds `llama_ubatch` objects that hold the actual data pointers consumed by the compute graph.
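To make the splitting idea concrete, here is a minimal, self-contained sketch of a "simple" split: it hands out the next `n_ubatch` tokens in order, ignoring sequence boundaries. The `toy_ubatch` and `toy_splitter` types are hypothetical stand-ins invented for illustration. They are not the llama.cpp types, and the real `split_simple` also carries positions, sequence IDs, and output flags.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a micro-batch: only the token IDs it covers.
struct toy_ubatch {
    std::vector<int32_t> tokens;
};

// Hypothetical splitter that imitates the "simple" strategy on a flat token list.
class toy_splitter {
public:
    explicit toy_splitter(std::vector<int32_t> tokens) : tokens_(std::move(tokens)) {}

    // Returns the next chunk of at most n_ubatch tokens.
    // Returns an empty ubatch once every token has been handed out.
    toy_ubatch split_simple(uint32_t n_ubatch) {
        assert(n_ubatch > 0);
        const std::size_t end = std::min<std::size_t>(tokens_.size(), used_ + n_ubatch);
        toy_ubatch out;
        out.tokens.assign(tokens_.begin() + static_cast<std::ptrdiff_t>(used_),
                          tokens_.begin() + static_cast<std::ptrdiff_t>(end));
        used_ = end;  // remember how far we have consumed the batch
        return out;
    }

private:
    std::vector<int32_t> tokens_;
    std::size_t used_ = 0;  // number of tokens already placed in a ubatch
};
```

In this sketch a caller keeps requesting chunks until one comes back empty, which keeps the "are we done" check in a single place.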
Usage
This is a core internal component that mediates between the user-facing `llama_batch` API and the internal `llama_ubatch` format. It is used automatically during `llama_decode()` and `llama_encode()` calls to prepare token data for the compute graph and KV cache.
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: src/llama-batch.cpp
- Lines: 1-917
Signature
class llama_batch_allocr {
public:
    llama_batch_allocr(uint32_t n_pos_per_embd);
    // Initialize and validate a batch
    bool init(
            const llama_batch & batch_inp,
            const llama_vocab & vocab,
            const llama_memory_i * memory,
            uint32_t n_embd,
            uint32_t n_seq_max,
            bool output_all);
    // Splitting strategies
    llama_ubatch split_simple(uint32_t n_ubatch);
    llama_ubatch split_equal(uint32_t n_ubatch);
    llama_ubatch split_seq(uint32_t n_ubatch);
    // State queries
    bool get_ubatch(llama_ubatch & ubatch) const;
    int64_t n_tokens() const;
    void clear();
private:
    uint32_t n_pos_per_embd;
    int debug;
    llama_batch batch;
    const llama_vocab * vocab;
    std::vector<std::set<llama_pos>> seq_pos;
    std::vector<std::vector<bool>> seq_cpl;
    std::vector<int32_t> seq_idx;
    // ... additional internal state
};
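The `seq_pos` member above stores one ordered set of positions per sequence. As a rough illustration of why an ordered set is convenient here, the following sketch records positions per sequence and answers min, max, and contiguity queries. `toy_pos` and `toy_seq_tracker` are hypothetical names, and the contiguity check is an assumption about the kind of query such a structure supports, not a copy of the llama.cpp logic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

using toy_pos = int32_t;  // hypothetical stand-in for llama_pos

// Hypothetical tracker: seq_pos[s] holds every position seen for sequence s.
struct toy_seq_tracker {
    std::vector<std::set<toy_pos>> seq_pos;

    explicit toy_seq_tracker(std::size_t n_seq_max) : seq_pos(n_seq_max) {}

    // Record that sequence seq_id contains a token at position pos.
    void add(std::size_t seq_id, toy_pos pos) {
        assert(seq_id < seq_pos.size());
        seq_pos[seq_id].insert(pos);
    }

    // Smallest recorded position, or -1 if the sequence is empty.
    toy_pos pos_min(std::size_t seq_id) const {
        const auto & s = seq_pos[seq_id];
        return s.empty() ? -1 : *s.begin();
    }

    // Largest recorded position, or -1 if the sequence is empty.
    toy_pos pos_max(std::size_t seq_id) const {
        const auto & s = seq_pos[seq_id];
        return s.empty() ? -1 : *s.rbegin();
    }

    // True when the recorded positions form a gap-free range.
    bool is_contiguous(std::size_t seq_id) const {
        const auto & s = seq_pos[seq_id];
        if (s.empty()) {
            return true;
        }
        return pos_max(seq_id) - pos_min(seq_id) + 1 == static_cast<toy_pos>(s.size());
    }
};
```

Because `std::set` keeps its elements sorted and unique, the minimum and maximum are just the first and last elements, and a gap shows up as a mismatch between the range width and the element count.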
Import
#include "llama-batch.h"
#include "llama-impl.h"
#include "llama-vocab.h"
#include "llama-memory.h"
#include <cassert>
#include <cstring>
#include <algorithm>
#include <sstream>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| batch_inp | llama_batch | Yes | User-provided batch containing tokens, positions, sequence IDs, and output flags |
| vocab | llama_vocab | Yes | Vocabulary reference for token ID validation |
| memory | const llama_memory_i* | No | Optional memory interface used for sequence position tracking |
| n_embd | uint32_t | Yes | Embedding dimension size |
| n_seq_max | uint32_t | Yes | Maximum number of sequences allowed |
| output_all | bool | Yes | Whether to mark all tokens for output (overrides the per-token `logits` flags) |
| n_ubatch | uint32_t | Yes | Maximum number of tokens per micro-batch; passed to the `split_*` methods rather than to `init` |
Outputs
| Name | Type | Description |
|---|---|---|
| ubatch | llama_ubatch | Micro-batch with data pointers ready for the compute graph |
| success | bool | Whether batch initialization and validation succeeded |
| n_tokens | int64_t | Total number of tokens in the validated batch |
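The contract above (validate, fill in what is missing, report success as a `bool`) can be sketched in a few lines. This is a simplified, single-sequence illustration under stated assumptions: `toy_batch` and `toy_init` are hypothetical, `n_vocab` stands in for the vocabulary size, and `pos_start` stands in for whatever starting position the memory interface would supply.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical, simplified input batch holding a single sequence.
struct toy_batch {
    std::vector<int32_t> token;  // token IDs
    std::vector<int32_t> pos;    // positions; empty means "not provided"
};

// Validates token IDs and auto-generates positions when they are missing.
// Returns false if a token ID is outside [0, n_vocab) or if positions were
// provided but their count does not match the token count.
bool toy_init(toy_batch & batch, int32_t n_vocab, int32_t pos_start) {
    for (const int32_t id : batch.token) {
        if (id < 0 || id >= n_vocab) {
            return false;  // invalid token ID
        }
    }
    if (batch.pos.empty()) {
        // Missing positions: assign consecutive values starting at pos_start.
        batch.pos.resize(batch.token.size());
        std::iota(batch.pos.begin(), batch.pos.end(), pos_start);
    } else if (batch.pos.size() != batch.token.size()) {
        return false;  // inconsistent field lengths
    }
    return true;
}
```

For example, a batch with tokens `{5, 6, 7}`, no positions, and `pos_start = 3` would come back with positions `{3, 4, 5}`, while a batch containing token `42` against `n_vocab = 10` would be rejected.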
Usage Examples
// Internal usage within llama_context::decode()
llama_batch_allocr batch_allocr(n_pos_per_embd);
// Initialize with the user batch
const bool ok = batch_allocr.init(batch, vocab, memory, n_embd, n_seq_max, output_all);
if (!ok) {
    return -1; // validation failed
}
// Split into micro-batches; an empty ubatch signals that the batch is consumed
while (true) {
    llama_ubatch ubatch = batch_allocr.split_equal(n_ubatch);
    if (ubatch.n_tokens == 0) {
        break;
    }
    // Process ubatch through the compute graph (process_ubatch is a placeholder)
    process_ubatch(ubatch);
}