
Heuristic: ggml-org/llama.cpp Batch Size BLAS Minimum

From Leeroopedia
Knowledge Sources
Domains Optimization, Prompt_Processing
Last Updated 2026-02-14 22:00 GMT

Overview

Batch size constraint: n_batch and n_ubatch must be >= 32 to use BLAS acceleration for prompt processing.

Description

llama.cpp uses BLAS (Basic Linear Algebra Subprograms) libraries for efficient matrix multiplication during prompt processing. On CPU, this means libraries like OpenBLAS or Apple Accelerate; on GPU, this means cuBLAS or equivalent. However, BLAS routines have a minimum batch size threshold of 32 tokens. Below this threshold, the simpler non-BLAS code path is used, which can be significantly slower for prompt ingestion.

The default values are n_batch = 2048 (logical) and n_ubatch = 512 (physical), which are well above the minimum. The constraint applies to n_ubatch (the physical batch size actually sent to the backend).

Usage

Use this heuristic when tuning batch sizes for prompt processing speed. This is relevant when reducing batch sizes to save memory or when debugging slow prompt ingestion. The constraint: n_ubatch <= n_batch and both must be >= 32.

The Insight (Rule of Thumb)

  • Action: Ensure -b (n_batch) and -ub (n_ubatch) are both >= 32.
  • Value: Default n_batch = 2048, n_ubatch = 512. Good for most use cases.
  • Trade-off: Larger batches use more memory but process prompts faster. Smaller batches save memory at the cost of slower prompt ingestion.
  • Special case: For embedding extraction, n_batch is forced equal to n_ubatch to avoid assertion failures.

Reasoning

The code comments in common/common.h:362-363 explicitly state the constraint:

int32_t n_batch  = 2048; // logical batch size for prompt processing (must be >=32 to use BLAS)
int32_t n_ubatch =  512; // physical batch size for prompt processing (must be >=32 to use BLAS)

The causal attention constraint from src/llama-context.cpp:152 further limits batch size:

cparams.n_batch = cparams.causal_attn
    ? std::min(cparams.n_ctx, params.n_batch)
    : params.n_batch;
cparams.n_ubatch = std::min(cparams.n_batch,
    params.n_ubatch == 0 ? params.n_batch : params.n_ubatch);

For embeddings, the code forces equal batch sizes to prevent assertion failures (see GitHub issue #12836).
