Principle:Ggml org Llama cpp Evaluation Model Loading

Aspect	Detail
Principle Name	Evaluation Model Loading
Domain	Model Perplexity Evaluation
Scope	Loading models for evaluation: context configuration for perplexity computation
Related Workflow	Model_Perplexity_Evaluation

Overview

Description

Loading a model for perplexity evaluation differs from loading one for interactive inference. The context must be configured so that perplexity can be measured accurately: a known, fixed context size, a batch size suited to efficient chunk processing, and enough parallel sequences for batched evaluation. The loading process also involves backend initialization and, optionally, NUMA optimization.

Usage

Model loading for evaluation is performed once at the start of the perplexity tool's execution. The loaded model and context are then reused across all evaluation chunks or benchmark tasks. The configuration is driven by the evaluation mode (perplexity, HellaSwag, Winogrande, or KL divergence) and the user-specified parameters.

Theoretical Basis

Context Size Considerations:

For perplexity evaluation, the context size (n_ctx) determines the window over which the model predicts the next token. A standard perplexity evaluation uses a fixed context window (e.g., 512 tokens) and steps it across the dataset in consecutive chunks of n_ctx tokens. Only the predictions in the second half of each window are scored, so every scored token has at least half a context window of prior tokens as conditioning context.
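The windowing described above can be sketched as a small calculation. This is an illustrative sketch, not the perplexity tool's code: the names ChunkLayout and chunk_layout are hypothetical, the dataset is assumed to be cut into consecutive non-overlapping windows, and the final position of each window is assumed to be unscored because its target token lies outside the window.

```cpp
#include <cstddef>

// Hypothetical helper type: how a token stream is cut into evaluation
// windows and which positions in each window are scored.
struct ChunkLayout {
    std::size_t n_chunks;      // number of full windows that fit in the dataset
    std::size_t first_scored;  // first position in a window whose prediction is scored
    std::size_t n_scored;      // total number of scored predictions over all windows
};

// Assumption: windows do not overlap, and trailing tokens that do not fill
// a whole window are dropped.
ChunkLayout chunk_layout(std::size_t n_tokens, std::size_t n_ctx) {
    ChunkLayout out{};
    if (n_ctx < 2) {
        return out;  // a window needs at least one input and one target
    }
    out.n_chunks = n_tokens / n_ctx;  // integer division
    out.first_scored = n_ctx / 2;     // scoring starts at the second half
    // Positions first_scored .. n_ctx-2 each predict the following token.
    out.n_scored = out.n_chunks * (n_ctx - out.first_scored - 1);
    return out;
}
```

Under these assumptions, a 2048-token dataset with n_ctx = 512 gives four windows, each contributing the predictions made from position 256 onward.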

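When several windows are evaluated in parallel, the KV cache must hold all of them, so its effective size is n_seq * n_ctx, where n_seq is the number of parallel sequences. For standard perplexity, n_seq is derived from the batch size and the context size so that a full batch of windows is processed at once: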

n_seq = max(1, n_batch / n_ctx)

The division is integer division. With a batch size of 2048 and a context size of 512, four sequences are evaluated in parallel; a batch size smaller than the context size still yields one sequence.
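The formula can be checked with a short calculation. The sketch below uses hypothetical names (PplContextSizes, derive_ppl_sizes). The final step, clamping the batch size so it never exceeds the KV cache size, is an assumption added for illustration and is not stated in the text above.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical container for the derived context settings.
struct PplContextSizes {
    int32_t n_seq;    // parallel sequences per batch
    int32_t n_kv;     // effective KV cache size: n_seq * n_ctx
    int32_t n_batch;  // batch size after clamping to the KV cache size
};

// Precondition: n_ctx > 0.
PplContextSizes derive_ppl_sizes(int32_t n_batch, int32_t n_ctx) {
    PplContextSizes s{};
    s.n_seq = std::max<int32_t>(1, n_batch / n_ctx);  // integer division, at least 1
    s.n_kv = s.n_seq * n_ctx;
    // Assumption: a batch larger than the KV cache cannot be used, so clamp it.
    s.n_batch = std::min<int32_t>(n_batch, s.n_kv);
    return s;
}
```

For example, derive_ppl_sizes(2048, 512) yields four sequences and a KV cache of 2048 cells, while derive_ppl_sizes(256, 512) falls back to a single sequence with a 512-cell cache.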

Backend Initialization:

Before loading any model, the GGML backend must be initialized via llama_backend_init(). This sets up the compute backend (CPU, CUDA, Metal, Vulkan) and allocates necessary resources. For systems with NUMA (Non-Uniform Memory Access) architectures, llama_numa_init() configures memory allocation policies to optimize performance.

Model Loading via common_init_from_params():

The common_init_from_params() utility function from llama.cpp's common library handles the full model loading pipeline:

Reads the GGUF model file
Applies any LoRA adapters specified in the parameters
Creates the inference context with the configured parameters
Returns a wrapper object providing access to both the model and context

This function abstracts away the details of llama_model_load_from_file() and llama_init_from_model(), providing a single entry point that respects all common parameters.

Evaluation-Specific Configuration:

Different evaluation modes require different context configurations:

Perplexity: Uses parallel sequences (n_parallel = n_seq) for throughput
HellaSwag: Requires at least 4 parallel sequences (one per ending option), with max sequence count set to 4 * max_tasks_per_batch
Winogrande: Requires at least 2 parallel sequences (one per fill-in option)
KL divergence: Uses a single sequence (n_parallel = 1)
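
The per-mode rules listed above can be written as a single selection function. This is a sketch of the rules as stated on this page, not a copy of the tool's code: EvalMode, parallel_sequences_for, and the requested parameter (standing in for a user-specified parallel sequence count) are hypothetical names, and the HellaSwag limit of 4 * max_tasks_per_batch is not modelled.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical enumeration of the evaluation modes named in the text.
enum class EvalMode { Perplexity, HellaSwag, Winogrande, KLDivergence };

// Returns the number of parallel sequences to request for a given mode.
// Precondition: n_ctx > 0.
int32_t parallel_sequences_for(EvalMode mode, int32_t n_batch, int32_t n_ctx,
                               int32_t requested) {
    switch (mode) {
        case EvalMode::Perplexity:
            // One sequence per context window that fits in a batch.
            return std::max<int32_t>(1, n_batch / n_ctx);
        case EvalMode::HellaSwag:
            // One sequence per ending option, four options per task.
            return std::max<int32_t>(4, requested);
        case EvalMode::Winogrande:
            // One sequence per fill-in option, two options per task.
            return std::max<int32_t>(2, requested);
        case EvalMode::KLDivergence:
            return 1;
    }
    return requested;  // unreachable for valid enum values
}
```

Keeping the selection in one place makes the minimum requirement of each mode explicit while still honouring a larger user-specified value for the multiple-choice benchmarks.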
