
Principle:Ggml org Llama cpp Server Configuration

From Leeroopedia
Principle Name: Server Configuration
Domain: Server Administration, CLI Parameter Management
Description: Theory of configuring inference server parameters: host, port, model, parallelism, and security
Related Workflow: OpenAI_Compatible_Server

Overview

Description

Configuring an inference server requires mapping a rich set of operational parameters into a consistent runtime state. The Server Configuration principle addresses the theory of how command-line arguments, environment variables, and default values combine to define the behavior of an HTTP inference server.

The configuration space spans several categories:

  • Network parameters: Host address (including UNIX socket support), port number, TLS certificate and key files, read/write timeouts, and HTTP thread count.
  • Model parameters: Model file path, model alias, embedding mode flag, reranking mode, and pooling type configuration.
  • Parallelism parameters: Number of parallel request slots, batch sizes, cache prompt behavior, and cache reuse chunk sizes.
  • Security parameters: API key authentication (single key, comma-separated list, or key file), with support for both CLI flags and environment variables.
  • Endpoint toggles: Selective enabling of monitoring endpoints (metrics, slots, props) and the Web UI.
  • Content serving: Static file path for serving custom frontends, API prefix for URL namespacing, and WebUI configuration overrides.

Usage

Server configuration is applied at startup time when launching llama-server. Each parameter can typically be set via a CLI flag (e.g., --host, --port) or an environment variable (e.g., LLAMA_ARG_HOST, LLAMA_ARG_PORT). This dual-source design supports both interactive use and container/orchestration deployment patterns where environment variables are preferred.
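As an illustration of the dual-source design, the same configuration can be expressed either way. The flags (--model, --host, --port) and environment variables (LLAMA_ARG_MODEL, LLAMA_ARG_HOST, LLAMA_ARG_PORT) are the ones named on this page; the model path is a placeholder:

```shell
# Interactive use: configure via CLI flags
llama-server --model ./models/model.gguf --host 0.0.0.0 --port 8080

# Container/orchestration use: the equivalent environment variables
export LLAMA_ARG_MODEL=./models/model.gguf
export LLAMA_ARG_HOST=0.0.0.0
export LLAMA_ARG_PORT=8080
llama-server
```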

Theoretical Basis

The theory behind inference server configuration rests on several design principles:

Layered configuration precedence defines a clear hierarchy: CLI arguments override environment variables, which override compiled defaults. This pattern (common in twelve-factor applications) ensures predictable behavior while supporting diverse deployment contexts.
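The precedence rule can be sketched as a small resolver. The function below is a hypothetical illustration, not llama.cpp's code; only the LLAMA_ARG_PORT variable name and the 8080 default come from the server's documented configuration:

```shell
# Layered precedence sketch: CLI argument > environment variable > default.
DEFAULT_PORT=8080

resolve_port() {
  cli_port="$1"                       # value parsed from --port; may be empty
  if [ -n "$cli_port" ]; then
    echo "$cli_port"                  # 1. CLI argument wins
  elif [ -n "$LLAMA_ARG_PORT" ]; then
    echo "$LLAMA_ARG_PORT"            # 2. environment variable next
  else
    echo "$DEFAULT_PORT"              # 3. compiled default last
  fi
}

LLAMA_ARG_PORT=9090
resolve_port 7070    # prints 7070 (CLI argument wins)
resolve_port ""      # prints 9090 (environment variable)
unset LLAMA_ARG_PORT
resolve_port ""      # prints 8080 (compiled default)
```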

Fail-fast validation catches configuration errors at startup rather than at request time. For example, if --slot-save-path specifies a non-existent directory, the server terminates immediately with an informative error rather than failing silently when a slot save is attempted.
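The same check can be sketched in a few lines; the function name and error message below are illustrative, modeled on the --slot-save-path behavior described above rather than taken from llama.cpp's actual implementation:

```shell
# Fail-fast sketch: reject a bad configuration at startup, not at request time.
validate_slot_save_path() {
  path="$1"
  if [ -z "$path" ]; then
    return 0                          # flag not set: nothing to validate
  fi
  if [ ! -d "$path" ]; then
    echo "error: slot save path '$path' does not exist" >&2
    return 1                          # refuse to start, rather than fail later
  fi
}

validate_slot_save_path /tmp && echo "startup validation passed"
```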

Environment variable mapping follows a consistent naming convention (LLAMA_ARG_* prefix) that makes configuration discoverable and automatable. This is critical for container orchestration where CLI arguments may be harder to manage than environment injection.
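The convention can be expressed mechanically: strip the leading dashes, uppercase, and replace hyphens with underscores. The helper below is hypothetical (not every flag necessarily has an env-var counterpart), but LLAMA_ARG_HOST and LLAMA_ARG_CTX_SIZE follow the documented pattern:

```shell
# Derive the LLAMA_ARG_* environment variable name from a CLI flag.
flag_to_env() {
  echo "LLAMA_ARG_$(printf '%s' "$1" | sed 's/^--//' | tr 'a-z-' 'A-Z_')"
}

flag_to_env --host        # LLAMA_ARG_HOST
flag_to_env --ctx-size    # LLAMA_ARG_CTX_SIZE
```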

Sensible defaults with explicit overrides ensure that the server works out of the box for common use cases (single model, localhost binding, no authentication) while providing fine-grained control for production deployments. The default parallelism auto-detection (n_parallel = 4 with kv_unified = true) balances throughput and memory usage without requiring manual tuning.
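This pattern maps directly onto the shell's own default-expansion idiom. The sketch below uses the defaults named on this page (localhost binding, 4 parallel slots); whether each variable name exists as a real LLAMA_ARG_* variable is an assumption based on the naming convention:

```shell
# "Sensible defaults with explicit overrides": keep any value the operator
# already set, otherwise fall back to the default.
: "${LLAMA_ARG_HOST:=127.0.0.1}"     # bind to localhost unless overridden
: "${LLAMA_ARG_N_PARALLEL:=4}"       # parallel-slot default from this page

echo "host=$LLAMA_ARG_HOST parallel=$LLAMA_ARG_N_PARALLEL"
```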

Security by opt-in means sensitive features like the metrics endpoint, slots monitoring, and property modification are disabled by default and must be explicitly enabled. API key authentication, when configured, applies globally to protected endpoints while leaving health checks and model listing publicly accessible.
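The access rule above can be sketched as a gatekeeper function. Everything here is illustrative: API_KEYS is a stand-in variable for the configured key list, and the endpoint paths merely mirror the public/protected split described in the text:

```shell
# Opt-in auth sketch: health checks and model listing stay public; all
# other endpoints require one of the comma-separated keys in API_KEYS.
is_authorized() {
  path="$1"; presented="$2"
  case "$path" in
    /health|/v1/models) return 0 ;;   # always publicly accessible
  esac
  if [ -z "$API_KEYS" ]; then
    return 0                          # auth not configured: server is open
  fi
  old_ifs="$IFS"; IFS=','
  for key in $API_KEYS; do
    if [ "$presented" = "$key" ]; then
      IFS="$old_ifs"; return 0        # key matches: allow
    fi
  done
  IFS="$old_ifs"
  return 1                            # no match: reject
}

API_KEYS="secret1,secret2"
is_authorized /completion secret1 && echo "allowed"
```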
