Principle: ggml-org/llama.cpp Server Configuration
| Field | Value |
|---|---|
| Principle Name | Server Configuration |
| Domain | Server Administration, CLI Parameter Management |
| Description | Theory of configuring inference server parameters: host, port, model, parallelism, and security |
| Related Workflow | OpenAI_Compatible_Server |
Overview
Description
Configuring an inference server requires mapping a rich set of operational parameters into a consistent runtime state. The Server Configuration principle addresses the theory of how command-line arguments, environment variables, and default values combine to define the behavior of an HTTP inference server.
The configuration space spans several categories:
- Network parameters: Host address (including UNIX socket support), port number, TLS certificate and key files, read/write timeouts, and HTTP thread count.
- Model parameters: Model file path, model alias, embedding mode flag, reranking mode, and pooling type configuration.
- Parallelism parameters: Number of parallel request slots, batch sizes, cache prompt behavior, and cache reuse chunk sizes.
- Security parameters: API key authentication (single key, comma-separated list, or key file), with support for both CLI flags and environment variables.
- Endpoint toggles: Selective enabling of monitoring endpoints (metrics, slots, props) and the Web UI.
- Content serving: Static file path for serving custom frontends, API prefix for URL namespacing, and WebUI configuration overrides.
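The categories above can be combined in a single launch. A sketch (the model path, alias, and flag values are illustrative; flag names follow `llama-server --help` and should be verified against your build):

```shell
# One flag or two from each category: network, model, parallelism,
# security, endpoint toggles, and content serving.
llama-server \
  --host 0.0.0.0 --port 8080 \
  --model models/model.gguf --alias my-model \
  --parallel 4 --batch-size 2048 \
  --api-key-file /etc/llama/api.keys \
  --metrics --slots \
  --path ./public
```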
Usage
Server configuration is applied at startup time when launching llama-server. Each parameter can typically be set via a CLI flag (e.g., --host, --port) or an environment variable (e.g., LLAMA_ARG_HOST, LLAMA_ARG_PORT). This dual-source design supports both interactive use and container/orchestration deployment patterns where environment variables are preferred.
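As an illustration of the dual-source design, the same settings can be expressed either way (values are illustrative):

```shell
# Interactive use: CLI flags
llama-server --model models/model.gguf --host 127.0.0.1 --port 8080

# Container/orchestration use: the same settings via environment variables
export LLAMA_ARG_MODEL=models/model.gguf
export LLAMA_ARG_HOST=127.0.0.1
export LLAMA_ARG_PORT=8080
llama-server
```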
Theoretical Basis
The theory behind inference server configuration rests on several design principles:
Layered configuration precedence defines a clear hierarchy: CLI arguments override environment variables, which override compiled defaults. This pattern (common in twelve-factor applications) ensures predictable behavior while supporting diverse deployment contexts.
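A hypothetical resolver sketching this precedence in shell parameter-expansion form (the real resolution happens inside llama-server's C++ argument parser; this only illustrates the ordering):

```shell
# Resolve a single parameter: CLI flag > environment variable > compiled default.
resolve_port() {
  cli_port="$1"                  # value parsed from --port, may be empty
  env_port="${LLAMA_ARG_PORT:-}" # value from the environment, may be empty
  default_port=8080              # compiled-in default
  # Each layer falls through to the next only when it is unset/empty.
  echo "${cli_port:-${env_port:-$default_port}}"
}

resolve_port ""        # no CLI flag, no env var -> prints 8080
LLAMA_ARG_PORT=9000
resolve_port ""        # env var only            -> prints 9000
resolve_port 7000      # CLI flag wins           -> prints 7000
```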
Fail-fast validation catches configuration errors at startup rather than at request time. For example, if --slot-save-path specifies a non-existent directory, the server terminates immediately with an informative error rather than failing silently when a slot save is attempted.
Environment variable mapping follows a consistent naming convention (LLAMA_ARG_* prefix) that makes configuration discoverable and automatable. This is critical for container orchestration where CLI arguments may be harder to manage than environment injection.
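In a container deployment this convention allows the whole configuration to be injected as environment variables with no CLI arguments at all. A sketch (the image name and the exact LLAMA_ARG_* variable names are assumptions; confirm them against `llama-server --help` for your build):

```shell
# All configuration injected via -e; the container entrypoint runs
# llama-server with no arguments.
docker run -p 8080:8080 \
  -v "$PWD/models:/models" \
  -e LLAMA_ARG_MODEL=/models/model.gguf \
  -e LLAMA_ARG_HOST=0.0.0.0 \
  -e LLAMA_ARG_PORT=8080 \
  -e LLAMA_ARG_N_PARALLEL=4 \
  ghcr.io/ggml-org/llama.cpp:server
```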
Sensible defaults with explicit overrides ensure the server works out-of-the-box for common use cases (single model, localhost binding, no authentication) while providing fine-grained control for production deployments. The default parallelism auto-detection (n_parallel = 4 with kv_unified = true) balances throughput and memory usage without requiring manual tuning.
Security by opt-in means sensitive features like the metrics endpoint, slots monitoring, and property modification are disabled by default and must be explicitly enabled. API key authentication, when configured, applies globally to protected endpoints while leaving health checks and model listing publicly accessible.
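A sketch of an opt-in launch (flag names per `llama-server --help`; the key value is illustrative):

```shell
# Monitoring endpoints are disabled by default; enable them explicitly
# and gate protected routes behind an API key.
llama-server --model models/model.gguf \
  --metrics \
  --slots \
  --api-key "sk-example-key"
# Protected endpoints now require "Authorization: Bearer <key>",
# while health checks and model listing remain public.
```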