# Principle: ggml-org/llama.cpp Server Startup
| Field | Value |
|---|---|
| Principle Name | Server Startup |
| Domain | Server Lifecycle, Process Initialization |
| Description | Theory of initializing HTTP servers with model loading, route registration, and slot management |
| Related Workflow | OpenAI_Compatible_Server |
## Overview

### Description
Server startup encompasses the sequence of operations required to bring an inference server from process launch to a ready state capable of serving requests. The Server Startup principle defines the theory behind the initialization order, the distinction between router and single-model modes, and the graceful handling of failures during startup.
The startup sequence involves several critical phases:
- Parameter validation: Verifying configuration consistency (e.g., batch size constraints for embeddings, parallelism auto-detection).
- Backend initialization: Initializing the llama backend and NUMA configuration before any model operations.
- HTTP server bootstrap: Starting the HTTP listener before model loading so that health check endpoints can respond during the potentially long model loading phase.
- Model loading: Loading the model weights and initializing the inference context, which may take significant time for large models.
- Route registration: Binding handler functions to URL paths, establishing the API surface area.
- Signal handling: Installing platform-specific signal handlers (SIGINT/SIGTERM on UNIX, a console control handler registered via `SetConsoleCtrlHandler` on Windows) for graceful shutdown.
- Main loop entry: Entering the blocking event loop that processes inference requests from the task queue.
### Usage
The startup principle applies every time llama-server is launched. Understanding the startup sequence is essential for:
- Diagnosing why a server is not accepting requests (model still loading vs. HTTP bind failure)
- Understanding why health endpoints respond before inference endpoints
- Configuring load balancers and orchestration systems that need to distinguish between "process running" and "model ready"
- Choosing between router mode (multi-model proxy) and single-model mode
## Theoretical Basis
Early HTTP binding is a deliberate design choice that separates HTTP transport readiness from inference readiness. By starting the HTTP server before loading the model, the server can respond to `/health` requests with a loading status, enabling external health checkers and load balancers to track startup progress. The `ctx_http.is_ready` flag transitions to true only after model loading succeeds.
Dual-mode architecture supports two fundamentally different server configurations from a single binary. In single-model mode, the server loads one model and processes requests directly through the server context's task queue. In router mode (when no model path is specified), the server acts as a reverse proxy, delegating requests to child server instances that each manage their own models. This is reflected in the startup code's branching logic around `is_router_server`.
Ordered cleanup ensures resources are released in the correct order during both normal shutdown and error paths. The `clean_up` lambda captures the relevant context references and is invoked on all exit paths, including signal-triggered shutdowns. The cleanup order (stop HTTP, terminate server context, free backend) reverses the initialization order.
Blocking main loop design keeps the main thread occupied with the inference task processing loop (`ctx_server.start_loop()`), which is unblocked by the signal handler calling `ctx_server.terminate()`. This avoids busy-waiting and ensures the process exits cleanly when interrupted.