
Principle: ggml-org/llama.cpp Server Startup

From Leeroopedia
Principle Name: Server Startup
Domain: Server Lifecycle, Process Initialization
Description: Theory of initializing HTTP servers with model loading, route registration, and slot management
Related Workflow: OpenAI_Compatible_Server

Overview

Description

Server startup encompasses the sequence of operations required to bring an inference server from process launch to a ready state capable of serving requests. The Server Startup principle defines the theory behind the initialization order, the distinction between router and single-model modes, and the graceful handling of failures during startup.

The startup sequence involves several critical phases:

  • Parameter validation: Verifying configuration consistency (e.g., batch size constraints for embeddings, parallelism auto-detection).
  • Backend initialization: Initializing the llama backend and NUMA configuration before any model operations.
  • HTTP server bootstrap: Starting the HTTP listener before model loading so that health check endpoints can respond during the potentially long model loading phase.
  • Model loading: Loading the model weights and initializing the inference context, which may take significant time for large models.
  • Route registration: Binding handler functions to URL paths, establishing the API surface area.
  • Signal handling: Installing platform-specific signal handlers (SIGINT/SIGTERM on UNIX, ConsoleCtrlHandler on Windows) for graceful shutdown.
  • Main loop entry: Entering the blocking event loop that processes inference requests from the task queue.

Usage

The startup principle applies every time llama-server is launched. Understanding the startup sequence is essential for:

  • Diagnosing why a server is not accepting requests (model still loading vs. HTTP bind failure)
  • Understanding why health endpoints respond before inference endpoints
  • Configuring load balancers and orchestration systems that need to distinguish between "process running" and "model ready"
  • Choosing between router mode (multi-model proxy) and single-model mode

Theoretical Basis

Early HTTP binding is a deliberate design choice that separates HTTP transport readiness from inference readiness. By starting the HTTP server before loading the model, the server can respond to /health requests with loading status, enabling external health checkers and load balancers to track startup progress. The ctx_http.is_ready flag transitions to true only after model loading succeeds.

Dual-mode architecture supports two fundamentally different server configurations from a single binary. In single-model mode, the server loads one model and processes requests directly through the server context's task queue. In router mode (when no model path is specified), the server acts as a reverse proxy, delegating requests to child server instances that each manage their own models. This is reflected in the startup code's branching logic around is_router_server.

Ordered cleanup ensures resources are released in the correct order during both normal shutdown and error paths. The clean_up lambda captures the relevant context references and is invoked on all exit paths, including signal-triggered shutdowns. The cleanup order (stop HTTP, terminate server context, free backend) reverses the initialization order.

Blocking main loop design keeps the main thread occupied with the inference task processing loop (ctx_server.start_loop()), which is unblocked by the signal handler calling ctx_server.terminate(). This avoids busy-waiting and ensures the process exits cleanly when interrupted.
