Implementation: ggml-org/llama.cpp Server Main
| Field | Value |
|---|---|
| Implementation Name | Server Main |
| Doc Type | Wrapper Doc |
| Domain | Server Lifecycle, Process Entry Point |
| Description | Server main function implementing startup sequence: parameter validation, HTTP binding, model loading, route registration, and event loop |
| Related Workflow | OpenAI_Compatible_Server |
Overview
Description
The Server Main implementation is the entry point for the llama-server process. It orchestrates the complete server lifecycle from argument parsing through graceful shutdown. The function supports two operational modes: single-model mode (direct inference) and router mode (multi-model proxy), determined by whether a model path is provided.
Usage
The main function is invoked when the llama-server binary is executed:
# Single-model mode
llama-server --model model.gguf --host 0.0.0.0 --port 8080
# Router mode (no model path - experimental)
llama-server --host 0.0.0.0 --port 8080
Code Reference
| Field | Value |
|---|---|
| Source Location | tools/server/server.cpp:69-322 |
| Signature | int main(int argc, char ** argv) |
| Import | Entry point; links against server-context, common, cpp-httplib |
Parameter validation and backend initialization:
int main(int argc, char ** argv) {
    common_params params;
    if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_SERVER)) {
        return 1;
    }
    // validate batch size for embeddings
    if (params.embedding && params.n_batch > params.n_ubatch) {
        LOG_WRN("%s: setting n_batch = n_ubatch = %d to avoid assertion failure\n", __func__, params.n_ubatch);
        params.n_batch = params.n_ubatch;
    }
    if (params.n_parallel < 0) {
        params.n_parallel = 4;
        params.kv_unified = true;
    }
    common_init();
    server_context ctx_server;
    llama_backend_init();
    llama_numa_init(params.numa);
HTTP server initialization and route registration:
server_http_context ctx_http;
if (!ctx_http.init(params)) {
    return 1;
}
server_routes routes(params, ctx_server);
// Register all API routes
ctx_http.get ("/health", ex_wrapper(routes.get_health));
ctx_http.get ("/metrics", ex_wrapper(routes.get_metrics));
ctx_http.post("/v1/chat/completions", ex_wrapper(routes.post_chat_completions));
ctx_http.post("/v1/completions", ex_wrapper(routes.post_completions_oai));
ctx_http.post("/v1/embeddings", ex_wrapper(routes.post_embeddings_oai));
ctx_http.post("/v1/responses", ex_wrapper(routes.post_responses_oai));
ctx_http.post("/v1/messages", ex_wrapper(routes.post_anthropic_messages));
// ... additional routes
Model loading and main loop (single-model mode):
// Start HTTP server before model loading (enables /health during load)
if (!ctx_http.start()) {
    clean_up();
    return 1;
}
// Load the model
if (!ctx_server.load_model(params)) {
    clean_up();
    return 1;
}
routes.update_meta(ctx_server);
ctx_http.is_ready.store(true);
// Enter blocking main loop
ctx_server.start_loop();
clean_up();
Signal handling:
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
    struct sigaction sigint_action;
    sigint_action.sa_handler = signal_handler;
    sigemptyset (&sigint_action.sa_mask);
    sigint_action.sa_flags = 0;
    sigaction(SIGINT, &sigint_action, NULL);
    sigaction(SIGTERM, &sigint_action, NULL);
#elif defined (_WIN32)
    auto console_ctrl_handler = +[](DWORD ctrl_type) -> BOOL {
        return (ctrl_type == CTRL_C_EVENT) ? (signal_handler(SIGINT), true) : false;
    };
    SetConsoleCtrlHandler(reinterpret_cast<PHANDLER_ROUTINE>(console_ctrl_handler), true);
#endif
I/O Contract
| Direction | Description |
|---|---|
| Input | CLI arguments and environment variables defining server configuration |
| Output | Running HTTP server bound to configured host:port, serving inference API endpoints |
| Preconditions | Valid model file path (single-model mode) or empty path (router mode); available port |
| Exit Codes | 0 = clean shutdown; 1 = initialization failure (HTTP bind error, model load error, argument parse error) |
| Side Effects | Binds TCP/UNIX socket; loads model weights into memory (potentially multi-GB); spawns HTTP threads and inference processing thread |
Usage Examples
Complete startup sequence (single-model mode):
$ llama-server --model llama-3.gguf --host 0.0.0.0 --port 8080 --metrics --slots
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
main: loading model
main: model loaded
main: server is listening on http://0.0.0.0:8080
main: starting the main loop...
Router mode startup (experimental):
$ llama-server --host 0.0.0.0 --port 8080
main: starting router server, no model will be loaded in this process
main: router server is listening on http://0.0.0.0:8080
main: NOTE: router mode is experimental
Registered route table (from server.cpp):
| Method | Path | Handler | Auth Required |
|---|---|---|---|
| GET | /health, /v1/health | routes.get_health | No |
| GET | /metrics | routes.get_metrics | Yes |
| GET | /models, /v1/models | routes.get_models | No |
| POST | /v1/chat/completions | routes.post_chat_completions | Yes |
| POST | /v1/completions | routes.post_completions_oai | Yes |
| POST | /v1/embeddings | routes.post_embeddings_oai | Yes |
| POST | /v1/responses | routes.post_responses_oai | Yes |
| POST | /v1/messages | routes.post_anthropic_messages | Yes |
| GET | /slots | routes.get_slots | Yes |