Implementation: ggml-org/llama.cpp Server Main
| Field | Value |
|---|---|
| Implementation Name | Server Main |
| Doc Type | Wrapper Doc |
| Domain | Server Lifecycle, Process Entry Point |
| Description | Server main function implementing startup sequence: parameter validation, HTTP binding, model loading, route registration, and event loop |
| Related Workflow | OpenAI_Compatible_Server |
Overview
Description
The Server Main implementation is the entry point for the llama-server process. It orchestrates the complete server lifecycle from argument parsing through graceful shutdown. The function supports two operational modes: single-model mode (direct inference) and router mode (multi-model proxy), determined by whether a model path is provided.
Usage
The main function is invoked when the llama-server binary is executed:
# Single-model mode
llama-server --model model.gguf --host 0.0.0.0 --port 8080
# Router mode (no model path - experimental)
llama-server --host 0.0.0.0 --port 8080
Code Reference
| Field | Value |
|---|---|
| Source Location | tools/server/server.cpp:69-322 |
| Signature | int main(int argc, char ** argv) |
| Import | Entry point; links against server-context, common, cpp-httplib |
Parameter validation and backend initialization:
int main(int argc, char ** argv) {
    common_params params;
    if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_SERVER)) {
        return 1;
    }
    // validate batch size for embeddings
    if (params.embedding && params.n_batch > params.n_ubatch) {
        LOG_WRN("%s: setting n_batch = n_ubatch = %d to avoid assertion failure\n", __func__, params.n_ubatch);
        params.n_batch = params.n_ubatch;
    }
    if (params.n_parallel < 0) {
        params.n_parallel = 4;
        params.kv_unified = true;
    }
    common_init();
    server_context ctx_server;
    llama_backend_init();
    llama_numa_init(params.numa);
HTTP server initialization and route registration:
server_http_context ctx_http;
if (!ctx_http.init(params)) {
    return 1;
}
server_routes routes(params, ctx_server);
// Register all API routes
ctx_http.get ("/health", ex_wrapper(routes.get_health));
ctx_http.get ("/metrics", ex_wrapper(routes.get_metrics));
ctx_http.post("/v1/chat/completions", ex_wrapper(routes.post_chat_completions));
ctx_http.post("/v1/completions", ex_wrapper(routes.post_completions_oai));
ctx_http.post("/v1/embeddings", ex_wrapper(routes.post_embeddings_oai));
ctx_http.post("/v1/responses", ex_wrapper(routes.post_responses_oai));
ctx_http.post("/v1/messages", ex_wrapper(routes.post_anthropic_messages));
// ... additional routes
Model loading and main loop (single-model mode):
// Start HTTP server before model loading (enables /health during load)
if (!ctx_http.start()) {
    clean_up();
    return 1;
}
// Load the model
if (!ctx_server.load_model(params)) {
    clean_up();
    return 1;
}
routes.update_meta(ctx_server);
ctx_http.is_ready.store(true);
// Enter blocking main loop
ctx_server.start_loop();
clean_up();
Signal handling:
#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
    struct sigaction sigint_action;
    sigint_action.sa_handler = signal_handler;
    sigemptyset (&sigint_action.sa_mask);
    sigint_action.sa_flags = 0;
    sigaction(SIGINT, &sigint_action, NULL);
    sigaction(SIGTERM, &sigint_action, NULL);
#elif defined (_WIN32)
    auto console_ctrl_handler = +[](DWORD ctrl_type) -> BOOL {
        return (ctrl_type == CTRL_C_EVENT) ? (signal_handler(SIGINT), true) : false;
    };
    SetConsoleCtrlHandler(reinterpret_cast<PHANDLER_ROUTINE>(console_ctrl_handler), true);
#endif
I/O Contract
| Direction | Description |
|---|---|
| Input | CLI arguments and environment variables defining server configuration |
| Output | Running HTTP server bound to configured host:port, serving inference API endpoints |
| Preconditions | Valid model file path (single-model mode) or empty path (router mode); available port |
| Exit Codes | 0 = clean shutdown; 1 = initialization failure (HTTP bind error, model load error, argument parse error) |
| Side Effects | Binds TCP/UNIX socket; loads model weights into memory (potentially multi-GB); spawns HTTP threads and inference processing thread |
Usage Examples
Complete startup sequence (single-model mode):
$ llama-server --model llama-3.gguf --host 0.0.0.0 --port 8080 --metrics --slots
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
main: loading model
main: model loaded
main: server is listening on http://0.0.0.0:8080
main: starting the main loop...
Router mode startup (experimental):
$ llama-server --host 0.0.0.0 --port 8080
main: starting router server, no model will be loaded in this process
main: router server is listening on http://0.0.0.0:8080
main: NOTE: router mode is experimental
Registered route table (from server.cpp):
| Method | Path | Handler | Auth Required |
|---|---|---|---|
| GET | /health, /v1/health | routes.get_health | No |
| GET | /metrics | routes.get_metrics | Yes |
| GET | /models, /v1/models | routes.get_models | No |
| POST | /v1/chat/completions | routes.post_chat_completions | Yes |
| POST | /v1/completions | routes.post_completions_oai | Yes |
| POST | /v1/embeddings | routes.post_embeddings_oai | Yes |
| POST | /v1/responses | routes.post_responses_oai | Yes |
| POST | /v1/messages | routes.post_anthropic_messages | Yes |
| GET | /slots | routes.get_slots | Yes |