Implementation:Ggml org Llama cpp Server Main

From Leeroopedia
Implementation Name: Server Main
Doc Type: Wrapper Doc
Domain: Server Lifecycle, Process Entry Point
Description: Server main function implementing the startup sequence: parameter validation, HTTP binding, model loading, route registration, and event loop
Related Workflow: OpenAI_Compatible_Server

Overview

Description

The Server Main implementation is the entry point for the llama-server process. It orchestrates the complete server lifecycle from argument parsing through graceful shutdown. The function supports two operational modes: single-model mode (direct inference) and router mode (multi-model proxy), determined by whether a model path is provided.

Usage

The main function is invoked when the llama-server binary is executed:

# Single-model mode
llama-server --model model.gguf --host 0.0.0.0 --port 8080

# Router mode (no model path - experimental)
llama-server --host 0.0.0.0 --port 8080

Code Reference

Source Location: tools/server/server.cpp:69-322
Signature: int main(int argc, char ** argv)
Import: Entry point; links against server-context, common, cpp-httplib

Parameter validation and backend initialization:

int main(int argc, char ** argv) {
    common_params params;

    if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_SERVER)) {
        return 1;
    }

    // validate batch size for embeddings
    if (params.embedding && params.n_batch > params.n_ubatch) {
        LOG_WRN("%s: setting n_batch = n_ubatch = %d to avoid assertion failure\n", __func__, params.n_ubatch);
        params.n_batch = params.n_ubatch;
    }

    if (params.n_parallel < 0) {
        params.n_parallel = 4;
        params.kv_unified = true;
    }

    common_init();
    server_context ctx_server;
    llama_backend_init();
    llama_numa_init(params.numa);

HTTP server initialization and route registration:

    server_http_context ctx_http;
    if (!ctx_http.init(params)) {
        return 1;
    }

    server_routes routes(params, ctx_server);

    // Register all API routes
    ctx_http.get ("/health",              ex_wrapper(routes.get_health));
    ctx_http.get ("/metrics",             ex_wrapper(routes.get_metrics));
    ctx_http.post("/v1/chat/completions", ex_wrapper(routes.post_chat_completions));
    ctx_http.post("/v1/completions",      ex_wrapper(routes.post_completions_oai));
    ctx_http.post("/v1/embeddings",       ex_wrapper(routes.post_embeddings_oai));
    ctx_http.post("/v1/responses",        ex_wrapper(routes.post_responses_oai));
    ctx_http.post("/v1/messages",         ex_wrapper(routes.post_anthropic_messages));
    // ... additional routes

Model loading and main loop (single-model mode):

    // Start HTTP server before model loading (enables /health during load)
    if (!ctx_http.start()) {
        clean_up();
        return 1;
    }

    // Load the model
    if (!ctx_server.load_model(params)) {
        clean_up();
        return 1;
    }

    routes.update_meta(ctx_server);
    ctx_http.is_ready.store(true);

    // Enter blocking main loop
    ctx_server.start_loop();

    clean_up();

Signal handling:

#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
    struct sigaction sigint_action;
    sigint_action.sa_handler = signal_handler;
    sigemptyset (&sigint_action.sa_mask);
    sigint_action.sa_flags = 0;
    sigaction(SIGINT, &sigint_action, NULL);
    sigaction(SIGTERM, &sigint_action, NULL);
#elif defined (_WIN32)
    auto console_ctrl_handler = +[](DWORD ctrl_type) -> BOOL {
        return (ctrl_type == CTRL_C_EVENT) ? (signal_handler(SIGINT), true) : false;
    };
    SetConsoleCtrlHandler(reinterpret_cast<PHANDLER_ROUTINE>(console_ctrl_handler), true);
#endif

I/O Contract

Input: CLI arguments and environment variables defining server configuration
Output: Running HTTP server bound to the configured host:port, serving inference API endpoints
Preconditions: Valid model file path (single-model mode) or no model path (router mode); available port
Exit Codes: 0 = clean shutdown; 1 = initialization failure (HTTP bind error, model load error, argument parse error)
Side Effects: Binds a TCP/UNIX socket; loads model weights into memory (potentially multiple GB); spawns HTTP threads and an inference processing thread

Usage Examples

Complete startup sequence (single-model mode):

$ llama-server --model llama-3.gguf --host 0.0.0.0 --port 8080 --metrics --slots
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
main: loading model
main: model loaded
main: server is listening on http://0.0.0.0:8080
main: starting the main loop...

Router mode startup (experimental):

$ llama-server --host 0.0.0.0 --port 8080
main: starting router server, no model will be loaded in this process
main: router server is listening on http://0.0.0.0:8080
main: NOTE: router mode is experimental

Registered route table (from server.cpp):

Method  Path                   Handler                         Auth Required
GET     /health, /v1/health    routes.get_health               No
GET     /metrics               routes.get_metrics              Yes
GET     /models, /v1/models    routes.get_models               No
POST    /v1/chat/completions   routes.post_chat_completions    Yes
POST    /v1/completions        routes.post_completions_oai     Yes
POST    /v1/embeddings         routes.post_embeddings_oai      Yes
POST    /v1/responses          routes.post_responses_oai       Yes
POST    /v1/messages           routes.post_anthropic_messages  Yes
GET     /slots                 routes.get_slots                Yes
