Implementation:Ggml org Llama cpp Server Main

From Leeroopedia
Revision as of 12:42, 16 February 2026 by Admin (auto-imported from implementations/Ggml_org_Llama_cpp_Server_Main.md)
Field | Value
Implementation Name | Server Main
Doc Type | Wrapper Doc
Domain | Server Lifecycle, Process Entry Point
Description | Server main function implementing the startup sequence: parameter validation, HTTP binding, model loading, route registration, and event loop
Related Workflow | OpenAI_Compatible_Server

Overview

Description

The Server Main implementation is the entry point for the llama-server process. It orchestrates the complete server lifecycle, from argument parsing through graceful shutdown. The function supports two operational modes, selected by whether a model path is provided: single-model mode (direct inference) when a path is given, and router mode (multi-model proxy) when it is not.
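
The mode decision described above can be sketched as follows. This is a minimal illustration, not the code in server.cpp: `server_params_sketch`, `server_mode`, `select_mode`, and `mode_name` are hypothetical names introduced here, and the only rule taken from the text is that an empty model path selects router mode.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the part of the server parameters that matters here.
struct server_params_sketch {
    std::string model_path; // empty when no --model argument was given
};

enum class server_mode { single_model, router };

// Router mode is selected when no model path is provided.
inline server_mode select_mode(const server_params_sketch & params) {
    return params.model_path.empty() ? server_mode::router
                                     : server_mode::single_model;
}

// Human-readable name, e.g. for a startup log line.
inline const char * mode_name(server_mode mode) {
    switch (mode) {
        case server_mode::single_model: return "single-model";
        case server_mode::router:       return "router";
    }
    assert(false && "unhandled server_mode");
    return "unknown";
}
```

With these definitions, a parameter set whose `model_path` is `"model.gguf"` maps to single-model mode, and a default-constructed one maps to router mode.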

Usage

The main function is invoked when the llama-server binary is executed:

# Single-model mode
llama-server --model model.gguf --host 0.0.0.0 --port 8080

# Router mode (no model path - experimental)
llama-server --host 0.0.0.0 --port 8080

Code Reference

Field | Value
Source Location | tools/server/server.cpp:69-322
Signature | int main(int argc, char ** argv)
Import | Entry point; links against server-context, common, cpp-httplib

Parameter validation and backend initialization:

int main(int argc, char ** argv) {
    common_params params;

    if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_SERVER)) {
        return 1;
    }

    // validate batch size for embeddings
    if (params.embedding && params.n_batch > params.n_ubatch) {
        LOG_WRN("%s: setting n_batch = n_ubatch = %d to avoid assertion failure\n", __func__, params.n_ubatch);
        params.n_batch = params.n_ubatch;
    }

    // negative n_parallel: fall back to 4 parallel slots with a unified KV cache
    if (params.n_parallel < 0) {
        params.n_parallel = 4;
        params.kv_unified = true;
    }

    common_init();
    server_context ctx_server;
    llama_backend_init();
    llama_numa_init(params.numa);
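
The two parameter adjustments in the excerpt above can be isolated into a small, testable function. The sketch below assumes a hypothetical `params_sketch` struct holding only the fields the excerpt touches; its default values are illustrative and are not claimed to match those of `common_params`.

```cpp
#include <cassert>

// Hypothetical subset of the server parameters; defaults are illustrative only.
struct params_sketch {
    bool embedding  = false;
    int  n_batch    = 2048;
    int  n_ubatch   = 512;
    int  n_parallel = -1;
    bool kv_unified = false;
};

// Applies the same two adjustments as the excerpt above.
inline void normalize_params(params_sketch & p) {
    // With embeddings enabled, clamp the logical batch to the micro-batch size.
    if (p.embedding && p.n_batch > p.n_ubatch) {
        p.n_batch = p.n_ubatch;
    }
    // A negative slot count falls back to 4 slots with a unified KV cache.
    if (p.n_parallel < 0) {
        p.n_parallel = 4;
        p.kv_unified = true;
    }
    // Postconditions of the normalization.
    assert(!p.embedding || p.n_batch <= p.n_ubatch);
    assert(p.n_parallel >= 0);
}
```

Keeping the normalization in one function makes the order of the adjustments explicit and lets both branches be exercised without starting a server.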

HTTP server initialization and route registration:

    server_http_context ctx_http;
    if (!ctx_http.init(params)) {
        return 1;
    }

    server_routes routes(params, ctx_server);

    // Register all API routes
    ctx_http.get ("/health",              ex_wrapper(routes.get_health));
    ctx_http.get ("/metrics",             ex_wrapper(routes.get_metrics));
    ctx_http.post("/v1/chat/completions", ex_wrapper(routes.post_chat_completions));
    ctx_http.post("/v1/completions",      ex_wrapper(routes.post_completions_oai));
    ctx_http.post("/v1/embeddings",       ex_wrapper(routes.post_embeddings_oai));
    ctx_http.post("/v1/responses",        ex_wrapper(routes.post_responses_oai));
    ctx_http.post("/v1/messages",         ex_wrapper(routes.post_anthropic_messages));
    // ... additional routes
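
Each handler above is passed through `ex_wrapper` before registration. Its definition is not shown on this page; one plausible reading is that it converts exceptions thrown by a handler into an error response, so a failing request cannot bring down the process. The sketch below illustrates that pattern under this assumption. `request_sketch`, `response_sketch`, `handler_sketch`, and `ex_wrapper_sketch` are hypothetical names, and the 500 status and message format are illustrative choices.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical request/response types, much simpler than the real server's.
struct request_sketch  { std::string body; };
struct response_sketch { int status; std::string body; };

using handler_sketch = std::function<response_sketch(const request_sketch &)>;

// Wraps a handler so that any exception becomes an HTTP-style 500 response.
inline handler_sketch ex_wrapper_sketch(handler_sketch inner) {
    assert(inner && "handler must be callable");
    return [inner](const request_sketch & req) -> response_sketch {
        try {
            return inner(req);
        } catch (const std::exception & e) {
            return response_sketch{500, std::string("error: ") + e.what()};
        } catch (...) {
            return response_sketch{500, "error: unknown exception"};
        }
    };
}
```

Wrapping at registration time keeps the individual handlers free of try/catch blocks and gives every route the same error behavior.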

Model loading and main loop (single-model mode):

    // Start HTTP server before model loading (enables /health during load)
    if (!ctx_http.start()) {
        clean_up();
        return 1;
    }

    // Load the model
    if (!ctx_server.load_model(params)) {
        clean_up();
        return 1;
    }

    routes.update_meta(ctx_server);
    ctx_http.is_ready.store(true);

    // Enter blocking main loop
    ctx_server.start_loop();

    clean_up();
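
The ordering above (start HTTP, then load the model, then set `is_ready`) means a health probe can get an answer while the weights are still loading. The sketch below models only that ordering. `http_state_sketch`, `health_status`, `startup_trace`, and `run_startup_sketch` are hypothetical names, and the 503 status for the not-ready case is an assumption made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Hypothetical stand-in for the HTTP context's readiness flag.
struct http_state_sketch {
    std::atomic<bool> is_ready{false};
};

// Status a /health handler might return (503 while loading is an assumption).
inline int health_status(const http_state_sketch & state) {
    return state.is_ready.load() ? 200 : 503;
}

struct startup_trace {
    int health_during_load;
    int health_after_load;
    int exit_code;
};

// Mirrors the startup order: HTTP is up first, then the model loads,
// and only then is the readiness flag set.
inline startup_trace run_startup_sketch(const std::function<bool()> & load_model) {
    assert(load_model && "a model loader must be supplied");
    http_state_sketch http;
    startup_trace trace{0, 0, 0};
    trace.health_during_load = health_status(http); // server listening, model not loaded
    if (!load_model()) {
        trace.exit_code = 1;                        // initialization failure
        trace.health_after_load = health_status(http);
        return trace;
    }
    http.is_ready.store(true);
    trace.health_after_load = health_status(http);
    return trace;
}
```

A load balancer or orchestrator can poll the health endpoint and route traffic only once it reports ready, which matters when loading takes a long time for multi-GB models.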

Signal handling (SIGINT and SIGTERM on Unix-like systems, Ctrl+C on Windows):

#if defined (__unix__) || (defined (__APPLE__) && defined (__MACH__))
    struct sigaction sigint_action;
    sigint_action.sa_handler = signal_handler;
    sigemptyset (&sigint_action.sa_mask);
    sigint_action.sa_flags = 0;
    sigaction(SIGINT, &sigint_action, NULL);
    sigaction(SIGTERM, &sigint_action, NULL);
#elif defined (_WIN32)
    auto console_ctrl_handler = +[](DWORD ctrl_type) -> BOOL {
        return (ctrl_type == CTRL_C_EVENT) ? (signal_handler(SIGINT), true) : false;
    };
    SetConsoleCtrlHandler(reinterpret_cast<PHANDLER_ROUTINE>(console_ctrl_handler), true);
#endif
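
The body of `signal_handler` is not shown in the excerpt. A common pattern is for the handler to do nothing except record that shutdown was requested, leaving the main loop to stop at a safe point. The sketch below shows that pattern with the portable `std::signal` instead of `sigaction` or `SetConsoleCtrlHandler`; it is a swapped-in simplification, and every name in it is hypothetical.

```cpp
#include <cassert>
#include <csignal>

// Set from the handler; polled by a (hypothetical) main loop.
static volatile std::sig_atomic_t g_shutdown_requested = 0;

// Keep the handler minimal: only async-signal-safe work belongs here.
static void signal_handler_sketch(int /*signal*/) {
    g_shutdown_requested = 1;
}

// Installs the same handler for SIGINT and SIGTERM.
static void install_handlers_sketch() {
    auto previous = std::signal(SIGINT, signal_handler_sketch);
    assert(previous != SIG_ERR);
    (void) previous; // silence unused-variable warnings when NDEBUG is defined
    std::signal(SIGTERM, signal_handler_sketch);
}

static bool shutdown_requested() {
    return g_shutdown_requested != 0;
}
```

After `install_handlers_sketch()`, calling `std::raise(SIGINT)` runs the handler synchronously, so `shutdown_requested()` returns true immediately afterwards.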

I/O Contract

Direction | Description
Input | CLI arguments and environment variables defining server configuration
Output | Running HTTP server bound to configured host:port, serving inference API endpoints
Preconditions | Valid model file path (single-model mode) or empty path (router mode); available port
Exit Codes | 0 = clean shutdown; 1 = initialization failure (HTTP bind error, model load error, argument parse error)
Side Effects | Binds TCP/UNIX socket; loads model weights into memory (potentially multi-GB); spawns HTTP threads and inference processing thread

Usage Examples

Complete startup sequence (single-model mode):

$ llama-server --model llama-3.gguf --host 0.0.0.0 --port 8080 --metrics --slots
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
main: loading model
main: model loaded
main: server is listening on http://0.0.0.0:8080
main: starting the main loop...

Router mode startup (experimental):

$ llama-server --host 0.0.0.0 --port 8080
main: starting router server, no model will be loaded in this process
main: router server is listening on http://0.0.0.0:8080
main: NOTE: router mode is experimental

Registered route table (from server.cpp):

Method | Path | Handler | Auth Required
GET | /health, /v1/health | routes.get_health | No
GET | /metrics | routes.get_metrics | Yes
GET | /models, /v1/models | routes.get_models | No
POST | /v1/chat/completions | routes.post_chat_completions | Yes
POST | /v1/completions | routes.post_completions_oai | Yes
POST | /v1/embeddings | routes.post_embeddings_oai | Yes
POST | /v1/responses | routes.post_responses_oai | Yes
POST | /v1/messages | routes.post_anthropic_messages | Yes
GET | /slots | routes.get_slots | Yes