Implementation:Ggml org Llama cpp Server CLI Args

From Leeroopedia
Revision as of 12:42, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Ggml_org_Llama_cpp_Server_CLI_Args.md)
Field | Value
Implementation Name | Server CLI Args
Doc Type | Wrapper Doc
Domain | CLI Configuration, Argument Parsing
Description | CLI argument parsing for llama-server configuration: host, port, model, parallelism, security, and endpoint toggles
Related Workflow | OpenAI_Compatible_Server

Overview

Description

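The Server CLI Args implementation defines the command-line interface for configuring llama-server. Each argument is registered through add_opt as a common_arg object that carries its flag names, an optional value placeholder, help text, a handler lambda that writes into common_params, and an optional environment variable binding set with set_env. Arguments are scoped with set_examples to the LLAMA_EXAMPLE_SERVER example type, so they only appear when parsing server-specific invocations; some, such as --embedding, are additionally registered for other example types (LLAMA_EXAMPLE_DEBUG).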
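
The registration pattern described above can be modeled in a few lines. The following is a reduced, self-contained sketch and not the actual common_arg class: the names demo_params, demo_arg, demo_options, and demo_parse are hypothetical, every flag is assumed to take exactly one value, and values arrive as strings (the real --port handler receives an int).

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical, reduced stand-in for common_params.
struct demo_params {
    std::string hostname = "127.0.0.1";
    int         port     = 8080;
};

// Hypothetical stand-in for common_arg: flag names, help text,
// a handler that writes into the params struct, and an optional env binding.
struct demo_arg {
    std::vector<std::string> flags;
    std::string help;
    std::function<void(demo_params &, const std::string &)> handler;
    std::string env;  // empty when no environment variable is bound

    demo_arg & set_env(const std::string & name) {
        env = name;
        return *this;
    }
};

// Register two options in the same chained style as add_opt(common_arg(...)).
inline std::vector<demo_arg> demo_options() {
    std::vector<demo_arg> options;
    options.push_back(demo_arg{
        {"--host"}, "address to listen on",
        [](demo_params & p, const std::string & v) { p.hostname = v; },
        ""}.set_env("LLAMA_ARG_HOST"));
    options.push_back(demo_arg{
        {"--port"}, "port to listen on",
        [](demo_params & p, const std::string & v) { p.port = std::stoi(v); },
        ""}.set_env("LLAMA_ARG_PORT"));
    return options;
}

// Dispatch each "flag value" pair to the matching handler.
// Returns false on an unknown flag or a flag without a value.
inline bool demo_parse(const std::vector<std::string> & args,
                       const std::vector<demo_arg> & options,
                       demo_params & params) {
    for (std::size_t i = 0; i < args.size(); i += 2) {
        const demo_arg * match = nullptr;
        for (const auto & opt : options) {
            for (const auto & flag : opt.flags) {
                if (flag == args[i]) {
                    match = &opt;
                }
            }
        }
        if (match == nullptr || i + 1 >= args.size()) {
            return false;
        }
        match->handler(params, args[i + 1]);
    }
    return true;
}
```

Keeping the handler next to the flag definition means each option is declared in one place; the parser itself only needs to look up a flag and call whatever handler was registered for it.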

Usage

Arguments are passed when launching the server:

llama-server \
  --host 0.0.0.0 \
  --port 8080 \
  --model model.gguf \
  --embedding \
  --metrics \
  --api-key my-secret-key \
  --threads-http 4

Most arguments can alternatively be set via environment variables:

export LLAMA_ARG_HOST=0.0.0.0
export LLAMA_ARG_PORT=8080
export LLAMA_ARG_EMBEDDINGS=1
llama-server --model model.gguf
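
One way to picture the interaction between flags and environment variables is a three-step lookup. The sketch below assumes that an explicit command-line value takes precedence over the environment variable, which in turn overrides the built-in default; that ordering and the resolve_setting helper are assumptions of this illustration, so check the llama-server documentation for the authoritative behavior.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: resolve one setting from the CLI value (if given),
// then the named environment variable (if set), then the default.
inline std::string resolve_setting(const std::string * cli_value,
                                   const char *        env_name,
                                   const std::string & fallback) {
    if (cli_value != nullptr) {
        return *cli_value;  // explicit flag wins (assumed precedence)
    }
    if (const char * env = std::getenv(env_name)) {
        return env;         // environment variable is the second choice
    }
    return fallback;        // otherwise keep the built-in default
}
```

With LLAMA_ARG_HOST=0.0.0.0 exported and no --host flag, resolve_setting(nullptr, "LLAMA_ARG_HOST", "127.0.0.1") returns "0.0.0.0" under these assumptions.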

Code Reference

Field | Value
Source Location | common/arg.cpp:2772-2969
Primary Structure | common_params (populated by argument parsing)
Registration Function | add_opt(common_arg(...))
Import | #include "arg.h"

Network arguments:

add_opt(common_arg(
    {"--host"}, "HOST",
    string_format("ip address to listen, or bind to an UNIX socket if the address ends with .sock (default: %s)", params.hostname.c_str()),
    [](common_params & params, const std::string & value) {
        params.hostname = value;
    }
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_HOST"));

add_opt(common_arg(
    {"--port"}, "PORT",
    string_format("port to listen (default: %d)", params.port),
    [](common_params & params, int value) {
        params.port = value;
    }
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_PORT"));
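
The --host help text says the server binds to a UNIX socket when the address ends with .sock. A minimal sketch of that suffix check follows; is_unix_socket_host is a hypothetical helper written for illustration and is not taken from the llama.cpp sources.

```cpp
#include <string>

// Hypothetical helper: true when a --host value ends with ".sock"
// and would therefore be treated as a UNIX socket path.
inline bool is_unix_socket_host(const std::string & host) {
    const std::string suffix = ".sock";
    return host.size() >= suffix.size() &&
           host.compare(host.size() - suffix.size(), suffix.size(), suffix) == 0;
}
```

Under this rule a value such as /tmp/llama.sock selects a UNIX socket, while 0.0.0.0 is treated as an IP address to listen on.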

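Embedding and reranking mode (note that --rerank also enables embedding mode and sets the pooling type to LLAMA_POOLING_TYPE_RANK):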

add_opt(common_arg(
    {"--embedding", "--embeddings"},
    string_format("restrict to only support embedding use case; use only with dedicated embedding models (default: %s)", params.embedding ? "enabled" : "disabled"),
    [](common_params & params) {
        params.embedding = true;
    }
).set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_DEBUG}).set_env("LLAMA_ARG_EMBEDDINGS"));

add_opt(common_arg(
    {"--rerank", "--reranking"},
    string_format("enable reranking endpoint on server (default: %s)", "disabled"),
    [](common_params & params) {
        params.embedding = true;
        params.pooling_type = LLAMA_POOLING_TYPE_RANK;
    }
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_RERANKING"));

Security arguments:

add_opt(common_arg(
    {"--api-key"}, "KEY",
    "API key to use for authentication, multiple keys can be provided as a comma-separated list (default: none)",
    [](common_params & params, const std::string & value) {
        for (const auto & key : parse_csv_row(value)) {
            if (!key.empty()) {
                params.api_keys.push_back(key);
            }
        }
    }
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_API_KEY"));
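
The --api-key handler splits its value on commas and skips empty entries. The sketch below reproduces that behavior with a simplified stand-in for parse_csv_row: split_api_keys is a hypothetical name, it splits on commas only, and it does not handle quoting or trim whitespace, which the real helper may do.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Simplified, hypothetical stand-in for the comma-separated key parsing:
// split on ',' and drop empty entries, as the --api-key handler does.
inline std::vector<std::string> split_api_keys(const std::string & value) {
    std::vector<std::string> keys;
    std::stringstream stream(value);
    std::string key;
    while (std::getline(stream, key, ',')) {
        if (!key.empty()) {
            keys.push_back(key);
        }
    }
    return keys;
}
```

Dropping empty entries means a stray trailing comma or a doubled comma in the key list does not register an empty string as a valid key.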

Monitoring and endpoint toggles:

add_opt(common_arg(
    {"--metrics"},
    string_format("enable prometheus compatible metrics endpoint (default: %s)", params.endpoint_metrics ? "enabled" : "disabled"),
    [](common_params & params) {
        params.endpoint_metrics = true;
    }
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_ENDPOINT_METRICS"));

add_opt(common_arg(
    {"--slots"},
    {"--no-slots"},
    string_format("expose slots monitoring endpoint (default: %s)", params.endpoint_slots ? "enabled" : "disabled"),
    [](common_params & params, bool value) {
        params.endpoint_slots = value;
    }
).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_ENDPOINT_SLOTS"));
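
Unlike --metrics, the --slots / --no-slots pair passes a bool to its handler, so the same option can switch the endpoint on or off. The sketch below illustrates one possible resolution rule for such a pair; resolve_toggle is hypothetical, and the "last occurrence wins" behavior is an assumption of this example rather than something stated in the source.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: apply a positive/negative flag pair to a boolean.
// Assumption: the last occurrence on the command line wins, and the
// default is kept when neither flag appears.
inline bool resolve_toggle(const std::vector<std::string> & args,
                           const std::string & on_flag,
                           const std::string & off_flag,
                           bool default_value) {
    bool value = default_value;
    for (const auto & arg : args) {
        if (arg == on_flag) {
            value = true;
        } else if (arg == off_flag) {
            value = false;
        }
    }
    return value;
}
```

A paired flag is useful when the compiled-in default may change between releases, since a launch script can state the desired value explicitly either way.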

I/O Contract

Item | Description
Input | Command-line arguments (argc, argv) and environment variables (LLAMA_ARG_*, LLAMA_API_KEY)
Output | Populated common_params struct with all server configuration values
Preconditions | Called via common_params_parse(argc, argv, params, LLAMA_EXAMPLE_SERVER)
Error Handling | Returns false on parse failure; throws std::invalid_argument for invalid directory paths; throws std::runtime_error for missing key files
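
The error-handling entry above mentions std::runtime_error for missing key files. As a rough illustration of that contract, the sketch below loads keys from a file and throws when the file cannot be opened. The load_api_keys name, the one-key-per-line layout, and the error message are assumptions made for this example, not details taken from common/arg.cpp.

```cpp
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical loader for an --api-key-file style input.
// Assumption: one key per line, blank lines ignored.
inline std::vector<std::string> load_api_keys(const std::string & path) {
    std::ifstream file(path);
    if (!file) {
        // Mirrors the documented contract: a missing key file is a runtime error.
        throw std::runtime_error("failed to open API key file: " + path);
    }
    std::vector<std::string> keys;
    std::string line;
    while (std::getline(file, line)) {
        if (!line.empty()) {
            keys.push_back(line);
        }
    }
    return keys;
}
```

A caller would typically wrap the call in a try block and treat the exception as a fatal startup error, since running without the intended keys would leave the server unprotected.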

Argument table (server-specific flags covered by this page):

Flag | Env Var | Description
--host HOST | LLAMA_ARG_HOST | IP address or UNIX socket path
--port PORT | LLAMA_ARG_PORT | TCP port number
--path PATH | LLAMA_ARG_STATIC_PATH | Static file serving directory
--api-prefix PREFIX | LLAMA_ARG_API_PREFIX | URL prefix for API routes
--webui / --no-webui | LLAMA_ARG_WEBUI | Enable/disable Web UI
--embedding, --embeddings | LLAMA_ARG_EMBEDDINGS | Enable embedding mode
--rerank, --reranking | LLAMA_ARG_RERANKING | Enable reranking endpoint
--api-key KEY | LLAMA_API_KEY | API authentication key(s), comma-separated
--api-key-file FNAME | none | File containing API keys
--ssl-key-file FNAME | LLAMA_ARG_SSL_KEY_FILE | PEM SSL private key
--ssl-cert-file FNAME | LLAMA_ARG_SSL_CERT_FILE | PEM SSL certificate
-to, --timeout N | LLAMA_ARG_TIMEOUT | Read/write timeout in seconds
--threads-http N | LLAMA_ARG_THREADS_HTTP | HTTP processing threads
--cache-prompt / --no-cache-prompt | LLAMA_ARG_CACHE_PROMPT | Enable/disable prompt caching
--cache-reuse N | LLAMA_ARG_CACHE_REUSE | Min chunk size for KV cache reuse
--metrics | LLAMA_ARG_ENDPOINT_METRICS | Enable Prometheus metrics
--props | LLAMA_ARG_ENDPOINT_PROPS | Enable props modification endpoint
--slots / --no-slots | LLAMA_ARG_ENDPOINT_SLOTS | Enable/disable slot monitoring
--slot-save-path PATH | none | Directory for slot KV cache saves

Usage Examples

Minimal server launch:

llama-server --model model.gguf

Production-style configuration with API keys, TLS, and the monitoring endpoints enabled:

llama-server \
  --host 0.0.0.0 \
  --port 8080 \
  --model model.gguf \
  --threads-http 8 \
  --timeout 30 \
  --api-key-file /etc/llama/keys.txt \
  --ssl-key-file /etc/ssl/server.key \
  --ssl-cert-file /etc/ssl/server.crt \
  --metrics \
  --slots \
  --cache-prompt

Embedding-only server:

llama-server \
  --model embedding-model.gguf \
  --embedding \
  --port 8081
