Workflow:Ggml org Llama cpp OpenAI Compatible Server

Knowledge Sources	llama.cpp Server Documentation
Domains	LLMs, Inference, API_Server, Deployment
Last Updated	2026-02-14 22:00 GMT

Overview

End-to-end process for deploying a GGUF model as an OpenAI-compatible HTTP server with multi-user support, streaming, and continuous batching.

Description

This workflow covers deploying a language model as an HTTP API server that implements the OpenAI API specification. The server provides endpoints for chat completions, text completions, embeddings, and reranking, making it a drop-in replacement for the OpenAI API with local model inference. It supports concurrent multi-user access through a slot-based system with continuous batching, streaming responses via Server-Sent Events, grammar-constrained output (JSON mode, function calling), multimodal inputs (images, audio), speculative decoding for faster generation, and a built-in web UI for interactive use.

Usage

Execute this workflow when you need to serve a GGUF model as an API endpoint for applications that consume the OpenAI API format. This is appropriate for local development servers, production deployments, integration with existing OpenAI-compatible clients and frameworks (LangChain, OpenAI Python SDK), or multi-user inference serving.

Execution Steps

Step 1: Build the Server Binary

Compile the llama-server binary from the llama.cpp source. The server is built as part of the standard CMake build and links against the core llama library plus an embedded HTTP server library (cpp-httplib).

Key considerations:

Enable GPU backend support (CUDA, Metal, Vulkan) at build time for GPU acceleration
TLS/HTTPS support requires OpenSSL development libraries
Pre-built binaries and Docker images are available as alternatives to building from source
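
The build can be scripted. The following is a minimal sketch, assuming the CMake option names GGML_CUDA, GGML_METAL, and GGML_VULKAN and a build directory named build; the helper name cmake_commands is hypothetical. Option names have changed between llama.cpp releases, so check them against the checkout being built.

```python
# Assumed CMake switches per backend; confirm against your llama.cpp checkout.
BACKEND_FLAGS = {
    "cpu": [],
    "cuda": ["-DGGML_CUDA=ON"],
    "metal": ["-DGGML_METAL=ON"],
    "vulkan": ["-DGGML_VULKAN=ON"],
}


def cmake_commands(backend="cpu", build_dir="build"):
    """Return the configure and build commands as argument lists."""
    if backend not in BACKEND_FLAGS:
        raise ValueError(f"unknown backend: {backend}")
    # Configure step: generate the build tree with the chosen backend enabled.
    configure = ["cmake", "-B", build_dir, *BACKEND_FLAGS[backend]]
    # Build step: compile only the llama-server target in Release mode.
    build = ["cmake", "--build", build_dir, "--config", "Release",
             "--target", "llama-server"]
    return [configure, build]
```

Each returned list can be passed to subprocess.run(cmd, check=True) from the root of the source tree. Returning argument lists rather than a shell string avoids quoting problems in paths.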

Step 2: Configure Server Parameters

Determine the server configuration including model path, context size, number of parallel slots, host/port binding, and optional features. Key parameters control the trade-off between concurrency, memory usage, and response quality.

Key considerations:

Number of parallel slots (--parallel) determines concurrent request capacity
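The context size (--ctx-size) is shared across all slots; when it is split evenly, each slot gets n_ctx / n_parallel tokens, so raise --ctx-size when adding slots
GPU layer offloading (--n-gpu-layers) sets how many model layers run on the GPU; offloading more layers generally speeds up inference at the cost of VRAM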
API key authentication can be enabled with --api-key
Flash attention (--flash-attn) reduces memory usage per slot
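
The parameters above can be collected into a launch command. This is a minimal sketch, not a complete configuration: the helper names server_args and per_slot_context, the default values, and the model path in the usage note are hypothetical, and the per-slot figure assumes the total context is divided evenly across slots. --flash-attn is left out because its argument form differs between releases.

```python
def server_args(model_path, ctx_size=8192, parallel=4, gpu_layers=0,
                host="127.0.0.1", port=8080, api_key=None):
    """Build an argument list for llama-server from the chosen settings."""
    if parallel < 1 or ctx_size < parallel:
        raise ValueError("need parallel >= 1 and ctx_size >= parallel")
    args = [
        "llama-server",
        "--model", str(model_path),
        "--ctx-size", str(ctx_size),        # total context shared by all slots
        "--parallel", str(parallel),        # number of concurrent slots
        "--n-gpu-layers", str(gpu_layers),  # layers offloaded to the GPU
        "--host", host,
        "--port", str(port),
    ]
    if api_key:
        # Only require a key when one was supplied.
        args += ["--api-key", api_key]
    return args


def per_slot_context(ctx_size, parallel):
    """Tokens available to each slot, assuming an even split of the context."""
    return ctx_size // parallel
```

For example, server_args("models/model.gguf", ctx_size=8192, parallel=4) pairs with per_slot_context(8192, 4), which gives 2048 tokens per slot under the even-split assumption.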

Step 3: Start the Server

Launch the llama-server process with the configured parameters. The server loads the model, initializes the specified number of processing slots, and begins listening for HTTP requests on the configured host and port.

Key considerations:

Model loading time depends on model size and storage speed
The server logs slot allocation and memory usage on startup
Health endpoint (/health) can be used for readiness checks
Router mode allows serving multiple models with dynamic loading
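
A readiness check against /health can be sketched as a polling loop. The sketch assumes the endpoint answers 200 once the model is loaded and a non-200 status (such as 503) while it is still loading; the function names, attempt count, and interval are hypothetical choices.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout_s=2.0):
    """Return the HTTP status code for a GET, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code   # non-2xx answer, e.g. while the model is loading
    except (urllib.error.URLError, OSError):
        return None       # server is not accepting connections yet


def wait_until_ready(base_url, attempts=60, interval_s=1.0,
                     fetch_status=http_status, sleep=time.sleep):
    """Poll /health until it returns 200; True if ready, False on give-up."""
    url = base_url.rstrip("/") + "/health"
    for attempt in range(attempts):
        if fetch_status(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(interval_s)   # back off before the next probe
    return False
```

The fetch_status and sleep parameters exist so the loop can be exercised without a running server; in normal use, wait_until_ready("http://127.0.0.1:8080") is enough.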

Step 4: Send API Requests

Clients send HTTP requests to the server's API endpoints following the OpenAI API format. The server routes requests to available processing slots, handles tokenization and inference, and returns responses in the standard OpenAI JSON format.

Primary endpoints:

POST /v1/chat/completions: Multi-turn chat with message history
POST /v1/completions: Single-turn text completion
POST /v1/embeddings: Generate embedding vectors
GET /v1/models: List available models
GET /health: Server health status
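
A chat completion request can be built with the standard library alone. This is a minimal sketch, assuming the server listens at a base URL such as http://127.0.0.1:8080 and that any API key is sent as a Bearer token; the helper names chat_request and first_reply are hypothetical.

```python
import json
import urllib.request


def chat_request(base_url, messages, api_key=None, stream=False, **extra):
    """Build a POST request for /v1/chat/completions (not yet sent)."""
    # Extra keyword arguments (e.g. model, temperature) pass through unchanged.
    payload = {"messages": messages, "stream": stream, **extra}
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def first_reply(response_body):
    """Extract the assistant text from a non-streaming response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]
```

To send the request, open it with urllib.request.urlopen(req) and pass resp.read() to first_reply. Existing OpenAI client libraries can be used instead by pointing their base URL at the server.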

Key considerations:

Streaming responses use Server-Sent Events (SSE) format
Grammar constraints can be applied via the grammar or json_schema parameters
Tool/function calling follows the OpenAI function calling protocol
Image and audio inputs are supported for multimodal models
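
Streamed responses arrive as SSE lines of the form "data: {json}", ending with "data: [DONE]". The sketch below extracts the text deltas from such lines, assuming the OpenAI chat-completion chunk layout (choices[].delta.content); the function name iter_sse_content is hypothetical.

```python
import json


def iter_sse_content(lines):
    """Yield text deltas from the SSE lines of a streamed chat completion."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue                      # skip blank separators and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break                         # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:                      # role-only or empty deltas carry no text
                yield text
```

An HTTP response object returned by urllib.request.urlopen is iterable line by line, so it can be passed to iter_sse_content directly when the request was sent with "stream": true.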

Step 5: Monitor and Manage

Monitor server performance and slot utilization using the built-in metrics and status endpoints. The server exposes Prometheus-compatible metrics for integration with monitoring systems.

Key considerations:

GET /metrics provides Prometheus metrics (requests, tokens, latency) when the server is started with --metrics
GET /slots shows current slot status and processing details
The web UI at the root URL provides interactive testing
Server supports graceful shutdown and slot management
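
The /metrics output uses the Prometheus text exposition format, which can be read without a Prometheus server for quick checks. This is a minimal sketch that handles only unlabelled samples; the function name parse_metrics is hypothetical, and the sample text and its values are illustrative, not captured from a real server.

```python
# Illustrative sample only; real output contains more metrics.
SAMPLE_METRICS = """\
# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 128
llamacpp:requests_processing 2
"""


def parse_metrics(text):
    """Parse unlabelled samples from Prometheus text format into a dict."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip HELP/TYPE comments and blanks
        parts = line.split()
        if len(parts) < 2 or "{" in parts[0]:
            continue                      # labelled samples are out of scope here
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue                      # ignore lines that are not samples
    return samples
```

Calling parse_metrics on the body of GET /metrics gives a name-to-value dictionary that a script can compare between two polls, for example to estimate token throughput.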
