Workflow:Ggml org Llama cpp OpenAI Compatible Server
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Inference, API_Server, Deployment |
| Last Updated | 2026-02-14 22:00 GMT |
Overview
End-to-end process for deploying a GGUF model as an OpenAI-compatible HTTP server with multi-user support, streaming, and continuous batching.
Description
This workflow covers deploying a language model as an HTTP API server that implements the OpenAI API specification. The server provides endpoints for chat completions, text completions, embeddings, and reranking, making it a drop-in replacement for the OpenAI API with local model inference. It supports concurrent multi-user access through a slot-based system with continuous batching, streaming responses via Server-Sent Events, grammar-constrained output (JSON mode, function calling), multimodal inputs (images, audio), speculative decoding for faster generation, and a built-in web UI for interactive use.
Usage
Execute this workflow when you need to serve a GGUF model as an API endpoint for applications that consume the OpenAI API format. This is appropriate for local development servers, production deployments, integration with existing OpenAI-compatible clients and frameworks (LangChain, OpenAI Python SDK), or multi-user inference serving.
Execution Steps
Step 1: Build the Server Binary
Compile the llama-server binary from the llama.cpp source. The server is built as part of the standard CMake build and links against the core llama library plus an embedded HTTP server library (cpp-httplib).
Key considerations:
- Enable GPU backend support (CUDA, Metal, Vulkan) at build time for GPU acceleration
- TLS/HTTPS support requires OpenSSL development libraries
- Pre-built binaries and Docker images are available as alternatives to building from source
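The build step can be sketched as two CMake invocations: one to configure and one to build the server target. The helper below only assembles the argument lists and does not run them. The backend option names (GGML_CUDA, GGML_METAL, GGML_VULKAN) and the `build` directory are assumptions taken from the usual llama.cpp build instructions, so check them against the version you are building.

```python
import shlex

# Backend name -> CMake option. Option names are assumed from the llama.cpp
# build instructions; confirm them against your checkout.
BACKEND_FLAGS = {
    "cpu": [],
    "cuda": ["-DGGML_CUDA=ON"],
    "metal": ["-DGGML_METAL=ON"],
    "vulkan": ["-DGGML_VULKAN=ON"],
}


def cmake_commands(backend="cpu", build_dir="build"):
    """Return the configure and build commands as argument lists."""
    if backend not in BACKEND_FLAGS:
        raise ValueError(f"unknown backend: {backend!r}")
    configure = ["cmake", "-B", build_dir, *BACKEND_FLAGS[backend]]
    build = ["cmake", "--build", build_dir, "--config", "Release",
             "--target", "llama-server"]
    return configure, build


# Print the commands so they can be reviewed or pasted into a shell.
for cmd in cmake_commands("cuda"):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the flags have been verified for your platform.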
Step 2: Configure Server Parameters
Determine the server configuration including model path, context size, number of parallel slots, host/port binding, and optional features. Key parameters control the trade-off between concurrency, memory usage, and response quality.
Key considerations:
- Number of parallel slots (--parallel) determines concurrent request capacity
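- Context size (--ctx-size) is a total that is shared across all slots, so each slot gets roughly n_ctx / n_parallel tokens; raise the total when adding slots to keep the per-request context constant (verify against your build, since KV-cache handling has changed across versions)
- GPU layer offloading (--n-gpu-layers) sets how many model layers run on the GPU; offloading more layers generally speeds up inference at the cost of VRAM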
- API key authentication can be enabled with --api-key
- Flash attention (--flash-attn) reduces memory usage per slot
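The parameters above can be collected in one place and turned into a command line. The sketch below is a hypothetical helper, not part of llama.cpp: the `ServerConfig` class and its defaults are invented for illustration, and `per_slot_ctx` assumes the total context is split evenly across slots. Flash attention is left out because the form of that flag differs between versions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ServerConfig:
    """Hypothetical container for llama-server settings."""
    model_path: str
    ctx_size: int = 8192              # total context, passed as --ctx-size
    parallel: int = 4                 # number of slots, passed as --parallel
    n_gpu_layers: int = 0             # layers offloaded to the GPU
    host: str = "127.0.0.1"
    port: int = 8080
    api_key: Optional[str] = None     # omitted from argv when unset

    def per_slot_ctx(self) -> int:
        # Assumption: the total context is divided evenly among slots.
        return self.ctx_size // self.parallel

    def to_argv(self) -> List[str]:
        argv = [
            "llama-server",
            "--model", self.model_path,
            "--ctx-size", str(self.ctx_size),
            "--parallel", str(self.parallel),
            "--n-gpu-layers", str(self.n_gpu_layers),
            "--host", self.host,
            "--port", str(self.port),
        ]
        if self.api_key:
            argv += ["--api-key", self.api_key]
        return argv
```

With `ctx_size=16384` and `parallel=4`, `per_slot_ctx()` gives 4096 tokens per request under the even-split assumption, which makes the concurrency versus context trade-off explicit before the server is started.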
Step 3: Start the Server
Launch the llama-server process with the configured parameters. The server loads the model, initializes the specified number of processing slots, and begins listening for HTTP requests on the configured host and port.
Key considerations:
- Model loading time depends on model size and storage speed
- The server logs slot allocation and memory usage on startup
- Health endpoint (/health) can be used for readiness checks
- Router mode allows serving multiple models with dynamic loading
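Because model loading can take a while, clients usually wait for the health endpoint before sending traffic. The following is a minimal readiness-polling sketch using only the standard library. The function names, the retry count, and the assumption that a ready server answers GET /health with HTTP 200 (and an error status while still loading) are illustrative and should be checked against your server version.

```python
import time
import urllib.error
import urllib.request


def health_probe(base_url, timeout=2.0):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an error status (urllib raises
        # HTTPError for responses such as 503).
        return False


def wait_until_ready(probe, attempts=30, interval=1.0, sleep=time.sleep):
    """Call probe() up to `attempts` times; return True once it succeeds."""
    for i in range(attempts):
        if probe():
            return True
        if i < attempts - 1:
            sleep(interval)  # no sleep after the final failed attempt
    return False
```

A typical call would be `wait_until_ready(lambda: health_probe("http://127.0.0.1:8080"))`, where the URL is whatever host and port the server was started with.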
Step 4: Send API Requests
Clients send HTTP requests to the server's API endpoints following the OpenAI API format. The server routes requests to available processing slots, handles tokenization and inference, and returns responses in the standard OpenAI JSON format.
Primary endpoints:
- POST /v1/chat/completions: Multi-turn chat with message history
- POST /v1/completions: Single-turn text completion
- POST /v1/embeddings: Generate embedding vectors
- GET /v1/models: List available models
- GET /health: Server health status
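As an illustration of the request format, the sketch below builds a POST /v1/chat/completions request with the standard library. The helper names are hypothetical, and the base URL and API key are placeholders. The bearer-token header follows the OpenAI convention and is only needed when the server was started with --api-key.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, api_key=None, stream=False, **params):
    """Build (but do not send) a POST /v1/chat/completions request."""
    # Extra keyword arguments (e.g. temperature, max_tokens) go into the body.
    payload = {"messages": messages, "stream": stream, **params}
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request, timeout=60):
    """Send a non-streaming request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For a non-streaming call, the reply text is expected at `response["choices"][0]["message"]["content"]`, as in the OpenAI response format.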
Key considerations:
- Streaming responses use Server-Sent Events (SSE) format
- Grammar constraints can be applied via the grammar or json_schema parameters
- Tool/function calling follows the OpenAI function calling protocol
- Image and audio inputs are supported for multimodal models
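Streaming responses arrive as Server-Sent Events, one `data:` line per chunk. The parser below is a minimal sketch that assumes the OpenAI streaming conventions: each chunk carries text in `choices[].delta.content`, and the stream ends with a `data: [DONE]` line. The function name is illustrative.

```python
import json


def iter_stream_text(lines):
    """Yield content fragments from OpenAI-style SSE lines (str or bytes)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, or other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:  # role-only or empty deltas carry no text
                yield text
```

The response object returned by `urllib.request.urlopen` iterates line by line, so it can be passed to `iter_stream_text` directly when the request was sent with `"stream": true`.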
Step 5: Monitor and Manage
Monitor server performance and slot utilization using the built-in metrics and status endpoints. The server exposes Prometheus-compatible metrics for integration with monitoring systems.
Key considerations:
- GET /metrics provides Prometheus metrics (requests, tokens, latency); the endpoint must be enabled at startup with the --metrics flag
- GET /slots shows current slot status and processing details
- The web UI at the root URL provides interactive testing
- Server supports graceful shutdown and slot management
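For quick checks without a full Prometheus setup, the text returned by the metrics endpoint can be parsed directly. The sketch below handles only simple unlabelled samples in the Prometheus text exposition format. The metric names in `SAMPLE` are illustrative placeholders, not a guaranteed list of what the server exports.

```python
def parse_prometheus_text(text):
    """Parse simple `name value` samples from Prometheus text format."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) < 2:
            continue
        name, value = parts[0], parts[1]
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # ignore lines this simple parser cannot read
    return metrics


# Illustrative payload; real metric names and values depend on the server.
SAMPLE = """\
# HELP llamacpp:prompt_tokens_total Number of prompt tokens processed.
# TYPE llamacpp:prompt_tokens_total counter
llamacpp:prompt_tokens_total 1280
# TYPE llamacpp:requests_processing gauge
llamacpp:requests_processing 2
"""

print(parse_prometheus_text(SAMPLE))
```

Polling this at a fixed interval and differencing the counters gives a rough throughput figure; for anything beyond ad hoc inspection, a Prometheus scraper is the better fit.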