Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Ggml org Llama cpp OpenAI Compatible Server

From Leeroopedia
Knowledge Sources
Domains LLMs, Inference, API_Server, Deployment
Last Updated 2026-02-14 22:00 GMT

Overview

End-to-end process for deploying a GGUF model as an OpenAI-compatible HTTP server with multi-user support, streaming, and continuous batching.

Description

This workflow covers deploying a language model as an HTTP API server that implements the OpenAI API specification. The server provides endpoints for chat completions, text completions, embeddings, and reranking, making it a drop-in replacement for the OpenAI API with local model inference. It supports concurrent multi-user access through a slot-based system with continuous batching, streaming responses via Server-Sent Events, grammar-constrained output (JSON mode, function calling), multimodal inputs (images, audio), speculative decoding for faster generation, and a built-in web UI for interactive use.

Usage

Execute this workflow when you need to serve a GGUF model as an API endpoint for applications that consume the OpenAI API format. This is appropriate for local development servers, production deployments, integration with existing OpenAI-compatible clients and frameworks (LangChain, OpenAI Python SDK), or multi-user inference serving.

Execution Steps

Step 1: Build the Server Binary

Compile the llama-server binary from the llama.cpp source. The server is built as part of the standard CMake build and links against the core llama library plus an embedded HTTP server (httplib).

Key considerations:

  • Enable GPU backend support (CUDA, Metal, Vulkan) at build time for GPU acceleration
  • TLS/HTTPS support requires OpenSSL development libraries
  • Pre-built binaries and Docker images are available as alternatives to building from source

Step 2: Configure Server Parameters

Determine the server configuration including model path, context size, number of parallel slots, host/port binding, and optional features. Key parameters control the trade-off between concurrency, memory usage, and response quality.

Key considerations:

  • Number of parallel slots (--parallel) determines concurrent request capacity
  • Context size is shared across all slots (total_ctx = n_ctx * n_parallel)
  • GPU layer offloading (--n-gpu-layers) controls inference speed
  • API key authentication can be enabled with --api-key
  • Flash attention (--flash-attn) reduces memory usage per slot

Step 3: Start the Server

Launch the llama-server process with the configured parameters. The server loads the model, initializes the specified number of processing slots, and begins listening for HTTP requests on the configured host and port.

Key considerations:

  • Model loading time depends on model size and storage speed
  • The server logs slot allocation and memory usage on startup
  • Health endpoint (/health) can be used for readiness checks
  • Router mode allows serving multiple models with dynamic loading

Step 4: Send API Requests

Clients send HTTP requests to the server's API endpoints following the OpenAI API format. The server routes requests to available processing slots, handles tokenization and inference, and returns responses in the standard OpenAI JSON format.

Primary endpoints:

  • POST /v1/chat/completions: Multi-turn chat with message history
  • POST /v1/completions: Single-turn text completion
  • POST /v1/embeddings: Generate embedding vectors
  • GET /v1/models: List available models
  • GET /health: Server health status

Key considerations:

  • Streaming responses use Server-Sent Events (SSE) format
  • Grammar constraints can be applied via the grammar or json_schema parameters
  • Tool/function calling follows the OpenAI function calling protocol
  • Image and audio inputs are supported for multimodal models

Step 5: Monitor and Manage

Monitor server performance and slot utilization using the built-in metrics and status endpoints. The server exposes Prometheus-compatible metrics for integration with monitoring systems.

Key considerations:

  • GET /metrics provides Prometheus metrics (requests, tokens, latency)
  • GET /slots shows current slot status and processing details
  • The web UI at the root URL provides interactive testing
  • Server supports graceful shutdown and slot management

Execution Diagram

GitHub URL

Workflow Repository