Principle:Ggml org Llama cpp OpenAI API Endpoints
| Field | Value |
|---|---|
| Principle Name | OpenAI API Endpoints |
| Domain | API Design, OpenAI Compatibility Layer |
| Description | Theory of OpenAI-compatible API design: chat completions, completions, embeddings, and multi-provider support |
| Related Workflow | OpenAI_Compatible_Server (CORE) |
Overview
Description
The OpenAI API Endpoints principle defines the theory behind implementing an API layer that is wire-compatible with the OpenAI REST API specification. This enables applications built for the OpenAI API to seamlessly switch to a locally-hosted llama.cpp server without code changes. The compatibility layer extends beyond OpenAI to also support Anthropic Messages API format and Ollama-specific endpoints.
The core API endpoints include:
- /v1/chat/completions: Chat-based text generation with message history, supporting streaming via Server-Sent Events (SSE).
- /v1/completions: Raw text completion without chat formatting, supporting both native and OAI-compatible response formats.
- /v1/embeddings: Vector embedding extraction from input text, returning dense vectors in OpenAI-compatible JSON format.
- /v1/responses: OpenAI Responses API format, internally converted to chat completions.
- /v1/messages: Anthropic Messages API format, internally converted to OpenAI chat completions for processing.
- /v1/models: Model listing endpoint returning metadata about loaded models.
Usage
These endpoints are used by any HTTP client that implements the OpenAI API protocol. Common use cases include:
- Drop-in replacement for OpenAI API calls in applications using OpenAI client libraries
- Integration with frameworks like LangChain, LlamaIndex, or AutoGen that target the OpenAI API
- Local development and testing without cloud API dependencies
- Privacy-sensitive deployments where data must not leave the local network
Theoretical Basis
Protocol translation is the central design pattern. Rather than implementing each provider's protocol natively, the server converts incoming requests from various formats (Anthropic, Ollama, OpenAI Responses) into a unified internal representation, processes them through a single inference pipeline, and then formats the response back into the expected provider format. This is implemented through functions like oaicompat_chat_params_parse(), convert_anthropic_to_oai(), and convert_responses_to_chatcmpl().
Response type tagging allows the same inference pipeline to produce different output formats. Each request carries a TASK_RESPONSE_TYPE tag (e.g., TASK_RESPONSE_TYPE_OAI_CHAT, TASK_RESPONSE_TYPE_OAI_CMPL, TASK_RESPONSE_TYPE_ANTHROPIC) that determines how the raw inference output is formatted before being sent to the client.
Chat template application bridges the gap between OpenAI's structured message format and the flat token sequences expected by language models. The server uses model-specific chat templates (Jinja-based or built-in) to convert the messages array into a properly formatted prompt string, handling system messages, tool calls, and multi-turn conversation history.
Streaming via Server-Sent Events (SSE) provides real-time token delivery for long-running generations. The server sends incremental data: chunks as tokens are generated, terminated by a data: [DONE] sentinel, matching the OpenAI streaming protocol exactly.
Unified task queue decouples HTTP request handling from inference execution. Each API request creates one or more server_task objects that are posted to a shared task queue, processed by the inference loop, and their results collected asynchronously. This design enables request batching and parallel slot utilization.