Workflow: Ollama OpenAI API Compatibility
| Knowledge Sources | |
|---|---|
| Domains | LLMs, API_Integration, Inference |
| Last Updated | 2026-02-14 22:00 GMT |
Overview
End-to-end process for using Ollama as a drop-in replacement for the OpenAI API, enabling existing OpenAI client libraries and applications to use local models without code changes.
Description
This workflow covers the integration of Ollama with applications that expect the OpenAI API format. Ollama implements HTTP middleware that translates between OpenAI-compatible request/response formats and its internal API. This enables tools like the OpenAI Python SDK, LangChain, and other frameworks to connect to a local Ollama server by simply changing the base URL. The compatibility layer supports chat completions, text completions, embeddings, model listing, and the newer Responses API, including streaming, tool calling, and structured output.
Usage
Execute this workflow when you have an existing application built against the OpenAI API and want to switch to local inference without modifying client code. Also use this when integrating with frameworks (LangChain, LlamaIndex, Vercel AI SDK) that have built-in OpenAI support but not native Ollama support.
Execution Steps
Step 1: Server Configuration
Start the Ollama server, which automatically registers both the native API routes and the OpenAI-compatible routes. The OpenAI compatibility endpoints are mounted under the /v1/ path prefix (e.g., /v1/chat/completions, /v1/models). No additional configuration is required to enable OpenAI compatibility; it is available by default on every Ollama server instance.
Key considerations:
- OpenAI endpoints are available at the same host:port as the native API
- The /v1/ prefix follows OpenAI's API versioning convention
- Both OpenAI and Anthropic compatibility layers are active simultaneously
- CORS is configured to allow cross-origin requests for browser-based clients
Step 2: Request Translation
When an OpenAI-format request arrives, the middleware translates it into Ollama's internal request format. This involves mapping OpenAI-specific fields to their Ollama equivalents: model names, message roles, tool definitions, response format constraints, and generation parameters (temperature, top_p, max_tokens, stop sequences). The translation preserves semantic meaning while adapting to Ollama's parameter naming.
Key considerations:
- Model names are passed through directly (Ollama model names work in the OpenAI model field)
- Tool/function calling definitions are mapped to Ollama's tool format
- The response_format field (JSON mode, JSON schema) maps to Ollama's format parameter
- Stream options and usage reporting flags are translated
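A minimal sketch of the field mapping described above, written as a standalone function rather than Ollama's actual middleware code. The Ollama-side field names (options.num_predict, options.temperature, top-level format) follow the native /api/chat request shape; the function name is illustrative.

```python
def openai_to_ollama(req: dict) -> dict:
    """Illustrative sketch: map an OpenAI chat request to a native-style one."""
    options = {}
    if "temperature" in req:
        options["temperature"] = req["temperature"]
    if "top_p" in req:
        options["top_p"] = req["top_p"]
    if "max_tokens" in req:
        options["num_predict"] = req["max_tokens"]  # OpenAI max_tokens -> num_predict
    if "stop" in req:
        stop = req["stop"]
        options["stop"] = [stop] if isinstance(stop, str) else stop

    out = {
        "model": req["model"],          # model names pass through unchanged
        "messages": req["messages"],
        "options": options,
        "stream": req.get("stream", False),
    }
    rf = req.get("response_format")
    if rf and rf.get("type") == "json_object":
        out["format"] = "json"          # JSON mode -> Ollama's format parameter
    elif rf and rf.get("type") == "json_schema":
        out["format"] = rf["json_schema"]["schema"]  # schema constrains output
    if "tools" in req:
        out["tools"] = req["tools"]     # tool definitions map to Ollama's tool format
    return out
```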
Step 3: Internal Inference Dispatch
The translated request is dispatched to Ollama's standard inference pipeline. The scheduler loads the requested model (or reuses an already-loaded instance), constructs the prompt using the model's template, runs inference, and produces the response stream. This step is identical to processing a native Ollama API request.
Key considerations:
- All Ollama features (GPU scheduling, KV cache reuse, thinking mode) work transparently
- The middleware is stateless; each request is independently translated
- Model capabilities (vision, tools, embeddings) are respected during dispatch
Step 4: Response Translation
The Ollama response is translated back into the OpenAI response format. For streaming responses, each Ollama chunk is converted into an OpenAI Server-Sent Events (SSE) chunk with the expected data structure (choices array, delta objects, finish_reason). For non-streaming responses, the complete response is packaged as an OpenAI ChatCompletion or Completion object with usage statistics.
Key considerations:
- Streaming uses SSE format with "data: " prefix and "[DONE]" terminator
- Tool calls in the response are formatted as OpenAI tool_call objects with IDs
- Token usage statistics (prompt_tokens, completion_tokens, total_tokens) are computed
- The finish_reason field maps Ollama's done_reason to OpenAI values (stop, length, tool_calls)
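The chunk conversion above can be sketched as a pure function. This is not Ollama's actual source; the input field names (message.content, done, done_reason) are assumptions based on the native /api/chat streaming response shape.

```python
import json

# Map native done_reason values onto OpenAI finish_reason values.
FINISH_REASONS = {"stop": "stop", "length": "length"}

def to_sse_chunk(ollama_chunk: dict, model: str) -> str:
    """Illustrative sketch: one native streaming chunk -> one OpenAI SSE line."""
    done = ollama_chunk.get("done", False)
    delta = {} if done else {"content": ollama_chunk["message"]["content"]}
    choice = {
        "index": 0,
        "delta": delta,
        # finish_reason is null until the final chunk of the stream
        "finish_reason": FINISH_REASONS.get(ollama_chunk.get("done_reason"))
        if done else None,
    }
    payload = {"object": "chat.completion.chunk", "model": model, "choices": [choice]}
    return f"data: {json.dumps(payload)}\n\n"
    # After the last chunk, the stream ends with the literal "data: [DONE]\n\n".
```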
Step 5: Embeddings and Model Listing
The compatibility layer also handles non-inference endpoints. The /v1/embeddings endpoint translates embedding requests to Ollama's embed API and returns vectors in OpenAI format (supporting both float and base64 encoding). The /v1/models endpoint lists available local models in the OpenAI model listing format. The /v1/models/{id} endpoint returns details for a specific model.
Key considerations:
- Embedding responses include the embedding vector, model name, and usage statistics
- Model listing returns all locally available models with their metadata
- The Responses API (/v1/responses) provides a newer interface with background execution support
- Model IDs in OpenAI format map directly to Ollama model names
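The embedding translation can be sketched the same way. The input shape (embeddings list plus prompt_eval_count) follows the native /api/embed response; the function name and float-only handling are illustrative assumptions, and base64 encoding is omitted for brevity.

```python
def embed_to_openai(ollama_resp: dict, model: str) -> dict:
    """Illustrative sketch: wrap native embed output in OpenAI embedding format."""
    data = [
        {"object": "embedding", "index": i, "embedding": vec}
        for i, vec in enumerate(ollama_resp["embeddings"])
    ]
    tokens = ollama_resp.get("prompt_eval_count", 0)
    return {
        "object": "list",
        "data": data,
        "model": model,  # model ID maps directly to the Ollama model name
        "usage": {"prompt_tokens": tokens, "total_tokens": tokens},
    }
```

A client using the OpenAI SDK would reach this path via client.embeddings.create(model=..., input=...) against the /v1/embeddings endpoint, with no Ollama-specific code.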