Workflow: Ollama OpenAI API Compatibility
| Knowledge Sources | |
|---|---|
| Domains | LLMs, API_Integration, Inference |
| Last Updated | 2026-02-14 22:00 GMT |
Overview
End-to-end process for using Ollama as a drop-in replacement for the OpenAI API, enabling existing OpenAI client libraries and applications to use local models without code changes.
Description
This workflow covers the integration of Ollama with applications that expect the OpenAI API format. Ollama implements HTTP middleware that translates between OpenAI-compatible request/response formats and its internal API. This enables tools like the OpenAI Python SDK, LangChain, and other frameworks to connect to a local Ollama server by simply changing the base URL. The compatibility layer supports chat completions, text completions, embeddings, model listing, and the newer Responses API, including streaming, tool calling, and structured output.
Usage
Execute this workflow when you have an existing application built against the OpenAI API and want to switch to local inference without modifying client code. Also use this when integrating with frameworks (LangChain, LlamaIndex, Vercel AI SDK) that have built-in OpenAI support but not native Ollama support.
Execution Steps
Step 1: Server Configuration
Start the Ollama server, which automatically registers both the native API routes and the OpenAI-compatible routes. The OpenAI compatibility endpoints are mounted under the /v1/ path prefix (e.g., /v1/chat/completions, /v1/models). No additional configuration is required to enable OpenAI compatibility; it is available by default on every Ollama server instance.
Key considerations:
- OpenAI endpoints are available at the same host:port as the native API
- The /v1/ prefix follows OpenAI's API versioning convention
- Both OpenAI and Anthropic compatibility layers are active simultaneously
- CORS is configured to allow cross-origin requests for browser-based clients
Step 2: Request Translation
When an OpenAI-format request arrives, the middleware translates it into Ollama's internal request format. This involves mapping OpenAI-specific fields to their Ollama equivalents: model names, message roles, tool definitions, response format constraints, and generation parameters (temperature, top_p, max_tokens, stop sequences). The translation preserves semantic meaning while adapting to Ollama's parameter naming.
Key considerations:
- Model names are passed through directly (Ollama model names work in the OpenAI model field)
- Tool/function calling definitions are mapped to Ollama's tool format
- The response_format field (JSON mode, JSON schema) maps to Ollama's format parameter
- Stream options and usage reporting flags are translated
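A minimal sketch of the field mapping described above, written as a standalone function rather than Ollama's actual middleware code. The Ollama-side field names (options.num_predict, options.temperature, top-level format) follow the native /api/chat request shape; the function name is illustrative.

```python
def openai_to_ollama(req: dict) -> dict:
    """Illustrative sketch: map an OpenAI chat request to a native-style one."""
    options = {}
    if "temperature" in req:
        options["temperature"] = req["temperature"]
    if "top_p" in req:
        options["top_p"] = req["top_p"]
    if "max_tokens" in req:
        options["num_predict"] = req["max_tokens"]  # OpenAI max_tokens -> num_predict
    if "stop" in req:
        stop = req["stop"]
        options["stop"] = [stop] if isinstance(stop, str) else stop

    out = {
        "model": req["model"],          # model names pass through unchanged
        "messages": req["messages"],
        "options": options,
        "stream": req.get("stream", False),
    }
    rf = req.get("response_format")
    if rf and rf.get("type") == "json_object":
        out["format"] = "json"          # JSON mode -> Ollama's format parameter
    elif rf and rf.get("type") == "json_schema":
        out["format"] = rf["json_schema"]["schema"]  # schema constrains output
    if "tools" in req:
        out["tools"] = req["tools"]     # tool definitions map to Ollama's tool format
    return out
```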
Step 3: Internal Inference Dispatch
The translated request is dispatched to Ollama's standard inference pipeline. The scheduler loads the requested model (or reuses an already-loaded instance), constructs the prompt using the model's template, runs inference, and produces the response stream. This step is identical to processing a native Ollama API request.
Key considerations:
- All Ollama features (GPU scheduling, KV cache reuse, thinking mode) work transparently
- The middleware is stateless; each request is independently translated
- Model capabilities (vision, tools, embeddings) are respected during dispatch
Step 4: Response Translation
The Ollama response is translated back into the OpenAI response format. For streaming responses, each Ollama chunk is converted into an OpenAI Server-Sent Events (SSE) chunk with the expected data structure (choices array, delta objects, finish_reason). For non-streaming responses, the complete response is packaged as an OpenAI ChatCompletion or Completion object with usage statistics.
Key considerations:
- Streaming uses SSE format with "data: " prefix and "[DONE]" terminator
- Tool calls in the response are formatted as OpenAI tool_call objects with IDs
- Token usage statistics (prompt_tokens, completion_tokens, total_tokens) are computed
- The finish_reason field maps Ollama's done_reason to OpenAI values (stop, length, tool_calls)
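The chunk conversion above can be sketched as a pure function. This is not Ollama's actual source; the input field names (message.content, done, done_reason) are assumptions based on the native /api/chat streaming response shape.

```python
import json

# Map native done_reason values onto OpenAI finish_reason values.
FINISH_REASONS = {"stop": "stop", "length": "length"}

def to_sse_chunk(ollama_chunk: dict, model: str) -> str:
    """Illustrative sketch: one native streaming chunk -> one OpenAI SSE line."""
    done = ollama_chunk.get("done", False)
    delta = {} if done else {"content": ollama_chunk["message"]["content"]}
    choice = {
        "index": 0,
        "delta": delta,
        # finish_reason is null until the final chunk of the stream
        "finish_reason": FINISH_REASONS.get(ollama_chunk.get("done_reason"))
        if done else None,
    }
    payload = {"object": "chat.completion.chunk", "model": model, "choices": [choice]}
    return f"data: {json.dumps(payload)}\n\n"
    # After the last chunk, the stream ends with the literal "data: [DONE]\n\n".
```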
Step 5: Embeddings and Model Listing
The compatibility layer also handles non-inference endpoints. The /v1/embeddings endpoint translates embedding requests to Ollama's embed API and returns vectors in OpenAI format (supporting both float and base64 encoding). The /v1/models endpoint lists available local models in the OpenAI model listing format. The /v1/models/{id} endpoint returns details for a specific model.
Key considerations:
- Embedding responses include the embedding vector, model name, and usage statistics
- Model listing returns all locally available models with their metadata
- The Responses API (/v1/responses) provides a newer interface with background execution support
- Model IDs in OpenAI format map directly to Ollama model names
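The embedding translation can be sketched the same way. The input shape (embeddings list plus prompt_eval_count) follows the native /api/embed response; the function name and float-only handling are illustrative assumptions, and base64 encoding is omitted for brevity.

```python
def embed_to_openai(ollama_resp: dict, model: str) -> dict:
    """Illustrative sketch: wrap native embed output in OpenAI embedding format."""
    data = [
        {"object": "embedding", "index": i, "embedding": vec}
        for i, vec in enumerate(ollama_resp["embeddings"])
    ]
    tokens = ollama_resp.get("prompt_eval_count", 0)
    return {
        "object": "list",
        "data": data,
        "model": model,  # model ID maps directly to the Ollama model name
        "usage": {"prompt_tokens": tokens, "total_tokens": tokens},
    }
```

A client using the OpenAI SDK would reach this path via client.embeddings.create(model=..., input=...) against the /v1/embeddings endpoint, with no Ollama-specific code.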