Implementation:Ggml org Llama cpp Server Chat Completions
| Field | Value |
|---|---|
| Implementation Name | Server Chat Completions |
| Doc Type | API Doc |
| Domain | REST API, OpenAI Compatibility |
| Description | OpenAI-compatible API endpoints: /v1/chat/completions, /v1/completions, /v1/embeddings, and multi-provider translation |
| Related Workflow | OpenAI_Compatible_Server (CORE) |
Overview
Description
The Server Chat Completions implementation defines the core API handler lambdas registered in server_routes::init_routes(). Each handler parses the incoming HTTP request, translates it into the internal task representation, submits it to the inference task queue, and formats the response in the format of the API the client called. Requests in OpenAI, Anthropic, and Ollama formats are supported through protocol translation.
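The four stages described above (parse, translate, submit, format) can be sketched in isolation. This is a minimal illustration, not the server's code: `http_request`, `task`, `response_format`, `make_task`, `submit`, and `format_result` are hypothetical names, the queue is a plain `std::queue`, and the request body is used as the prompt verbatim instead of being parsed as JSON.

```cpp
#include <cassert>
#include <queue>
#include <string>

// Hypothetical stand-ins for the server's request, task, and response-type types.
enum class response_format { oai_chat, oai_cmpl };

struct http_request {
    std::string path;
    std::string body;
};

struct task {
    std::string     prompt;  // request content in its internal form
    response_format format;  // how the finished result should be rendered
};

// Stages 1 and 2: read the request and translate it into a task.
// Assumption: the body is taken as the prompt as-is; no JSON parsing here.
inline task make_task(const http_request & req) {
    const bool is_chat = req.path == "/v1/chat/completions";
    return task{req.body, is_chat ? response_format::oai_chat : response_format::oai_cmpl};
}

// Stage 3: submit the task to a FIFO queue.
inline void submit(std::queue<task> & q, const task & t) {
    q.push(t);
}

// Stage 4: label the generated text according to the task's response format.
// The bracketed label stands in for a full response object.
inline std::string format_result(const task & t, const std::string & text) {
    switch (t.format) {
        case response_format::oai_chat: return "[chat.completion] " + text;
        case response_format::oai_cmpl: return "[text_completion] " + text;
    }
    return text;
}
```

The point of the split is that only the first and last stages know about the client's protocol; the queued task carries a format tag so the result can be rendered later without re-reading the request.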
Usage
Clients interact with these endpoints using standard HTTP requests:
# OpenAI chat completions
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"llama","messages":[{"role":"user","content":"Hello"}]}'
# OpenAI completions
curl http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"llama","prompt":"Once upon a time"}'
# OpenAI embeddings
curl http://localhost:8080/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{"model":"llama","input":"Hello world"}'
Code Reference
| Field | Value |
|---|---|
| Source Location | tools/server/server-context.cpp:3179-3907 |
| Entry Function | void server_routes::init_routes() |
| Import | Defined within server-context static library; handlers are lambda members of server_routes |
Chat completions handler (/v1/chat/completions):
this->post_chat_completions = [this](const server_http_req & req) {
    auto res = create_response();
    std::vector<raw_buffer> files;
    json body = json::parse(req.body);
    json body_parsed = oaicompat_chat_params_parse(
        body,
        meta->chat_params,
        files);
    return handle_completions_impl(
        req,
        SERVER_TASK_TYPE_COMPLETION,
        body_parsed,
        files,
        TASK_RESPONSE_TYPE_OAI_CHAT);
};
Text completions handler (/v1/completions):
this->post_completions_oai = [this](const server_http_req & req) {
    auto res = create_response();
    std::vector<raw_buffer> files; // dummy
    const json body = json::parse(req.body);
    return handle_completions_impl(
        req,
        SERVER_TASK_TYPE_COMPLETION,
        body,
        files,
        TASK_RESPONSE_TYPE_OAI_CMPL);
};
Embeddings handler (/v1/embeddings):
this->post_embeddings_oai = [this](const server_http_req & req) {
    return handle_embeddings_impl(req, TASK_RESPONSE_TYPE_OAI_EMBD);
};
Anthropic Messages API translation (/v1/messages):
this->post_anthropic_messages = [this](const server_http_req & req) {
    auto res = create_response();
    std::vector<raw_buffer> files;
    json body = convert_anthropic_to_oai(json::parse(req.body));
    json body_parsed = oaicompat_chat_params_parse(
        body,
        meta->chat_params,
        files);
    return handle_completions_impl(
        req,
        SERVER_TASK_TYPE_COMPLETION,
        body_parsed,
        files,
        TASK_RESPONSE_TYPE_ANTHROPIC);
};
OpenAI Responses API translation (/v1/responses):
this->post_responses_oai = [this](const server_http_req & req) {
    auto res = create_response();
    std::vector<raw_buffer> files;
    json body = convert_responses_to_chatcmpl(json::parse(req.body));
    json body_parsed = oaicompat_chat_params_parse(
        body,
        meta->chat_params,
        files);
    return handle_completions_impl(
        req,
        SERVER_TASK_TYPE_COMPLETION,
        body_parsed,
        files,
        TASK_RESPONSE_TYPE_OAI_RESP);
};
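The translating handlers above share one pattern: convert the foreign request into the OpenAI chat shape, then reuse the chat path. One difference such a conversion has to bridge is that the Anthropic Messages format carries the system prompt as a top-level `system` field, while the OpenAI chat format expects it as a message with role `system`. The sketch below illustrates only that step with plain structs. It is not the body of convert_anthropic_to_oai, and `anthropic_request`, `oai_chat_request`, and `to_oai_sketch` are hypothetical names covering text-only content.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct chat_message {
    std::string role;
    std::string content;
};

// Hypothetical, simplified request shapes (text-only, no tools or images).
struct anthropic_request {
    std::string               system;      // top-level system prompt, may be empty
    std::vector<chat_message> messages;
    int                       max_tokens = 0;
};

struct oai_chat_request {
    std::vector<chat_message> messages;
    int                       max_tokens = 0;
};

// Moves the top-level system prompt into a leading "system" message and
// copies the remaining fields unchanged.
inline oai_chat_request to_oai_sketch(const anthropic_request & in) {
    oai_chat_request out;
    out.max_tokens = in.max_tokens;
    if (!in.system.empty()) {
        out.messages.push_back({"system", in.system});
    }
    for (const auto & m : in.messages) {
        out.messages.push_back(m);
    }
    return out;
}
```

Once the request is in the chat shape, only the response type tag passed to the shared implementation needs to differ so that the reply is rendered in the caller's format.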
I/O Contract
| Endpoint | Method | Request Body | Response Format |
|---|---|---|---|
| /v1/chat/completions | POST | {"model": str, "messages": [...], "stream": bool, "temperature": float, ...} | OpenAI ChatCompletion object or SSE stream |
| /v1/completions | POST | {"model": str, "prompt": str, "max_tokens": int, ...} | OpenAI Completion object or SSE stream |
| /v1/embeddings | POST | {"model": str, "input": str or [str], ...} | OpenAI Embedding object with data[].embedding arrays |
| /v1/messages | POST | Anthropic Messages format | Anthropic Messages response format |
| /v1/responses | POST | OpenAI Responses format | OpenAI Responses response format |
| /chat/completions | POST | Same as /v1/chat/completions | Same as /v1/chat/completions |
| /api/chat | POST | Ollama chat format | Same as /v1/chat/completions (Ollama-compatible) |
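Several rows in the table share a handler under more than one path. A minimal sketch of that aliasing, assuming a simple path-to-handler map: `handler` and `make_routes_sketch` are hypothetical, the handlers return a fixed label instead of doing inference, and the server's real registration mechanism is not shown in this document.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Hypothetical handler signature: request body in, response body out.
using handler = std::function<std::string(const std::string & body)>;

// Builds a route table in which aliases map to the same handler object.
inline std::map<std::string, handler> make_routes_sketch() {
    handler chat = [](const std::string &) { return std::string("chat"); };
    handler cmpl = [](const std::string &) { return std::string("completion"); };
    handler embd = [](const std::string &) { return std::string("embedding"); };
    return {
        {"/v1/chat/completions", chat},
        {"/chat/completions",    chat},  // alias without the /v1 prefix
        {"/v1/completions",      cmpl},
        {"/v1/embeddings",       embd},
    };
}
```

Registering one handler under several paths keeps the aliases behaviorally identical by construction, which matches the "Same as /v1/chat/completions" rows.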
Response type tags:
| Tag | Description |
|---|---|
| TASK_RESPONSE_TYPE_NONE | Native llama.cpp response format (legacy endpoints) |
| TASK_RESPONSE_TYPE_OAI_CHAT | OpenAI Chat Completions format |
| TASK_RESPONSE_TYPE_OAI_CMPL | OpenAI Completions format |
| TASK_RESPONSE_TYPE_OAI_EMBD | OpenAI Embeddings format |
| TASK_RESPONSE_TYPE_ANTHROPIC | Anthropic Messages format |
| TASK_RESPONSE_TYPE_OAI_RESP | OpenAI Responses format |
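A tag like these is typically consumed by a switch that picks the response envelope. The sketch below maps each tag to the top-level type marker that the corresponding public API documents (`object` for the OpenAI formats, `type` for Anthropic). The enum `response_type_sketch` and the function `envelope_tag` are hypothetical local names mirroring the table, not the server's definitions, and returning an empty string for the native format is a choice made for this example.

```cpp
#include <cassert>
#include <string>

// Local mirror of the tags in the table above (hypothetical names).
enum response_type_sketch {
    SKETCH_NONE,
    SKETCH_OAI_CHAT,
    SKETCH_OAI_CMPL,
    SKETCH_OAI_EMBD,
    SKETCH_ANTHROPIC,
    SKETCH_OAI_RESP,
};

// Returns the top-level type marker for a response of the given kind.
// Only the chat format uses a different marker for streamed chunks here.
inline std::string envelope_tag(response_type_sketch type, bool streaming) {
    switch (type) {
        case SKETCH_OAI_CHAT:  return streaming ? "chat.completion.chunk" : "chat.completion";
        case SKETCH_OAI_CMPL:  return "text_completion";
        case SKETCH_OAI_EMBD:  return "list";      // embeddings come back as a list object
        case SKETCH_ANTHROPIC: return "message";   // value of "type" in a non-streamed reply
        case SKETCH_OAI_RESP:  return "response";
        case SKETCH_NONE:      return "";          // assumption: no marker for the native format
    }
    return "";
}
```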
Usage Examples
Chat completion with streaming:
curl -N http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain quantum computing in one sentence."}
        ],
        "stream": true,
        "temperature": 0.7
    }'
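With "stream": true the reply arrives as a server-sent event stream rather than a single JSON object. A minimal client-side sketch for splitting such a stream into its JSON payloads follows, assuming the stream uses the OpenAI convention of `data: ` lines terminated by a `data: [DONE]` sentinel. `sse_payloads` is a hypothetical helper; it works on a fully buffered string and leaves JSON decoding of each payload to the caller.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Collects the payload of every "data: " line, stopping at the "[DONE]" sentinel.
// Blank separator lines and any other fields are skipped.
inline std::vector<std::string> sse_payloads(const std::string & stream) {
    std::vector<std::string> out;
    std::istringstream in(stream);
    std::string line;
    const std::string prefix = "data: ";
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') {
            line.pop_back();  // tolerate CRLF line endings
        }
        if (line.compare(0, prefix.size(), prefix) != 0) {
            continue;         // not a data line
        }
        std::string payload = line.substr(prefix.size());
        if (payload == "[DONE]") {
            break;            // end-of-stream marker
        }
        out.push_back(payload);
    }
    return out;
}
```

A real client would feed network chunks into a line buffer incrementally; buffering the whole stream first is a simplification for the example.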
Anthropic-format request (translated internally):
curl http://localhost:8080/v1/messages \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama",
        "max_tokens": 256,
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    }'
Token counting (Anthropic-compatible):
curl http://localhost:8080/v1/messages/count_tokens \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama",
        "messages": [
            {"role": "user", "content": "Count my tokens"}
        ]
    }'
# Response: {"input_tokens": 5}
Embedding extraction via OpenAI-compatible endpoint:
curl http://localhost:8080/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{
        "model": "embedding-model",
        "input": ["Hello world", "Goodbye world"]
    }'
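A request with two inputs returns two vectors in data[].embedding, and a common next step on the client is to compare them. The sketch below computes cosine similarity between two such vectors, assuming they have already been decoded from the JSON response into `std::vector<float>`. `cosine_similarity` is a hypothetical helper, and returning 0.0 for mismatched or zero-length input is a choice made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity: dot(a, b) / (|a| * |b|).
// Returns 0.0 when the vectors differ in length, are empty, or have zero norm.
inline double cosine_similarity(const std::vector<float> & a, const std::vector<float> & b) {
    if (a.size() != b.size() || a.empty()) {
        return 0.0;
    }
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot    += static_cast<double>(a[i]) * b[i];
        norm_a += static_cast<double>(a[i]) * a[i];
        norm_b += static_cast<double>(b[i]) * b[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0) {
        return 0.0;
    }
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```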