Implementation:Ggml org Llama cpp Server Chat Completions

Field	Value
Implementation Name	Server Chat Completions
Doc Type	API Doc
Domain	REST API, OpenAI Compatibility
Description	OpenAI-compatible API endpoints: `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, and multi-provider translation
Related Workflow	OpenAI_Compatible_Server (CORE)

Overview

Description

The Server Chat Completions implementation defines the core API handlers as lambda members of `server_routes`, assigned in `server_routes::init_routes()`. Each handler parses the incoming HTTP request body, translates it into the internal task representation, submits it to the inference task queue, and formats the response for the API the client called. Beyond the OpenAI Chat Completions, Completions, and Embeddings formats, the implementation accepts Anthropic Messages, OpenAI Responses, and Ollama chat requests through protocol translation.
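
The handler pattern described above (one lambda per endpoint, all funnelling into a shared implementation that differs only by a response-type tag) can be sketched in isolation. This is a minimal illustration under stated assumptions, not the server's code: `Request`, `Response`, `ResponseType`, `handle_impl`, `translate_anthropic`, and `make_routes` are hypothetical stand-ins, and the request body is treated as an opaque string instead of parsed JSON.

```cpp
#include <functional>
#include <map>
#include <string>

// Hypothetical stand-ins for the server's types; names are illustrative only.
enum class ResponseType { OaiChat, OaiCmpl, Anthropic };

struct Request  { std::string body; };
struct Response { ResponseType type; std::string payload; };

using Handler = std::function<Response(const Request &)>;

// Shared back end: every route ends up here and only the response-type
// tag differs (this plays the role that handle_completions_impl has above).
inline Response handle_impl(const std::string & normalized_body, ResponseType type) {
    return Response{type, "completion for: " + normalized_body};
}

// Placeholder for a provider-specific translation step.
inline std::string translate_anthropic(const std::string & body) {
    return "[translated]" + body;
}

// Builds a path -> handler table, one lambda per endpoint.
inline std::map<std::string, Handler> make_routes() {
    std::map<std::string, Handler> routes;
    routes["/v1/chat/completions"] = [](const Request & req) {
        return handle_impl(req.body, ResponseType::OaiChat);
    };
    routes["/v1/completions"] = [](const Request & req) {
        return handle_impl(req.body, ResponseType::OaiCmpl);
    };
    routes["/v1/messages"] = [](const Request & req) {
        // Translate first, then reuse the shared implementation.
        return handle_impl(translate_anthropic(req.body), ResponseType::Anthropic);
    };
    return routes;
}
```

Calling `make_routes().at("/v1/messages")(Request{"hi"})` in this sketch returns a `Response` tagged `ResponseType::Anthropic`. The design point is that adding a provider only requires a translation step and a new tag, not a second inference path.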

Usage

Clients interact with these endpoints using standard HTTP requests:

# OpenAI chat completions
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama","messages":[{"role":"user","content":"Hello"}]}'

# OpenAI completions
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama","prompt":"Once upon a time"}'

# OpenAI embeddings
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"llama","input":"Hello world"}'

Code Reference

Field	Value
Source Location	`tools/server/server-context.cpp:3179-3907`
Entry Function	`void server_routes::init_routes()`
Import	Defined within server-context static library; handlers are lambda members of `server_routes`

Chat completions handler (/v1/chat/completions):

this->post_chat_completions = [this](const server_http_req & req) {
    auto res = create_response();
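    std::vector<raw_buffer> files; // passed to the parser; unlike /v1/completions, not a dummy here
    json body = json::parse(req.body);
    // Normalize the OpenAI chat request before handing it to the shared implementation.
    json body_parsed = oaicompat_chat_params_parse(
        body,
        meta->chat_params,
        files);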
    return handle_completions_impl(
        req,
        SERVER_TASK_TYPE_COMPLETION,
        body_parsed,
        files,
        TASK_RESPONSE_TYPE_OAI_CHAT);
};

Text completions handler (/v1/completions):

this->post_completions_oai = [this](const server_http_req & req) {
    auto res = create_response();
    std::vector<raw_buffer> files; // dummy
    const json body = json::parse(req.body);
    return handle_completions_impl(
        req,
        SERVER_TASK_TYPE_COMPLETION,
        body,
        files,
        TASK_RESPONSE_TYPE_OAI_CMPL);
};

Embeddings handler (/v1/embeddings):

this->post_embeddings_oai = [this](const server_http_req & req) {
    return handle_embeddings_impl(req, TASK_RESPONSE_TYPE_OAI_EMBD);
};

Anthropic Messages API translation (/v1/messages):

this->post_anthropic_messages = [this](const server_http_req & req) {
    auto res = create_response();
    std::vector<raw_buffer> files;
    // Translate the Anthropic Messages body to OpenAI chat format, then reuse the chat pipeline.
    json body = convert_anthropic_to_oai(json::parse(req.body));
    json body_parsed = oaicompat_chat_params_parse(
        body,
        meta->chat_params,
        files);
    return handle_completions_impl(
        req,
        SERVER_TASK_TYPE_COMPLETION,
        body_parsed,
        files,
        TASK_RESPONSE_TYPE_ANTHROPIC);
};
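
One difference a translation step like this has to bridge is that the Anthropic Messages API carries the system prompt in a top-level `system` field, whereas the OpenAI chat format expects it as a message with role `system`. The sketch below illustrates only that one mapping. The struct shapes and the function `to_oai_chat` are hypothetical and text-only; they are not the signature or behaviour of `convert_anthropic_to_oai`, which operates on JSON.

```cpp
#include <string>
#include <vector>

struct Message { std::string role; std::string content; };

// Simplified, hypothetical request shapes (text-only content).
struct AnthropicRequest {
    std::string system;            // optional top-level system prompt
    std::vector<Message> messages; // user/assistant turns
    int max_tokens = 0;
};

struct OaiChatRequest {
    std::vector<Message> messages; // system prompt, if any, comes first
    int max_tokens = 0;
};

// Moves the top-level system prompt into a leading "system" message and
// copies the remaining turns and the token limit unchanged.
inline OaiChatRequest to_oai_chat(const AnthropicRequest & in) {
    OaiChatRequest out;
    out.max_tokens = in.max_tokens;
    if (!in.system.empty()) {
        out.messages.push_back({"system", in.system});
    }
    out.messages.insert(out.messages.end(), in.messages.begin(), in.messages.end());
    return out;
}
```

Because the output has the OpenAI chat shape, it can pass through the same parsing path as a native `/v1/chat/completions` request, which is the reuse the handler above relies on.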

OpenAI Responses API translation (/v1/responses):

this->post_responses_oai = [this](const server_http_req & req) {
    auto res = create_response();
    std::vector<raw_buffer> files;
    // Translate the Responses API body to chat-completions format, then reuse the chat pipeline.
    json body = convert_responses_to_chatcmpl(json::parse(req.body));
    json body_parsed = oaicompat_chat_params_parse(
        body,
        meta->chat_params,
        files);
    return handle_completions_impl(
        req,
        SERVER_TASK_TYPE_COMPLETION,
        body_parsed,
        files,
        TASK_RESPONSE_TYPE_OAI_RESP);
};

I/O Contract

Endpoint	Method	Request Body	Response Format
`/v1/chat/completions`	POST	`{"model": str, "messages": [...], "stream": bool, "temperature": float, ...}`	OpenAI ChatCompletion object or SSE stream
`/v1/completions`	POST	`{"model": str, "prompt": str, "max_tokens": int, ...}`	OpenAI Completion object or SSE stream
`/v1/embeddings`	POST	`{"model": str, "input": str or [str], ...}`	OpenAI Embedding object with `data[].embedding` arrays
`/v1/messages`	POST	Anthropic Messages format	Anthropic Messages response format
`/v1/messages/count_tokens`	POST	Anthropic Messages format	`{"input_tokens": int}`
`/v1/responses`	POST	OpenAI Responses format	OpenAI Responses response format
`/chat/completions`	POST	Same as `/v1/chat/completions`	Same as `/v1/chat/completions`
`/api/chat`	POST	Ollama chat format	Same as `/v1/chat/completions` (Ollama-compatible)

Response type tags:

Tag	Description
`TASK_RESPONSE_TYPE_NONE`	Native llama.cpp response format (legacy endpoints)
`TASK_RESPONSE_TYPE_OAI_CHAT`	OpenAI Chat Completions format
`TASK_RESPONSE_TYPE_OAI_CMPL`	OpenAI Completions format
`TASK_RESPONSE_TYPE_OAI_EMBD`	OpenAI Embeddings format
`TASK_RESPONSE_TYPE_ANTHROPIC`	Anthropic Messages format
`TASK_RESPONSE_TYPE_OAI_RESP`	OpenAI Responses format

Usage Examples

Chat completion with streaming:

curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in one sentence."}
    ],
    "stream": true,
    "temperature": 0.7
  }'
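
With `"stream": true`, the response arrives as server-sent events. A client needs to split the stream into individual JSON chunks. The sketch below assumes the OpenAI streaming convention: each event is a line of the form `data: <json>`, and the stream ends with `data: [DONE]`. The function name `sse_payloads` is hypothetical, and the input is assumed to be already buffered into one string. A real client would process bytes incrementally and pass each payload to a JSON parser.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Extracts the payloads of "data: " lines from a buffered SSE body.
// Stops at the "[DONE]" sentinel used by OpenAI-style streams.
inline std::vector<std::string> sse_payloads(const std::string & stream) {
    static const std::string prefix = "data: ";
    std::vector<std::string> out;
    std::istringstream in(stream);
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') {
            line.pop_back(); // tolerate CRLF line endings
        }
        if (line.compare(0, prefix.size(), prefix) != 0) {
            continue; // skip blank separators and non-data lines
        }
        std::string payload = line.substr(prefix.size());
        if (payload == "[DONE]") {
            break; // end-of-stream marker
        }
        out.push_back(payload);
    }
    return out;
}
```

Each returned string is one JSON chunk. In the OpenAI chat streaming format, the incremental text sits under `choices[0].delta.content`.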

Anthropic-format request (translated internally):

curl http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "max_tokens": 256,
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'

Token counting (Anthropic-compatible):

curl http://localhost:8080/v1/messages/count_tokens \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [
      {"role": "user", "content": "Count my tokens"}
    ]
  }'
# Response shape: {"input_tokens": <int>} (the count depends on the model's tokenizer and chat template)

Embedding extraction via OpenAI-compatible endpoint:

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "embedding-model",
    "input": ["Hello world", "Goodbye world"]
  }'
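
A common next step after fetching embeddings for several inputs is to compare them. The sketch below computes the cosine similarity between two vectors such as two `data[].embedding` arrays from the response. The function name `cosine_similarity` and the choice to return 0 for mismatched, empty, or zero-length vectors are assumptions made for illustration. JSON decoding is left out.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity of two equally sized vectors.
// Returns 0.0 when the sizes differ, the vectors are empty,
// or either vector has zero norm.
inline double cosine_similarity(const std::vector<double> & a,
                                const std::vector<double> & b) {
    if (a.size() != b.size() || a.empty()) {
        return 0.0;
    }
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot    += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    if (norm_a == 0.0 || norm_b == 0.0) {
        return 0.0;
    }
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

Values near 1 indicate that the two inputs point in nearly the same direction in embedding space. If the server already returns unit-length vectors, the dot product alone gives the same result.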

Related Pages

Principle:Ggml_org_Llama_cpp_OpenAI_API_Endpoints
