Principle:Predibase Lorax Chat Completion Generation

Knowledge Sources	OpenAI Chat Completions
Domains	Text_Generation, API_Compatibility
Last Updated	2026-02-08 02:00 GMT

Overview

A text generation endpoint that processes chat completion requests through the full inference pipeline, returning OpenAI-format responses with support for both streaming and non-streaming modes.

Description

Chat Completion Generation is the core execution step of the OpenAI-compatible API. It:

Receives a ChatCompletionRequest with messages and parameters
Extracts the adapter ID from the model field
Renders the chat template into a prompt string
Sends the prompt through the inference engine (batching, LoRA application, decoding)
Formats the output as an OpenAI-compatible ChatCompletionResponse or ChatCompletionStreamResponse

Streaming uses Server-Sent Events with delta objects, matching the OpenAI streaming format. The stream ends with a [DONE] sentinel.

Usage

Use when making chat completion API calls. The endpoint supports all standard OpenAI parameters (temperature, top_p, max_tokens, stop, seed) plus LoRAX-specific extensions (adapter_source, api_token).

Theoretical Basis

Pseudo-code:

# Chat completion pipeline
def chat_completions(request):
    params = request.try_into_generate()
    prompt = apply_chat_template(request.messages)
    if request.stream:
        for token in infer.generate_stream(prompt, params):
            yield ChatCompletionStreamResponse(delta=token)
        yield "[DONE]"
    else:
        result = infer.generate(prompt, params)
        return ChatCompletionResponse(choices=[result])

Related Pages

Implemented By

Implementation:Predibase_Lorax_Chat_Completions_Handler

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment