Principle:Predibase Lorax Chat Completion Generation
| Knowledge Sources | |
|---|---|
| Domains | Text_Generation, API_Compatibility |
| Last Updated | 2026-02-08 02:00 GMT |
Overview
A text generation endpoint that processes chat completion requests through the full inference pipeline, returning OpenAI-format responses with support for both streaming and non-streaming modes.
Description
Chat Completion Generation is the core execution step of the OpenAI-compatible API. It:
- Receives a ChatCompletionRequest with messages and parameters
- Extracts the adapter ID from the model field
- Renders the chat template into a prompt string
- Sends the prompt through the inference engine (batching, LoRA application, decoding)
- Formats the output as an OpenAI-compatible ChatCompletionResponse or ChatCompletionStreamResponse
Streaming uses Server-Sent Events with delta objects, matching the OpenAI streaming format. The stream ends with a [DONE] sentinel.
Usage
Use when making chat completion API calls. The endpoint supports all standard OpenAI parameters (temperature, top_p, max_tokens, stop, seed) plus LoRAX-specific extensions (adapter_source, api_token).
Theoretical Basis
Pseudo-code:
# Chat completion pipeline
def chat_completions(request):
params = request.try_into_generate()
prompt = apply_chat_template(request.messages)
if request.stream:
for token in infer.generate_stream(prompt, params):
yield ChatCompletionStreamResponse(delta=token)
yield "[DONE]"
else:
result = infer.generate(prompt, params)
return ChatCompletionResponse(choices=[result])