Workflow:Predibase Lorax OpenAI Chat Completion

Knowledge Sources	LoRAX LoRAX Docs OpenAI Compatible API
Domains	LLM_Ops, Inference, Chat
Last Updated	2026-02-08 03:00 GMT

Overview

End-to-end process for conducting multi-turn chat conversations through the LoRAX OpenAI-compatible Chat Completions API with dynamic LoRA adapter selection.

Description

This workflow covers using the OpenAI-compatible v1/chat/completions endpoint provided by LoRAX for multi-turn chat conversations. The standard OpenAI Python SDK can be used as a drop-in client by pointing it at the LoRAX server. The model parameter selects which LoRA adapter to apply, enabling per-conversation adapter specialization. Chat templates from HuggingFace are automatically applied to format conversation history for the model.

Usage

Execute this workflow when you want to interact with a LoRAX-served model using the familiar OpenAI SDK interface, particularly for multi-turn chat conversations. This is the recommended approach for chat-based applications where conversation history management and adapter selection per conversation are needed.

Execution Steps

Step 1: Client_Configuration

Configure the OpenAI Python SDK to connect to the LoRAX server. Replace the base_url with the LoRAX endpoint appended with /v1. The api_key can be set to any value as it is not used for authentication by default. For private adapters on Predibase or HuggingFace, pass the token as the api_key.

Key considerations:

The base_url must end with /v1 for the OpenAI-compatible API
Both synchronous and streaming modes are supported
The v1/completions endpoint is also available for non-chat use cases

Step 2: Adapter_Selection

Specify the LoRA adapter to use via the model parameter in the API request. Setting model to an empty string uses the base model without any adapter. Any adapter ID from HuggingFace Hub can be specified directly. The adapter is loaded dynamically on first use and cached for subsequent requests.

Key considerations:

The adapter must be trained on the same base model deployed in the server
Different conversations can use different adapters by changing the model parameter
Chat templates are resolved from the adapter's tokenizer if available, falling back to the base model's template

Step 3: Conversation_Construction

Build the messages array following the OpenAI chat format with role/content pairs. Supported roles are system, user, and assistant. The system message sets the behavioral context. The conversation history is passed in its entirety with each request, as the server is stateless.

What happens internally:

The messages array is converted to a single prompt string using the model's chat template
HuggingFace tokenizer chat_template Jinja2 format is applied
Special tokens for role boundaries are inserted based on the template
If no chat template exists on adapter or base model, an error is returned

Step 4: Completion_Generation

Submit the chat completion request and receive the generated response. The server processes the formatted prompt through the base model with the selected LoRA adapter applied. Generation parameters like max_tokens, temperature, and top_p control the output. Streaming mode returns tokens incrementally via SSE.

Key considerations:

The response follows the standard OpenAI ChatCompletion format
Streaming is enabled with stream=True parameter
Structured output (JSON mode) is supported via the response_format parameter

Step 5: Conversation_Management

Maintain conversation state client-side by appending the assistant's response to the messages array for subsequent turns. Each new request includes the full conversation history, allowing the model to maintain context across turns.

Key considerations:

The server is stateless; full conversation history must be sent each turn
Token limits apply to the total prompt including conversation history
Truncation may be needed for long conversations to stay within context window

Execution Diagram

GitHub URL

Workflow Repository