Workflow:Predibase Lorax Single LoRA Inference
| Knowledge Sources | |
|---|---|
| Domains | LLM_Ops, Inference, LoRA |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
End-to-end process for performing inference through a LoRAX server using a LoRA adapter that is dynamically loaded from HuggingFace Hub, S3, or the local filesystem.
Description
This workflow covers the complete request lifecycle for generating text with a fine-tuned LoRA adapter on a running LoRAX server. The process spans from constructing a request with an adapter ID, through the server's dynamic adapter loading and batched inference pipeline, to receiving the generated response. LoRAX loads LoRA adapters just-in-time without blocking concurrent requests, and supports batching requests for different adapters together via heterogeneous continuous batching with SGMV/BGMV Triton kernels.
Usage
Execute this workflow when you have a running LoRAX server (see Server_Deployment workflow) and want to generate text using a specific LoRA adapter. You need one of the following for the adapter weights: the adapter ID on HuggingFace Hub, an S3 path, or a local filesystem path.
Execution Steps
Step 1: Client_Setup
Install and configure the LoRAX Python client or prepare REST API calls. The Python client wraps the HTTP/SSE API and provides both synchronous and asynchronous interfaces. Alternatively, use curl or any HTTP client to interact with the REST endpoints directly.
Key considerations:
- The Python client is installed via pip as the lorax-client package
- Both synchronous (Client) and asynchronous (AsyncClient) clients are available
- For private adapters, include an authorization token in the client headers
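As a minimal sketch of the REST route, the snippet below prepares (without sending) a POST to the server's `/generate` endpoint using only the Python standard library. The base URL, the helper name `build_generate_request`, the placeholder token, and the bearer-token header scheme are assumptions for illustration; the `inputs`/`parameters` body shape follows the LoRAX REST API as commonly documented and should be checked against your server version.

```python
import json
import urllib.request
from typing import Optional


def build_generate_request(
    base_url: str,
    prompt: str,
    parameters: dict,
    token: Optional[str] = None,
) -> urllib.request.Request:
    """Prepare (but do not send) a POST request for the /generate endpoint."""
    headers = {"Content-Type": "application/json"}
    if token is not None:
        # Assumed bearer-token scheme for private adapters.
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps({"inputs": prompt, "parameters": parameters}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=body,
        headers=headers,
        method="POST",
    )


request = build_generate_request(
    "http://127.0.0.1:8080",       # assumed local server address
    "What is LoRA?",
    {"max_new_tokens": 32},
    token="hf_example_token",      # placeholder, not a real token
)
```

Passing `request` to `urllib.request.urlopen` would send it. With the lorax-client package installed, the Client and AsyncClient classes wrap the same HTTP call.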
Step 2: Request_Construction
Build the inference request with prompt text, adapter parameters, and generation settings. The adapter_id specifies which LoRA adapter to load, and adapter_source indicates where to find it (hub, s3, local, or pbase). Generation parameters control output behavior including max_new_tokens, temperature, top_p, top_k, repetition_penalty, and stop sequences.
Key considerations:
- adapter_source defaults to "hub" (HuggingFace Hub) if not specified
- The adapter must be trained on the same base model deployed in the server
- Omitting adapter_id uses the base model without any adapter applied
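The considerations above can be sketched as a small helper that assembles the `parameters` object of a request. The function name `build_parameters`, the default of 64 new tokens, the `stop` key for stop sequences, and the adapter ID `some-org/some-adapter` are hypothetical choices for this example. The client-side "hub" default only mirrors the server behavior described above.

```python
from typing import Any, Dict, List, Optional

# Adapter sources named in this step.
VALID_SOURCES = {"hub", "s3", "local", "pbase"}


def build_parameters(
    adapter_id: Optional[str] = None,
    adapter_source: str = "hub",
    max_new_tokens: int = 64,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    top_k: Optional[int] = None,
    repetition_penalty: Optional[float] = None,
    stop: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Assemble generation parameters, leaving out anything not set."""
    if adapter_source not in VALID_SOURCES:
        raise ValueError(f"unknown adapter_source: {adapter_source!r}")

    params: Dict[str, Any] = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        # Without an adapter_id the request targets the base model,
        # so the adapter fields are only added when one is given.
        params["adapter_id"] = adapter_id
        params["adapter_source"] = adapter_source

    optional = {
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "repetition_penalty": repetition_penalty,
        "stop": stop,  # assumed key name for stop sequences
    }
    params.update({key: value for key, value in optional.items() if value is not None})
    return params


adapter_params = build_parameters(adapter_id="some-org/some-adapter", temperature=0.7)
base_model_params = build_parameters(max_new_tokens=16)
```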
Step 3: Adapter_Loading
When a request arrives with an adapter_id, the router checks if the adapter is already cached in GPU memory. If not, the adapter weights are downloaded from the specified source (HuggingFace Hub, S3, or local path) and loaded into GPU memory. An LRU cache manages adapter lifecycle, automatically offloading least-recently-used adapters to CPU memory when GPU capacity is reached.
What happens internally:
- Router's adapter loader checks the adapter cache for a hit
- On cache miss, a download request is sent to Python shards via gRPC
- Adapter weights (safetensors or bin format) are downloaded and loaded
- LoRA A/B weight matrices are injected into the PunicaWrapper for batched computation
- Authorization is verified per-request even for cached adapters
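To illustrate the caching behavior described in this step, here is a simplified, single-threaded sketch of an LRU adapter cache that offloads the least-recently-used adapter to CPU memory when the GPU tier is full. The class `AdapterLRUCache`, the `download` callback, and the string stand-ins for weights are hypothetical; this is not the LoRAX implementation, which also handles gRPC calls to the shards, concurrency, and per-request authorization.

```python
from collections import OrderedDict
from typing import Callable, Dict, List


class AdapterLRUCache:
    """Two-tier cache: a bounded 'GPU' tier with LRU eviction to a 'CPU' tier."""

    def __init__(self, gpu_capacity: int, download: Callable[[str], object]):
        self.gpu_capacity = gpu_capacity  # assumed to be at least 1
        self.download = download
        self.gpu: "OrderedDict[str, object]" = OrderedDict()
        self.cpu: Dict[str, object] = {}

    def get(self, adapter_id: str) -> object:
        if adapter_id in self.gpu:
            # Cache hit: mark the adapter as most recently used.
            self.gpu.move_to_end(adapter_id)
            return self.gpu[adapter_id]

        # GPU miss: reuse an offloaded copy if present, otherwise download.
        weights = self.cpu.pop(adapter_id, None)
        if weights is None:
            weights = self.download(adapter_id)

        if len(self.gpu) >= self.gpu_capacity:
            # Offload the least-recently-used adapter to CPU memory.
            evicted_id, evicted_weights = self.gpu.popitem(last=False)
            self.cpu[evicted_id] = evicted_weights

        self.gpu[adapter_id] = weights
        return weights


downloads: List[str] = []


def fake_download(adapter_id: str) -> str:
    """Stand-in for fetching weights from hub, S3, or a local path."""
    downloads.append(adapter_id)
    return f"weights:{adapter_id}"


cache = AdapterLRUCache(gpu_capacity=2, download=fake_download)
for requested in ["a", "b", "a", "c", "b"]:
    cache.get(requested)
```

In this run adapter "b" is offloaded when "c" arrives and later restored from the CPU tier, so each adapter is downloaded only once.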
Step 4: Batched_Inference
The request enters the continuous batching scheduler where it is combined with other pending requests (potentially for different adapters) into a single batch. The prefill phase processes all input tokens, and the decode phase generates output tokens one at a time. The SGMV/BGMV Triton kernels enable efficient batched LoRA computation across multiple adapters in the same batch.
What happens internally:
- Request enters the queue and is packed into the next available batch slot
- Prefill: all input tokens processed through the model with flash attention
- Decode: tokens generated autoregressively with paged attention for KV cache
- LoRA deltas applied via PunicaWrapper during each forward pass
- Generation stops at max_new_tokens, EOS token, or stop sequence
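The idea behind applying LoRA deltas across a mixed batch can be sketched in plain Python: every row shares the base weight W, and each row adds its own low-rank update B(Ax) selected by its adapter. This is only the arithmetic, written as a loop under assumed toy shapes and with the LoRA scaling factor omitted. The actual SGMV/BGMV kernels perform the equivalent computation in fused GPU code, and the function and variable names here are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[float]
Matrix = List[Vector]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def batched_lora_forward(
    base_weight: Matrix,
    adapters: Dict[str, Tuple[Matrix, Matrix]],  # adapter_id -> (A, B)
    inputs: List[Vector],
    adapter_ids: List[Optional[str]],
) -> List[Vector]:
    """Compute y = W x + B (A x) per row, with a per-row adapter choice."""
    outputs = []
    for x, adapter_id in zip(inputs, adapter_ids):
        y = matvec(base_weight, x)
        if adapter_id is not None:
            a, b = adapters[adapter_id]
            # Project down to rank r with A, then back up with B.
            delta = matvec(b, matvec(a, x))
            y = [y_i + d_i for y_i, d_i in zip(y, delta)]
        outputs.append(y)
    return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]
toy_adapters = {
    "adapter-x": ([[1.0, 1.0]], [[1.0], [0.0]]),  # rank-1 A (1x2) and B (2x1)
    "adapter-y": ([[1.0, 0.0]], [[0.0], [2.0]]),
}
batch_inputs = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
batch_adapters = ["adapter-x", "adapter-y", None]  # None means base model only

batch_outputs = batched_lora_forward(identity, toy_adapters, batch_inputs, batch_adapters)
```

The same input yields three different outputs in one batch because each row picks up a different delta, and the base-model row picks up none.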
Step 5: Response_Handling
Receive and process the generated response. For non-streaming requests, the complete generated text is returned as a single response with generation details (token count, finish reason, timing). For streaming requests, tokens are returned one at a time via server-sent events (SSE), enabling real-time display of generated text.
Key considerations:
- Streaming mode lowers perceived latency, because the first tokens can be displayed as soon as they are generated
- Response includes finish_reason: length (max tokens), eos_token, or stop_sequence
- Token logprobs and alternative tokens can be requested for analysis
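For streaming responses, a client reads `data:` lines from the SSE stream and joins the token texts. The sketch below parses such lines from any iterable of strings. The event shape (a `token` object with `text` and `special` fields, and a `details` object carrying `finish_reason` on the final event) is an assumption about the payload, and the sample lines are made up for illustration rather than captured from a server.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of every 'data:' line."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        yield json.loads(line[len("data:"):])


def collect_stream(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Join streamed token texts and capture the final finish_reason."""
    pieces = []
    finish_reason = None
    for event in iter_sse_events(lines):
        token = event.get("token") or {}
        if not token.get("special", False):
            pieces.append(token.get("text", ""))
        details = event.get("details")
        if details:
            finish_reason = details.get("finish_reason")
    return "".join(pieces), finish_reason


# Made-up sample events in the assumed shape.
sample_lines = [
    'data:{"token": {"text": "Hello", "special": false}, "details": null}',
    "",
    'data:{"token": {"text": " world", "special": false}, '
    '"details": {"finish_reason": "length"}}',
]
streamed_text, streamed_finish_reason = collect_stream(sample_lines)
```

In a live setup the `lines` argument would be the decoded lines of the HTTP response body, and each piece could be printed as it arrives instead of being collected.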