Workflow:Predibase Lorax Single LoRA Inference

From Leeroopedia


Knowledge Sources
Domains LLM_Ops, Inference, LoRA
Last Updated 2026-02-08 03:00 GMT

Overview

End-to-end process for running inference through a LoRAX server with a LoRA adapter that is loaded dynamically from HuggingFace Hub, S3, or the local filesystem.

Description

This workflow covers the complete request lifecycle for generating text with a fine-tuned LoRA adapter on a running LoRAX server. The process spans from constructing a request with an adapter ID, through the server's dynamic adapter loading and batched inference pipeline, to receiving the generated response. LoRAX loads LoRA adapters just-in-time without blocking concurrent requests, and supports batching requests for different adapters together via heterogeneous continuous batching with SGMV/BGMV Triton kernels.

Usage

Execute this workflow when you have a running LoRAX server (see Server_Deployment workflow) and want to generate text using a specific LoRA adapter. You need one of the following to locate the adapter weights: the adapter ID on HuggingFace Hub, an S3 path, or a local filesystem path.

Execution Steps

Step 1: Client_Setup

Install and configure the LoRAX Python client or prepare REST API calls. The Python client wraps the HTTP/SSE API and provides both synchronous and asynchronous interfaces. Alternatively, use curl or any HTTP client to interact with the REST endpoints directly.

Key considerations:

  • The Python client is installed via pip as the lorax-client package
  • Both synchronous (Client) and asynchronous (AsyncClient) clients are available
  • For private adapters, include an authorization token in the client headers
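
As a minimal sketch of the REST route, the snippet below prepares an HTTP request using only the Python standard library. The server address http://127.0.0.1:8080, the /generate path, and the Bearer header scheme are assumptions chosen for illustration, and the helper names make_request and send are hypothetical; adjust all of them to your deployment.

```python
import json
import urllib.request
from typing import Optional

# Assumed address of a locally running LoRAX server; change it for your deployment.
BASE_URL = "http://127.0.0.1:8080"


def make_request(payload: dict, token: Optional[str] = None,
                 base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request for the assumed /generate endpoint."""
    headers = {"Content-Type": "application/json"}
    if token is not None:
        # Private adapters need an authorization token; the Bearer scheme is assumed.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON body. Requires a reachable server."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending keeps the header and URL logic easy to check without a live server. The lorax-client package covers the same ground through its Client and AsyncClient classes.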

Step 2: Request_Construction

Build the inference request with prompt text, adapter parameters, and generation settings. The adapter_id specifies which LoRA adapter to load, and adapter_source indicates where to find it (hub, s3, local, or pbase). Generation parameters control output behavior including max_new_tokens, temperature, top_p, top_k, repetition_penalty, and stop sequences.

Key considerations:

  • adapter_source defaults to "hub" (HuggingFace Hub) if not specified
  • The adapter must be trained on the same base model deployed in the server
  • Omitting adapter_id uses the base model without any adapter applied
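
The considerations above can be sketched as a small payload builder. The {"inputs": ..., "parameters": {...}} layout is an assumption about the REST request schema, and build_generate_payload is a hypothetical helper, so check both against your server's API documentation before relying on them.

```python
from typing import Any, Dict, Optional


def build_generate_payload(prompt: str,
                           adapter_id: Optional[str] = None,
                           adapter_source: Optional[str] = None,
                           **generation_params: Any) -> Dict[str, Any]:
    """Assemble a JSON-serializable request body (assumed schema)."""
    parameters: Dict[str, Any] = {}
    if adapter_id is not None:
        parameters["adapter_id"] = adapter_id
        if adapter_source is not None:
            # Left unset, the server falls back to its default source ("hub").
            parameters["adapter_source"] = adapter_source
    # With no adapter_id, the request targets the base model.
    # Drop unset generation settings so server-side defaults apply.
    parameters.update({k: v for k, v in generation_params.items() if v is not None})
    return {"inputs": prompt, "parameters": parameters}
```

For example, build_generate_payload("Summarize: ...", adapter_id="some-org/some-adapter", max_new_tokens=64, temperature=0.7) yields a body whose parameters carry the adapter ID and the two generation settings; the adapter name here is a placeholder, not a real repository.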

Step 3: Adapter_Loading

When a request arrives with an adapter_id, the router checks if the adapter is already cached in GPU memory. If not, the adapter weights are downloaded from the specified source (HuggingFace Hub, S3, or local path) and loaded into GPU memory. An LRU cache manages adapter lifecycle, automatically offloading least-recently-used adapters to CPU memory when GPU capacity is reached.

What happens internally:

  • Router's adapter loader checks the adapter cache for a hit
  • On cache miss, a download request is sent to Python shards via gRPC
  • Adapter weights (safetensors or bin format) are downloaded and loaded
  • LoRA A/B weight matrices are injected into the PunicaWrapper for batched computation
  • Authorization is verified per-request even for cached adapters
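
The cache behavior described in this step can be illustrated with a toy model. This is not the server's implementation: the class AdapterLRUCache, its download callback, and the event trace are all hypothetical, and real adapter weights are replaced by opaque objects. It only shows the policy of hitting the GPU cache, downloading on a miss, and offloading the least-recently-used adapter to CPU memory when GPU capacity is reached.

```python
from collections import OrderedDict
from typing import Callable, List


class AdapterLRUCache:
    """Toy model: track adapters on the GPU, offloading the LRU one to CPU."""

    def __init__(self, gpu_capacity: int, download: Callable[[str], object]):
        if gpu_capacity < 1:
            raise ValueError("gpu_capacity must be at least 1")
        self.gpu_capacity = gpu_capacity
        self.download = download          # hypothetical fetch-from-source callback
        self.gpu = OrderedDict()          # adapter_id -> weights, oldest entry first
        self.cpu = {}                     # adapters offloaded from the GPU
        self.events: List[str] = []       # trace of what happened, for inspection

    def get(self, adapter_id: str) -> object:
        if adapter_id in self.gpu:
            self.gpu.move_to_end(adapter_id)  # mark as most recently used
            self.events.append(f"hit:{adapter_id}")
            return self.gpu[adapter_id]
        if adapter_id in self.cpu:
            weights = self.cpu.pop(adapter_id)  # reload from CPU, no download
            self.events.append(f"reload:{adapter_id}")
        else:
            weights = self.download(adapter_id)
            self.events.append(f"download:{adapter_id}")
        if len(self.gpu) >= self.gpu_capacity:
            # Evict the least-recently-used adapter to CPU memory.
            victim, victim_weights = self.gpu.popitem(last=False)
            self.cpu[victim] = victim_weights
            self.events.append(f"offload:{victim}")
        self.gpu[adapter_id] = weights
        return weights
```

Keeping offloaded adapters in CPU memory means a later request for the same adapter avoids a second download. The sketch omits the per-request authorization check and the gRPC hop to the Python shards.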

Step 4: Batched_Inference

The request enters the continuous batching scheduler where it is combined with other pending requests (potentially for different adapters) into a single batch. The prefill phase processes all input tokens, and the decode phase generates output tokens one at a time. The SGMV/BGMV Triton kernels enable efficient batched LoRA computation across multiple adapters in the same batch.
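
To make the multi-adapter arithmetic concrete, here is a plain-Python sketch of what a batched LoRA projection computes. It is not how the GPU kernels are written; it only shows the idea of grouping consecutive batch rows that share an adapter into segments and adding each adapter's low-rank delta to the shared base projection. All names (matmul, segments, batched_lora_forward) are hypothetical, and scaling factors are omitted.

```python
from itertools import groupby
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Plain-Python matrix product (rows of x times columns of w)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def segments(adapter_ids: List[Optional[str]]) -> List[Tuple[Optional[str], int, int]]:
    """Collapse per-row adapter ids into (adapter_id, start, end) runs."""
    runs, start = [], 0
    for adapter_id, group in groupby(adapter_ids):
        length = len(list(group))
        runs.append((adapter_id, start, start + length))
        start += length
    return runs


def batched_lora_forward(x: Matrix, base_w: Matrix,
                         adapters: Dict[str, Tuple[Matrix, Matrix]],
                         adapter_ids: List[Optional[str]]) -> Matrix:
    """y = x @ W, plus (x @ A) @ B for rows whose request names an adapter."""
    y = matmul(x, base_w)  # the base projection is shared by the whole batch
    for adapter_id, start, end in segments(adapter_ids):
        if adapter_id is None:
            continue  # base-model request: no delta to add
        a, b = adapters[adapter_id]
        delta = matmul(matmul(x[start:end], a), b)
        for row, delta_row in zip(y[start:end], delta):
            for j, value in enumerate(delta_row):
                row[j] += value
    return y
```

Because the base weights are applied once to the whole batch and only the small A and B matrices differ per segment, requests for different adapters can share a single forward pass.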

What happens internally:

  • Request enters the queue and is packed into the next available batch slot
  • Prefill: all input tokens processed through the model with flash attention
  • Decode: tokens generated autoregressively with paged attention for KV cache
  • LoRA deltas applied via PunicaWrapper during each forward pass
  • Generation stops at max_new_tokens, EOS token, or stop sequence
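
The stopping rule in the last bullet can be sketched as a toy decode loop. The function decode and its next_token callback are hypothetical stand-ins for the model's autoregressive step, the "</s>" EOS string is an assumption, and tokens are treated as plain strings for readability.

```python
from typing import Callable, List, Sequence, Tuple


def decode(next_token: Callable[[List[str]], str],
           max_new_tokens: int,
           eos_token: str = "</s>",
           stop_sequences: Sequence[str] = ()) -> Tuple[str, str]:
    """Generate token by token; return (text, finish_reason)."""
    generated: List[str] = []
    for _ in range(max_new_tokens):
        token = next_token(generated)  # one autoregressive step
        if token == eos_token:
            return "".join(generated), "eos_token"
        generated.append(token)
        text = "".join(generated)
        # Ignore empty stop strings, which would match any text.
        if any(stop and text.endswith(stop) for stop in stop_sequences):
            return text, "stop_sequence"
    # The loop ran out of budget without hitting EOS or a stop sequence.
    return "".join(generated), "length"
```

The three return paths correspond to the three ways generation can end, and the reason strings match the finish_reason values listed under Response_Handling.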

Step 5: Response_Handling

Receive and process the generated response. For non-streaming requests, the complete generated text is returned as a single response with generation details (token count, finish reason, timing). For streaming requests, tokens are returned one at a time via server-sent events (SSE), enabling real-time display of generated text.

Key considerations:

  • Streaming mode lowers perceived latency, because each token can be displayed as soon as it is generated instead of after the full response completes
  • Response includes finish_reason: length (max tokens), eos_token, or stop_sequence
  • Token logprobs and alternative tokens can be requested for analysis
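
For the streaming case, a client has to reassemble text from individual SSE events. The sketch below parses data: lines and concatenates token texts. The payload field names used here (token.text, token.special, details.finish_reason) are assumptions about the event schema, and both helper functions are hypothetical, so verify them against real server output.

```python
import json
from typing import Iterable, Iterator, Optional, Tuple


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each `data:` line in an SSE stream."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        yield json.loads(line[len("data:"):].strip())


def collect_stream(lines: Iterable[str]) -> Tuple[str, Optional[str]]:
    """Concatenate token texts; return (text, finish_reason or None)."""
    pieces, finish_reason = [], None
    for payload in iter_sse_payloads(lines):
        token = payload.get("token") or {}
        if not token.get("special", False):
            pieces.append(token.get("text", ""))  # display or buffer the new text
        details = payload.get("details") or {}
        finish_reason = details.get("finish_reason", finish_reason)
    return "".join(pieces), finish_reason
```

In an interactive application, the pieces.append call would be replaced by writing each token to the display as it arrives.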
