Workflow:Predibase Lorax Server Deployment

Knowledge Sources	LoRAX, LoRAX Docs, Docker Getting Started
Domains	LLM_Ops, Infrastructure, Deployment
Last Updated	2026-02-08 03:00 GMT

Overview

End-to-end process for deploying a LoRAX multi-LoRA inference server on top of a base LLM, using Docker containers or Kubernetes Helm charts.

Description

This workflow covers the complete procedure for launching a LoRAX server instance capable of serving thousands of fine-tuned LoRA adapters on a single GPU. The deployment process involves selecting a base model from HuggingFace Hub, configuring quantization options (bitsandbytes, GPTQ, AWQ, EETQ, HQQ), launching the three-process architecture (launcher, router, and Python model server), and verifying the server is ready to accept requests. The launcher binary orchestrates the entire startup: downloading model weights, spawning Python model shard processes (one per GPU), and starting the Rust HTTP/gRPC router.

Usage

Execute this workflow when you need to deploy a LoRAX server to serve one or more LoRA-adapted LLMs in production or development. It is a prerequisite for all other LoRAX workflows. You need an NVIDIA GPU (Ampere generation or newer) with CUDA 11.8+ compatible drivers, and Docker installed.

Execution Steps

Step 1: Environment_Preparation

Verify system prerequisites and install the required tooling. The host must run Linux and have an NVIDIA GPU of the Ampere generation or newer with CUDA 11.8+ compatible drivers. Install Docker and the nvidia-container-toolkit to enable GPU passthrough into containers, then restart the Docker daemon so the toolkit takes effect.

Key considerations:

Shared memory size must be set to at least 1 GB for model loading
A volume mount is recommended to cache downloaded model weights across restarts
The container listens on port 80 internally; map it to your desired host port
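
The three considerations above map directly onto docker run options. The following is a minimal sketch that only builds the command line and does not execute it. The image tag ghcr.io/predibase/lorax:main, the /data cache path inside the container, the host port 8080, and the helper name docker_run_command are assumptions for illustration; check them against the LoRAX documentation for your release.

```python
import shlex
from pathlib import Path


def docker_run_command(
    model_id: str,
    host_port: int = 8080,            # assumption: any free host port works
    cache_dir: Path = Path("data"),   # host directory used as the weight cache
    shm_size: str = "1g",
    image: str = "ghcr.io/predibase/lorax:main",  # assumption: verify the tag
) -> list[str]:
    """Build the argv for `docker run` without executing it."""
    return [
        "docker", "run",
        "--gpus", "all",                       # GPU passthrough (needs nvidia-container-toolkit)
        "--shm-size", shm_size,                # at least 1 GB of shared memory
        "-p", f"{host_port}:80",               # the container listens on port 80
        "-v", f"{cache_dir.resolve()}:/data",  # assumption: weights are cached under /data
        image,
        "--model-id", model_id,
    ]


if __name__ == "__main__":
    cmd = docker_run_command("mistralai/Mistral-7B-Instruct-v0.1")
    print(shlex.join(cmd))
```

Returning a list instead of a shell string keeps the command safe to pass to subprocess.run without shell quoting issues.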

Step 2: Base_Model_Selection

Choose a base model from HuggingFace Hub that will serve as the foundation for all LoRA adapters. LoRAX supports 15+ architectures including Llama, Mistral, Mixtral, Gemma, Phi, Qwen, GPT-2, BLOOM, and more. The base model is the shared backbone onto which task-specific LoRA adapters are dynamically loaded at request time.

Key considerations:

All adapters must be trained on the same base model used in the deployment
Base models can be loaded in fp16 or quantized with bitsandbytes, GPTQ, AWQ, EETQ, or HQQ
Quantization reduces memory footprint (e.g., 4-bit NF4 allows 7B models on consumer GPUs)
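
The memory claim above follows from simple arithmetic on bits per weight. The sketch below estimates the footprint of the weights alone; it deliberately ignores quantization metadata (scales and zero points), the KV cache, activations, and loaded adapters, so real usage will be higher. The function name and the precision labels are illustrative, not LoRAX identifiers.

```python
# Bits stored per weight for a few common precisions (illustrative labels).
BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "nf4": 4}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    bits = BITS_PER_WEIGHT[precision]
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    for precision in BITS_PER_WEIGHT:
        print(f"7B @ {precision}: ~{weight_memory_gb(7e9, precision):.1f} GB")
```

For a 7B-parameter model this gives roughly 14 GB in fp16 and 3.5 GB at 4 bits, which is why 4-bit loading brings such models within reach of consumer GPUs.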

Step 3: Server_Launch

Start the LoRAX Docker container with the selected model ID and configuration options. The lorax-launcher binary inside the container orchestrates the full startup sequence: it downloads model weights (if not already cached), spawns one Python model shard per GPU, waits for shards to initialize, and then launches the Rust router process that exposes the HTTP API.

What happens internally:

Launcher downloads safetensors weights from HuggingFace Hub via the model source abstraction
Python server loads weights into GPU memory with optional quantization applied during loading
Router starts axum HTTP server with REST endpoints and OpenAI-compatible API
gRPC connections established between router and each Python shard over Unix sockets
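
The choices that shape this startup sequence are passed to the launcher as command-line arguments. The sketch below assembles them under the assumption that the launcher accepts --model-id, --quantize, and --num-shard; confirm the exact flag names and accepted quantization values with lorax-launcher --help for your release. The helper launcher_args is hypothetical.

```python
from typing import Optional


def launcher_args(
    model_id: str,
    quantize: Optional[str] = None,   # e.g. "awq"; accepted values are an assumption
    num_shard: Optional[int] = None,  # number of Python shards, one per GPU
) -> list[str]:
    """Assemble launcher arguments; optional settings are omitted when unset."""
    args = ["--model-id", model_id]
    if quantize is not None:
        args += ["--quantize", quantize]
    if num_shard is not None:
        if num_shard < 1:
            raise ValueError("num_shard must be at least 1")
        args += ["--num-shard", str(num_shard)]
    return args


if __name__ == "__main__":
    print(launcher_args("mistralai/Mistral-7B-Instruct-v0.1", quantize="awq", num_shard=2))
```

Leaving quantize unset keeps the launcher's default precision, and leaving num_shard unset lets the launcher decide how many shards to spawn.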

Step 4: Health_Verification

Confirm the server is ready to accept inference requests. The router exposes a health endpoint that checks connectivity to all Python shards. The server is ready when the health check returns a successful status, indicating model weights are loaded and the inference pipeline is warmed up.

Key considerations:

Initial model loading and warmup time grows with model size, so allow a generous timeout before treating a failing health check as an error
CUDA graph compilation occurs during warmup for optimized decode performance
Monitor container logs for any OOM errors or weight loading failures
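
Because loading and warmup take time, readiness is best checked by polling rather than with a single request. The sketch below retries a probe until it succeeds or a deadline passes. The /health path and the localhost:8080 address in the usage note are assumptions about the deployment, and the function names are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A typical call would be wait_until_healthy(lambda: http_probe("http://localhost:8080/health")). The sleep and clock parameters exist only so the loop can be exercised without real delays.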

Step 5: Production_Configuration

Apply production-grade settings for reliability and observability. Configure Prometheus metrics export for monitoring throughput and latency. Enable OpenTelemetry distributed tracing for request-level debugging. For Kubernetes deployments, use the provided Helm charts with appropriate resource limits and replica counts.

Key considerations:

Helm charts provide deployment templates, service definitions, and configurable values
Tensor parallelism can be enabled for multi-GPU deployments via sharding configuration
Environment variables control adapter caching, memory budgets, and CUDA graph behavior
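
Prometheus metrics are exported in a plain-text exposition format, so a quick sanity check of throughput and latency counters needs no client library. The sketch below is a minimal parser, assuming each sample line ends with its value and carries no trailing timestamp. The metric names in the sample are hypothetical placeholders, not actual LoRAX metric names.

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Map each sample's `name{labels}` string to its numeric value."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last whitespace-separated field (no timestamp assumed).
        key, _, value = line.rpartition(" ")
        samples[key.strip()] = float(value)
    return samples


# Hypothetical scrape output used only to exercise the parser.
SAMPLE = """\
# HELP example_request_count Total requests (hypothetical metric name)
# TYPE example_request_count counter
example_request_count 42
example_request_duration_sum{method="POST"} 3.5
"""

if __name__ == "__main__":
    for name, value in parse_prometheus_text(SAMPLE).items():
        print(name, value)
```

In a real deployment a Prometheus server scrapes the endpoint directly; a parser like this is only useful for ad hoc checks and smoke tests.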

GitHub URL

Workflow Repository