Workflow:Predibase Lorax Server Deployment
| Knowledge Sources | |
|---|---|
| Domains | LLM_Ops, Infrastructure, Deployment |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
End-to-end process for deploying a LoRAX multi-LoRA inference server with a base LLM, using either Docker containers or Kubernetes Helm charts.
Description
This workflow covers the complete procedure for launching a LoRAX server instance capable of serving thousands of fine-tuned LoRA adapters on a single GPU. The deployment process involves selecting a base model from HuggingFace Hub, configuring quantization options (bitsandbytes, GPTQ, AWQ, EETQ, HQQ), launching the three-process architecture (launcher, router, and Python model server), and verifying the server is ready to accept requests. The launcher binary orchestrates the entire startup: downloading model weights, spawning Python model shard processes (one per GPU), and starting the Rust HTTP/gRPC router.
Usage
Execute this workflow when you need to deploy a LoRAX server to serve one or more LoRA-adapted LLMs in production or development. It is a prerequisite for all other LoRAX workflows. You need an NVIDIA GPU (Ampere generation or newer), drivers compatible with CUDA 11.8 or later, and Docker installed.
Execution Steps
Step 1: Environment_Preparation
Verify system prerequisites and install the required tooling. The host must run Linux and have an NVIDIA GPU of the Ampere generation or newer, with drivers compatible with CUDA 11.8 or later. Install Docker and the nvidia-container-toolkit to enable GPU passthrough into containers, then restart the Docker daemon so that it picks up the toolkit.
Key considerations:
- The container's shared memory size must be set to at least 1 GB for model loading
- A volume mount is recommended to cache downloaded model weights across restarts
- The container listens on port 80 internally; map it to your desired host port
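The three considerations above map onto Docker-level options. The following is a minimal sketch, not the project's own tooling: the helper name `docker_run_prefix` is hypothetical, and the `/data` mount point inside the container is an assumption based on the upstream quick-start command, so check it against the image you actually run.

```python
import os


def docker_run_prefix(host_port: int, cache_dir: str, shm_size: str = "1g") -> list:
    """Build the Docker-level arguments that precede the image name.

    Hypothetical helper: it only assembles a command line, it does not run it.
    """
    if not 1 <= host_port <= 65535:
        raise ValueError(f"host_port out of range: {host_port}")
    # Docker bind mounts need an absolute host path.
    cache = os.path.abspath(os.path.expanduser(cache_dir))
    return [
        "docker", "run",
        "--gpus", "all",            # GPU passthrough via nvidia-container-toolkit
        "--shm-size", shm_size,     # at least 1 GB of shared memory
        "-p", f"{host_port}:80",    # container listens on port 80 internally
        "-v", f"{cache}:/data",     # assumed cache mount point for model weights
    ]


print(" ".join(docker_run_prefix(8080, "./lorax-cache")))
```

The image name and launcher arguments are appended after this prefix.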
Step 2: Base_Model_Selection
Choose a base model from HuggingFace Hub that will serve as the foundation for all LoRA adapters. LoRAX supports 15+ architectures including Llama, Mistral, Mixtral, Gemma, Phi, Qwen, GPT-2, BLOOM, and more. The base model is the shared backbone onto which task-specific LoRA adapters are dynamically loaded at request time.
Key considerations:
- All adapters must be trained on the same base model used in the deployment
- Base models can be loaded in fp16 or quantized with bitsandbytes, GPTQ, AWQ, EETQ, or HQQ
- Quantization reduces memory footprint (e.g., 4-bit NF4 shrinks the weights of a 7B model from roughly 14 GB in fp16 to roughly 3.5 GB, which fits on consumer GPUs)
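The effect of quantization on weight memory can be estimated with simple arithmetic. This sketch counts weight storage only; it ignores quantization metadata, the KV cache, activations, and adapter weights, so treat the numbers as a lower bound rather than a sizing guarantee. The function name is illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10**9 bytes) for a given precision."""
    return num_params * bits_per_param / 8 / 1e9


# A 7B-parameter model at three common precisions.
for label, bits in [("fp16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{label}: {weight_memory_gb(7e9, bits):.1f} GB")
# fp16: 14.0 GB
# int8: 7.0 GB
# nf4: 3.5 GB
```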
Step 3: Server_Launch
Start the LoRAX Docker container with the selected model ID and configuration options. The lorax-launcher binary inside the container orchestrates the full startup sequence: it downloads model weights (if not already cached), spawns one Python model shard per GPU, waits for shards to initialize, and then launches the Rust router process that exposes the HTTP API.
What happens internally:
- The launcher downloads safetensors weights from HuggingFace Hub via the model source abstraction
- The Python server loads the weights into GPU memory, applying quantization during loading if configured
- The router starts an axum HTTP server with REST endpoints and an OpenAI-compatible API
- gRPC connections are established between the router and each Python shard over Unix sockets
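The arguments passed to the launcher after the image name can be assembled as in the sketch below. The helper `launcher_args` is hypothetical. The flags `--model-id`, `--quantize`, and `--num-shard`, the image tag `ghcr.io/predibase/lorax:main`, and the example values are assumptions drawn from the upstream documentation; verify them with the launcher's help output for your version.

```python
import shlex
from typing import Optional


def launcher_args(model_id: str,
                  quantize: Optional[str] = None,
                  num_shard: Optional[int] = None) -> list:
    """Assemble launcher arguments (hypothetical helper, flags are assumptions)."""
    if not model_id:
        raise ValueError("model_id must be a non-empty HuggingFace Hub ID")
    args = ["--model-id", model_id]
    if quantize:
        # Passed through unchanged; valid values depend on the LoRAX version.
        args += ["--quantize", quantize]
    if num_shard is not None:
        if num_shard < 1:
            raise ValueError("num_shard must be at least 1")
        # One Python model shard is spawned per GPU.
        args += ["--num-shard", str(num_shard)]
    return args


image = "ghcr.io/predibase/lorax:main"  # assumed image tag
example = launcher_args("mistralai/Mistral-7B-Instruct-v0.1",
                        quantize="bitsandbytes-nf4")
print(shlex.join([image, *example]))
```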
Step 4: Health_Verification
Confirm the server is ready to accept inference requests. The router exposes a health endpoint that checks connectivity to all Python shards. The server is ready when the health check returns a successful status, indicating model weights are loaded and the inference pipeline is warmed up.
Key considerations:
- Initial model loading and warmup time grows with model size, so allow a generous timeout before treating a slow start as a failure
- CUDA graph compilation occurs during warmup for optimized decode performance
- Monitor container logs for any OOM errors or weight loading failures
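Readiness can be checked by polling the health endpoint until it succeeds or a deadline passes. This is a minimal sketch using only the standard library. It assumes the router serves the check at the path `/health` and that the container port is mapped to `localhost:8080`; both are assumptions to adjust for your deployment, and the function names are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_health_probe(base_url: str, timeout: float = 2.0) -> bool:
    """Return True when GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx responses all mean "not ready".
        return False


def wait_until_ready(probe: Callable[[], bool],
                     timeout_s: float = 600.0,
                     interval_s: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Example usage (performs real HTTP requests, so it is left commented out):
# ready = wait_until_ready(lambda: http_health_probe("http://localhost:8080"))
```

Passing the probe as a callable keeps the retry loop testable without a running server. If the function returns False, the container logs are the next place to look.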
Step 5: Production_Configuration
Apply production-grade settings for reliability and observability. Configure Prometheus metrics export for monitoring throughput and latency. Enable OpenTelemetry distributed tracing for request-level debugging. For Kubernetes deployments, use the provided Helm charts with appropriate resource limits and replica counts.
Key considerations:
- Helm charts provide deployment templates, service definitions, and configurable values
- Tensor parallelism can be enabled for multi-GPU deployments via sharding configuration
- Environment variables control adapter caching, memory budgets, and CUDA graph behavior
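Once metrics export is enabled, a quick sanity check is to fetch the metrics text and confirm that the expected series appear. The sketch below parses the Prometheus text exposition format into a dictionary. It is deliberately simplified: it assumes samples carry no trailing timestamp, and the metric names in the sample are hypothetical placeholders, not real LoRAX metric names.

```python
def parse_prometheus_text(text: str) -> dict:
    """Map each sample (name plus optional labels) to its float value.

    Simplified: comment lines are skipped and samples are assumed to have
    no trailing timestamp.
    """
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Hypothetical metric names, for illustration only.
SAMPLE = """\
# HELP example_request_count Hypothetical request counter
# TYPE example_request_count counter
example_request_count 42
example_queue_size{shard="0"} 3
"""

for name, value in parse_prometheus_text(SAMPLE).items():
    print(name, value)
```

In a real deployment a Prometheus server scrapes the endpoint directly; a parser like this is only useful for ad hoc checks or smoke tests.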