Environment:Predibase Lorax Python Server Dependencies
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Python |
| Last Updated | 2026-02-08 02:30 GMT |
Overview
Python 3.9+ runtime with PyTorch 2.4+, Transformers 4.49+, optional quantization libraries (GPTQ, AWQ, BitsAndBytes, EETQ, HQQ), and optional constrained decoding via Outlines for the LoRAX inference server.
Description
This environment defines the Python package ecosystem required by the LoRAX server component. The server is a gRPC inference service that loads transformer models, applies LoRA adapters, and runs GPU-accelerated inference. Dependencies are layered:
- Core: PyTorch, Transformers, Triton, protobuf/gRPC for model serving
- Attention: Flash Attention (V1 or V2), FlashInfer for optimized attention computation
- Quantization (optional): BitsAndBytes, GPTQ (ExLLaMA), AWQ, EETQ, HQQ, FP8 for reduced-precision inference
- Constrained decoding (optional): Outlines for JSON schema-guided generation
- Adapter management: PEFT, Accelerate for LoRA weight loading
Package versions are specified in `server/pyproject.toml` with exact pins in `server/requirements.txt`.
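The optional layers are import-gated at runtime (see Code Evidence below), so it can be useful to probe which ones a given environment actually provides. A minimal sketch using only the standard library; the layer-to-package mapping here is taken from this page and is not an official LoRAX API:

```python
# Sketch: probe which optional LoRAX dependency layers are importable.
# Package names are from this page; availability depends on your install.
import importlib.util

OPTIONAL_LAYERS = {
    "quantization (bitsandbytes)": "bitsandbytes",
    "quantization (hqq)": "hqq",
    "constrained decoding (outlines)": "outlines",
    "adapter management (peft)": "peft",
    "adapter management (accelerate)": "accelerate",
}

def probe(module_name):
    """Return True if the module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None

for layer, module in OPTIONAL_LAYERS.items():
    status = "available" if probe(module) else "missing"
    print(f"{layer}: {status}")
```

Using `find_spec` avoids paying the import cost (and side effects) of heavy packages just to check their presence.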
Usage
This environment is required for running the LoRAX Python server process (`lorax-server`). It is the software layer that sits on top of the Environment:Predibase_Lorax_CUDA_GPU_Runtime hardware environment. All model loading, inference, adapter merging, and token generation depend on these packages.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Python | 3.9+ | Specified in pyproject.toml as `python = "^3.9"` |
| OS | Linux x86_64 | Triton 3.0.0 only available on Linux x86_64 |
| RAM | 32GB+ | For model loading and tokenization |
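These requirements can be pre-flighted before launching `lorax-server`. A minimal sketch with thresholds copied from the table above; `check_runtime` is illustrative, not part of LoRAX:

```python
# Illustrative pre-flight check mirroring the System Requirements table.
import platform
import sys

def check_runtime():
    """Return a list of requirement violations (empty when all pass)."""
    problems = []
    if sys.version_info < (3, 9):
        problems.append(f"Python {platform.python_version()} is older than 3.9")
    if platform.system() != "Linux" or platform.machine() != "x86_64":
        problems.append(
            f"{platform.system()}/{platform.machine()} is not Linux x86_64 "
            "(Triton 3.0.0 wheels are unavailable)"
        )
    return problems

issues = check_runtime()
print("requirements OK" if not issues else "; ".join(issues))
```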
Dependencies
Core Packages
- `torch` >= 2.4.0 (pinned 2.6.0)
- `transformers` >= 4.49
- `triton` = 3.0.0
- `numpy` < 2.0
- `grpcio` >= 1.51.1
- `grpcio-tools` = 1.51.1
- `protobuf` >= 3.20
- `safetensors` >= 0.3
- `huggingface-hub` >= 0.12
- `loguru` >= 0.6
- `boto3` (for S3 model source)
- `einops` >= 0.6
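A sketch for checking installed core versions against these minimums using the standard library's `importlib.metadata`; the `MINIMUMS` mapping and the naive `parse` helper are illustrative (a production check would use `packaging.version` instead):

```python
# Sketch: compare installed core package versions to the minimums above.
# Packages that are not installed are reported rather than raising.
from importlib.metadata import PackageNotFoundError, version

MINIMUMS = {
    "torch": (2, 4, 0),
    "transformers": (4, 49, 0),
    "grpcio": (1, 51, 1),
    "safetensors": (0, 3, 0),
}

def parse(v):
    """Naive numeric parse of up to three version components."""
    parts = []
    for piece in v.split(".")[:3]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits or 0))
    return tuple(parts)

for pkg, minimum in MINIMUMS.items():
    try:
        installed = version(pkg)
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
        continue
    ok = parse(installed) >= minimum
    print(f"{pkg}: {installed} ({'ok' if ok else 'below minimum'})")
```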
Quantization Packages (Optional)
- `bitsandbytes` >= 0.43.1 (deprecated; use EETQ instead)
- `hqq` >= 0.1.7
Adapter Management Packages
- `accelerate` >= 0.24.1
- `peft` = 0.4.0
Constrained Decoding Packages (Optional)
- `outlines` >= 0.1.1
Attention Packages
- `flash-attn` (V1 or V2 CUDA bindings, system-dependent)
- `flashinfer` = 0.1.6 (cu124 build)
Credentials
No Python-specific credentials. See Environment:Predibase_Lorax_Model_Source_Credentials for model download tokens.
Quick Install
```shell
# Recommended: use the official Docker image, which has all dependencies pre-installed
docker pull ghcr.io/predibase/lorax:latest

# Manual install from the repository:
cd server
pip install ".[bnb, accelerate, quantize, peft, outlines]" --no-cache-dir
pip install -r requirements.txt

# Generate protobuf files:
make gen-server
```
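After a manual install, a quick smoke test can confirm the core stack is importable. A minimal sketch using only the standard library; note that import names can differ from PyPI names (e.g. `grpcio` imports as `grpc`):

```python
# Sketch: report versions for whatever core packages are importable,
# instead of failing outright on the first missing one.
import importlib

def smoke_test(packages=("torch", "transformers", "grpc", "safetensors")):
    """Map each importable module to its version string, else 'MISSING'."""
    report = {}
    for name in packages:
        try:
            mod = importlib.import_module(name)
            report[name] = getattr(mod, "__version__", "unknown")
        except ImportError:
            report[name] = "MISSING"
    return report

for pkg, ver in smoke_test().items():
    print(f"{pkg}: {ver}")
```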
Code Evidence
Outlines import gating from `server/lorax_server/utils/logits_process.py:16-22`:
```python
try:
    from outlines.fsm.guide import RegexGuide
    from outlines.fsm.json_schema import build_regex_from_schema

    HAS_OUTLINES = True
except ImportError:
    HAS_OUTLINES = False
```
HQQ optional import from `server/lorax_server/layers/hqq.py:5-11`:
```python
HAS_HQQ = True
try:
    from hqq.core.quantize import BaseQuantizeConfig, HQQBackend, HQQLinear

    HQQLinear.set_backend(HQQBackend.ATEN)
except ImportError:
    HAS_HQQ = False
```
BitsAndBytes deprecation warning from `server/lorax_server/layers/bnb.py:7-12`:
```python
def warn_deprecate_bnb():
    logger.warning(
        "Bitsandbytes 8bit is deprecated, using `eetq` is a drop-in replacement "
        "with better performance"
    )
```
PyTorch backend setup from `server/lorax_server/models/__init__.py:20-28`:
```python
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
torch.set_grad_enabled(False)
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: Flash Attention V2 is not installed` | Missing flash_attn_2_cuda module | Install with `make install-flash-attention-v2-cuda` or use Docker image |
| `ImportError: Could not import SGMV kernel from Punica` | Punica kernels not compiled | Build with `cd server/punica_kernels && python setup.py install` |
| `ImportError: No module named 'outlines'` | Outlines not installed; structured output unavailable | `pip install "outlines>=0.1.1"` (quoted so the shell does not treat `>` as a redirect) |
| `ImportError: No module named 'EETQ'` | EETQ quantization not available | Build EETQ kernels or use Docker image |
Compatibility Notes
- Triton 3.0.0: Only available on Linux x86_64. Not available on macOS or Windows.
- NumPy < 2.0: Explicitly pinned to avoid breaking changes in NumPy 2.0 API.
- BitsAndBytes: Marked as deprecated in LoRAX. EETQ is the recommended replacement with better performance.
- Outlines: Optional dependency. Only needed for JSON schema-constrained generation.
- FlashInfer: Only cu124 build is tested. Must match the CUDA version.
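Outside the official image, the NumPy and version pins above can be enforced with a pip constraints file. An illustrative fragment only; the authoritative exact pins live in `server/requirements.txt`:

```
# constraints.txt (illustrative; see server/requirements.txt for exact pins)
numpy<2.0
triton==3.0.0
flashinfer==0.1.6
```

Applied with `pip install -r requirements.txt -c constraints.txt`.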