
Environment:Predibase Lorax Python Server Dependencies

From Leeroopedia


Domains: Infrastructure, Python
Last Updated: 2026-02-08 02:30 GMT

Overview

Python 3.9+ runtime with PyTorch 2.4+, Transformers 4.49+, optional quantization libraries (GPTQ, AWQ, BitsAndBytes, EETQ, HQQ), and Outlines for constrained decoding, supporting the LoRAX inference server.

Description

This environment defines the Python package ecosystem required by the LoRAX server component. The server is a gRPC inference service that loads transformer models, applies LoRA adapters, and runs GPU-accelerated inference. Dependencies are layered:

  • Core: PyTorch, Transformers, Triton, protobuf/gRPC for model serving
  • Attention: Flash Attention (V1 or V2), FlashInfer for optimized attention computation
  • Quantization (optional): BitsAndBytes, GPTQ (ExLLaMA), AWQ, EETQ, HQQ, FP8 for reduced-precision inference
  • Constrained decoding (optional): Outlines for JSON schema-guided generation
  • Adapter management: PEFT, Accelerate for LoRA weight loading

Package versions are specified in `server/pyproject.toml` with exact pins in `server/requirements.txt`.
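
Because the exact pins live in `server/requirements.txt`, an installed environment can be checked against them programmatically. A minimal sketch using only the standard library (the helper functions are illustrative, not part of LoRAX):

```python
# Sketch: compare installed package versions against exact pins.
# Helper names are illustrative; LoRAX does not ship these.
from importlib import metadata

def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def matches_exact_pin(package: str, pinned: str):
    """True/False if installed; None if the package is not installed at all."""
    version = installed_version(package)
    return None if version is None else version == pinned
```

For example, `matches_exact_pin("grpcio-tools", "1.51.1")` reports whether the exact pin from the core list holds in the current environment.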

Usage

This environment is required for running the LoRAX Python server process (`lorax-server`). It is the software layer that sits on top of the Environment:Predibase_Lorax_CUDA_GPU_Runtime hardware environment. All model loading, inference, adapter merging, and token generation depend on these packages.

System Requirements

  • Python: 3.9+ (specified in pyproject.toml as `python = "^3.9"`)
  • OS: Linux x86_64 (Triton 3.0.0 is only available on Linux x86_64)
  • RAM: 32GB+ (for model loading and tokenization)
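
These requirements can be verified before launching the server. A hedged pre-flight sketch (the function name is illustrative, not a LoRAX API):

```python
# Illustrative pre-flight check for the requirements above; not a LoRAX API.
import platform
import sys

def preflight_issues():
    """Return a list of human-readable problems; an empty list means the
    interpreter and platform requirements are satisfied."""
    issues = []
    if sys.version_info < (3, 9):
        issues.append("Python 3.9+ is required (pyproject.toml pins ^3.9)")
    if platform.system() != "Linux" or platform.machine() != "x86_64":
        issues.append("Triton 3.0.0 wheels are only published for Linux x86_64")
    return issues
```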

Dependencies

Core Packages

  • `torch` >= 2.4.0 (pinned 2.6.0)
  • `transformers` >= 4.49
  • `triton` = 3.0.0
  • `numpy` < 2.0
  • `grpcio` >= 1.51.1
  • `grpcio-tools` = 1.51.1
  • `protobuf` >= 3.20
  • `safetensors` >= 0.3
  • `huggingface-hub` >= 0.12
  • `loguru` >= 0.6
  • `boto3` (for S3 model source)
  • `einops` >= 0.6

Quantization and Other Optional Packages

  • `bitsandbytes` >= 0.43.1 (deprecated, use EETQ instead)
  • `hqq` >= 0.1.7
  • `accelerate` >= 0.24.1
  • `peft` = 0.4.0
  • `outlines` >= 0.1.1

Attention Packages

  • `flash-attn` (V1 or V2 CUDA bindings, system-dependent)
  • `flashinfer` = 0.1.6 (cu124 build)

Credentials

No Python-specific credentials. See Environment:Predibase_Lorax_Model_Source_Credentials for model download tokens.

Quick Install

# Recommended: Use the official Docker image which has all dependencies pre-installed
docker pull ghcr.io/predibase/lorax:latest

# Manual install from repository:
cd server
pip install ".[bnb, accelerate, quantize, peft, outlines]" --no-cache-dir
pip install -r requirements.txt

# Generate protobuf files:
make gen-server
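
After a manual install, a quick way to see which dependency layers are importable is to probe each module. A sketch (module names are taken from the lists above; the helper itself is not shipped with LoRAX):

```python
# Probe which dependency layers are importable; helper is not part of LoRAX.
import importlib

CORE = ["torch", "transformers", "grpc", "safetensors"]       # grpc = grpcio
OPTIONAL = ["bitsandbytes", "hqq", "outlines", "flash_attn"]

def import_report(modules):
    """Map each module name to True (importable) or False (missing)."""
    report = {}
    for name in modules:
        try:
            importlib.import_module(name)
            report[name] = True
        except ImportError:
            report[name] = False
    return report
```

Missing CORE entries mean the server cannot start; missing OPTIONAL entries only disable the corresponding feature, which is how LoRAX itself gates them (see Code Evidence).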

Code Evidence

Outlines import gating from `server/lorax_server/utils/logits_process.py:16-22`:

try:
    from outlines.fsm.guide import RegexGuide
    from outlines.fsm.json_schema import build_regex_from_schema

    HAS_OUTLINES = True
except ImportError:
    HAS_OUTLINES = False

HQQ optional import from `server/lorax_server/layers/hqq.py:5-11`:

HAS_HQQ = True
try:
    from hqq.core.quantize import BaseQuantizeConfig, HQQBackend, HQQLinear
    HQQLinear.set_backend(HQQBackend.ATEN)
except ImportError:
    HAS_HQQ = False

BitsAndBytes deprecation warning from `server/lorax_server/layers/bnb.py:7-12`:

def warn_deprecate_bnb():
    logger.warning(
        "Bitsandbytes 8bit is deprecated, using `eetq` is a drop-in replacement "
        "with better performance"
    )

PyTorch backend setup from `server/lorax_server/models/__init__.py:20-28`:

torch.backends.cuda.matmul.allow_tf32 = True  # allow TF32 matmuls on Ampere+ GPUs
torch.backends.cudnn.allow_tf32 = True        # same for cuDNN convolutions
torch.set_grad_enabled(False)                 # inference-only server: autograd off

Common Errors

  • `ImportError: Flash Attention V2 is not installed`: the flash_attn_2_cuda module is missing. Install with `make install-flash-attention-v2-cuda` or use the Docker image.
  • `ImportError: Could not import SGMV kernel from Punica`: the Punica kernels are not compiled. Build them with `cd server/punica_kernels && python setup.py install`.
  • `ImportError: No module named 'outlines'`: Outlines is not installed, so structured output is unavailable. Install with `pip install "outlines>=0.1.1"`.
  • `ImportError: No module named 'EETQ'`: EETQ quantization is not available. Build the EETQ kernels or use the Docker image.

Compatibility Notes

  • Triton 3.0.0: Only available on Linux x86_64. Not available on macOS or Windows.
  • NumPy < 2.0: Explicitly pinned to avoid breaking changes in NumPy 2.0 API.
  • BitsAndBytes: Marked as deprecated in LoRAX. EETQ is the recommended replacement with better performance.
  • Outlines: Optional dependency. Only needed for JSON schema-constrained generation.
  • FlashInfer: Only cu124 build is tested. Must match the CUDA version.
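
The version constraints above (e.g. NumPy < 2.0, FlashInfer = 0.1.6) can be validated with a small comparator. A simplified sketch that handles only numeric release segments, not a full PEP 440 implementation:

```python
# Simplified version-pin comparator; numeric release segments only,
# not a full PEP 440 implementation (no pre-releases or local tags).
def satisfies_pin(version: str, pin: str) -> bool:
    """Check a version like '1.26.4' against a pin like '<2.0' or '>=1.51.1'."""
    for op in ("<=", ">=", "==", "<", ">"):   # two-char operators first
        if pin.startswith(op):
            target = pin[len(op):].strip()
            break
    else:
        raise ValueError(f"unsupported pin: {pin!r}")

    def numeric(v):
        return tuple(int(part) for part in v.split("."))

    a, b = numeric(version), numeric(target)
    width = max(len(a), len(b))               # pad so '2.0' compares with '2.0.0'
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return {"<": a < b, "<=": a <= b,
            ">": a > b, ">=": a >= b, "==": a == b}[op]
```

For instance, `satisfies_pin("1.26.4", "<2.0")` is True, while `satisfies_pin("2.0.0", "<2.0")` is False.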
