
Environment:Predibase Lorax Python Server Dependencies

From Leeroopedia


Domains: Infrastructure, Python
Last Updated: 2026-02-08 02:30 GMT

Overview

Python 3.9+ runtime with PyTorch 2.4+, Transformers 4.49+, optional quantization libraries (GPTQ, AWQ, BitsAndBytes, EETQ, HQQ), and Outlines for constrained decoding, supporting the LoRAX inference server.

Description

This environment defines the Python package ecosystem required by the LoRAX server component. The server is a gRPC inference service that loads transformer models, applies LoRA adapters, and runs GPU-accelerated inference. Dependencies are layered:

  • Core: PyTorch, Transformers, Triton, protobuf/gRPC for model serving
  • Attention: Flash Attention (V1 or V2), FlashInfer for optimized attention computation
  • Quantization (optional): BitsAndBytes, GPTQ (ExLLaMA), AWQ, EETQ, HQQ, FP8 for reduced-precision inference
  • Constrained decoding (optional): Outlines for JSON schema-guided generation
  • Adapter management: PEFT, Accelerate for LoRA weight loading

Package versions are specified in `server/pyproject.toml` with exact pins in `server/requirements.txt`.
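
Because the exact pins live in `server/requirements.txt`, an installed environment can be checked against them programmatically. A minimal sketch using only the standard library (the helper functions are illustrative, not part of LoRAX):

```python
# Sketch: compare installed package versions against exact pins.
# Helper names are illustrative; LoRAX does not ship these.
from importlib import metadata

def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def matches_exact_pin(package: str, pinned: str):
    """True/False if installed; None if the package is not installed at all."""
    version = installed_version(package)
    return None if version is None else version == pinned
```

For example, `matches_exact_pin("grpcio-tools", "1.51.1")` reports whether the exact pin from the core list holds in the current environment.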

Usage

This environment is required for running the LoRAX Python server process (`lorax-server`). It is the software layer that sits on top of the Environment:Predibase_Lorax_CUDA_GPU_Runtime hardware environment. All model loading, inference, adapter merging, and token generation depend on these packages.

System Requirements

  • Python: 3.9+ (specified in pyproject.toml as `python = "^3.9"`)
  • OS: Linux x86_64 (Triton 3.0.0 is only available on Linux x86_64)
  • RAM: 32GB+ (for model loading and tokenization)
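
These requirements can be verified before launching the server. A hedged pre-flight sketch (the function name is illustrative, not a LoRAX API):

```python
# Illustrative pre-flight check for the requirements above; not a LoRAX API.
import platform
import sys

def preflight_issues():
    """Return a list of human-readable problems; an empty list means the
    interpreter and platform requirements are satisfied."""
    issues = []
    if sys.version_info < (3, 9):
        issues.append("Python 3.9+ is required (pyproject.toml pins ^3.9)")
    if platform.system() != "Linux" or platform.machine() != "x86_64":
        issues.append("Triton 3.0.0 wheels are only published for Linux x86_64")
    return issues
```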

Dependencies

Core Packages

  • `torch` >= 2.4.0 (pinned 2.6.0)
  • `transformers` >= 4.49
  • `triton` = 3.0.0
  • `numpy` < 2.0
  • `grpcio` >= 1.51.1
  • `grpcio-tools` = 1.51.1
  • `protobuf` >= 3.20
  • `safetensors` >= 0.3
  • `huggingface-hub` >= 0.12
  • `loguru` >= 0.6
  • `boto3` (for S3 model source)
  • `einops` >= 0.6

Quantization and Other Optional Packages

  • `bitsandbytes` >= 0.43.1 (deprecated, use EETQ instead)
  • `hqq` >= 0.1.7
  • `accelerate` >= 0.24.1
  • `peft` = 0.4.0
  • `outlines` >= 0.1.1

Attention Packages

  • `flash-attn` (V1 or V2 CUDA bindings, system-dependent)
  • `flashinfer` = 0.1.6 (cu124 build)

Credentials

No Python-specific credentials. See Environment:Predibase_Lorax_Model_Source_Credentials for model download tokens.

Quick Install

# Recommended: Use the official Docker image which has all dependencies pre-installed
docker pull ghcr.io/predibase/lorax:latest

# Manual install from repository:
cd server
pip install ".[bnb, accelerate, quantize, peft, outlines]" --no-cache-dir
pip install -r requirements.txt

# Generate protobuf files:
make gen-server
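
After a manual install, a quick way to see which dependency layers are importable is to probe each module. A sketch (module names are taken from the lists above; the helper itself is not shipped with LoRAX):

```python
# Probe which dependency layers are importable; helper is not part of LoRAX.
import importlib

CORE = ["torch", "transformers", "grpc", "safetensors"]       # grpc = grpcio
OPTIONAL = ["bitsandbytes", "hqq", "outlines", "flash_attn"]

def import_report(modules):
    """Map each module name to True (importable) or False (missing)."""
    report = {}
    for name in modules:
        try:
            importlib.import_module(name)
            report[name] = True
        except ImportError:
            report[name] = False
    return report
```

Missing CORE entries mean the server cannot start; missing OPTIONAL entries only disable the corresponding feature, which is how LoRAX itself gates them (see Code Evidence).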

Code Evidence

Outlines import gating from `server/lorax_server/utils/logits_process.py:16-22`:

try:
    from outlines.fsm.guide import RegexGuide
    from outlines.fsm.json_schema import build_regex_from_schema

    HAS_OUTLINES = True
except ImportError:
    HAS_OUTLINES = False

HQQ optional import from `server/lorax_server/layers/hqq.py:5-11`:

HAS_HQQ = True
try:
    from hqq.core.quantize import BaseQuantizeConfig, HQQBackend, HQQLinear
    HQQLinear.set_backend(HQQBackend.ATEN)
except ImportError:
    HAS_HQQ = False

BitsAndBytes deprecation warning from `server/lorax_server/layers/bnb.py:7-12`:

def warn_deprecate_bnb():
    logger.warning(
        "Bitsandbytes 8bit is deprecated, using `eetq` is a drop-in replacement "
        "with better performance"
    )

PyTorch backend setup from `server/lorax_server/models/__init__.py:20-28`:

torch.backends.cuda.matmul.allow_tf32 = True  # allow TF32 matmuls on Ampere+ GPUs
torch.backends.cudnn.allow_tf32 = True        # same for cuDNN convolutions
torch.set_grad_enabled(False)                 # inference-only server: autograd off

Common Errors

  • `ImportError: Flash Attention V2 is not installed`: the flash_attn_2_cuda module is missing. Install with `make install-flash-attention-v2-cuda` or use the Docker image.
  • `ImportError: Could not import SGMV kernel from Punica`: the Punica kernels are not compiled. Build them with `cd server/punica_kernels && python setup.py install`.
  • `ImportError: No module named 'outlines'`: Outlines is not installed, so structured output is unavailable. Install with `pip install "outlines>=0.1.1"`.
  • `ImportError: No module named 'EETQ'`: EETQ quantization is not available. Build the EETQ kernels or use the Docker image.

Compatibility Notes

  • Triton 3.0.0: Only available on Linux x86_64. Not available on macOS or Windows.
  • NumPy < 2.0: Explicitly pinned to avoid breaking changes in NumPy 2.0 API.
  • BitsAndBytes: Marked as deprecated in LoRAX. EETQ is the recommended replacement with better performance.
  • Outlines: Optional dependency. Only needed for JSON schema-constrained generation.
  • FlashInfer: Only cu124 build is tested. Must match the CUDA version.
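
The version constraints above (e.g. NumPy < 2.0, FlashInfer = 0.1.6) can be validated with a small comparator. A simplified sketch that handles only numeric release segments, not a full PEP 440 implementation:

```python
# Simplified version-pin comparator; numeric release segments only,
# not a full PEP 440 implementation (no pre-releases or local tags).
def satisfies_pin(version: str, pin: str) -> bool:
    """Check a version like '1.26.4' against a pin like '<2.0' or '>=1.51.1'."""
    for op in ("<=", ">=", "==", "<", ">"):   # two-char operators first
        if pin.startswith(op):
            target = pin[len(op):].strip()
            break
    else:
        raise ValueError(f"unsupported pin: {pin!r}")

    def numeric(v):
        return tuple(int(part) for part in v.split("."))

    a, b = numeric(version), numeric(target)
    width = max(len(a), len(b))               # pad so '2.0' compares with '2.0.0'
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return {"<": a < b, "<=": a <= b,
            ">": a > b, ">=": a >= b, "==": a == b}[op]
```

For instance, `satisfies_pin("1.26.4", "<2.0")` is True, while `satisfies_pin("2.0.0", "<2.0")` is False.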
