
Environment: mit-han-lab/llm-awq Python Runtime Environment

From Leeroopedia
Knowledge Sources
Domains: Infrastructure, Deep_Learning
Last Updated: 2026-02-15 01:00 GMT

Overview

Python 3.8+ runtime with PyTorch 2.3.0, Transformers 4.46.0, and pinned dependencies for AWQ quantization and evaluation.

Description

This environment provides the core Python runtime and library stack required for all AWQ operations: model loading, quantization, evaluation, and export. The project pins exact versions for the most critical dependencies (PyTorch, Transformers, Accelerate, lm-eval) to ensure reproducible quantization results. The environment includes HuggingFace ecosystem libraries for model management, lm-eval-harness for benchmark evaluation, and Gradio for the TinyChat serving UI.

Usage

Use this environment for all AWQ operations including model quantization (`awq/entry.py`), perplexity evaluation, lm-eval-harness benchmarks, HuggingFace model export, and TinyChat inference. This is the base prerequisite for every Implementation in the repository.
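As a concrete sketch of the quantize-then-evaluate workflow, the commands below build the two typical `awq/entry.py` invocations. The model path and cache filename are placeholders, and the flag names follow the upstream repository's README; treat them as assumptions to verify against your checkout.

```shell
# Placeholder paths -- substitute your own checkpoint and cache locations.
MODEL_PATH=/models/llama-2-7b
AWQ_CACHE=awq_cache/llama-2-7b-w4-g128.pt

# Step 1: search for AWQ scales and dump them to the cache file.
SEARCH_CMD="python -m awq.entry --model_path $MODEL_PATH --w_bit 4 --q_group_size 128 --run_awq --dump_awq $AWQ_CACHE"

# Step 2: load the scales and evaluate with (pseudo) quantization.
EVAL_CMD="python -m awq.entry --model_path $MODEL_PATH --w_bit 4 --q_group_size 128 --load_awq $AWQ_CACHE --q_backend fake --tasks wikitext"

echo "$SEARCH_CMD"
echo "$EVAL_CMD"
```

Splitting the run into a scale-search step and an evaluation step lets you reuse the dumped scales across multiple evaluation configurations without re-running the search.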

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+ recommended) | Jetson requires JetPack-compatible Python |
| Hardware | NVIDIA GPU with CUDA support | CPU-only is not supported for quantization or inference |
| RAM | 32GB+ recommended | Large models require significant host memory for loading |
| Disk | 50GB+ | Model checkpoints and calibration data |

Dependencies

Python Runtime

  • `python` >= 3.8 (3.10 recommended; 3.8 for Jetson JetPack 5)

Core Packages (Pinned Versions)

  • `torch` == 2.3.0
  • `torchvision` == 0.18.0
  • `transformers` == 4.46.0
  • `accelerate` == 0.34.2
  • `lm_eval` == 0.3.0
  • `gradio` == 3.35.2
  • `gradio_client` == 0.2.9
  • `pydantic` == 1.10.19
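Because the pins above must match exactly for reproducible results, a quick sanity check of the installed versions can save debugging time. The snippet below is a minimal sketch using only the standard library; the checked subset of packages is illustrative.

```python
import sys
from importlib import metadata

# AWQ requires Python 3.8 or newer.
assert sys.version_info >= (3, 8), "AWQ requires Python >= 3.8"

# A subset of the pinned versions from pyproject.toml.
PINNED = {"torch": "2.3.0", "transformers": "4.46.0", "accelerate": "0.34.2"}

for pkg, want in PINNED.items():
    try:
        have = metadata.version(pkg)
    except metadata.PackageNotFoundError:
        print(f"{pkg}: not installed (expected {want})")
    else:
        status = "OK" if have == want else f"MISMATCH (expected {want})"
        print(f"{pkg}: {have} {status}")
```

Running this before a long quantization job catches version drift (for example, a transformers upgrade pulled in by another project) early.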

Core Packages (Flexible Versions)

  • `tokenizers` >= 0.12.1
  • `sentencepiece`
  • `texttable`
  • `toml`
  • `attributedict`
  • `protobuf`
  • `fastapi`
  • `uvicorn`

Credentials

The following environment variables may be needed depending on usage:

  • `CUDA_VISIBLE_DEVICES`: Controls which GPUs are visible (auto-parallel assumes up to 8 if unset)
  • `PYTORCH_CUDA_ALLOC_CONF`: Set to `expandable_segments:True` for InternVL3 inference
  • `OPENAI_API_KEY`: Only required if using content moderation via `log_utils.violates_moderation()`
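These variables must be set before the CUDA runtime is initialized, i.e. before importing `torch`. A minimal sketch (the specific GPU indices are illustrative):

```python
import os

# Set env vars BEFORE importing torch; CUDA reads them at initialization.
# setdefault keeps any values the user already exported.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")  # expose only GPUs 0 and 1
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

print("Visible GPUs:", os.environ["CUDA_VISIBLE_DEVICES"])
```

Using `setdefault` rather than direct assignment means a value exported in the shell still wins, which matches how the repository's auto-parallel code reads `CUDA_VISIBLE_DEVICES`.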

Quick Install

# Install AWQ and all dependencies
pip install -e .

# For Jetson devices: comment out torch==2.3.0 in pyproject.toml first,
# then install NVIDIA prebuilt PyTorch >= 2.0.0

Code Evidence

Version pinning from `pyproject.toml:10-24`:

requires-python = ">=3.8"
dependencies = [
    "accelerate==0.34.2", "sentencepiece", "tokenizers>=0.12.1",
    "torch==2.3.0", "torchvision==0.18.0",
    "transformers==4.46.0",
    "lm_eval==0.3.0", "texttable",
    "toml", "attributedict",
    "protobuf",
    "gradio==3.35.2", "gradio_client==0.2.9",
    "fastapi", "uvicorn",
    "pydantic==1.10.19"
]

OOM prevention with transformers cache from `awq/entry.py:142`:

# Note (Haotian): To avoid OOM after huggingface transformers 4.36.2
config.use_cache = False

Multi-GPU auto-parallel from `awq/utils/parallel.py:19-27`:

cuda_visible_devices = os.environ.get("CUDA_VISIBLE_DEVICES", None)
if isinstance(cuda_visible_devices, str):
    cuda_visible_devices = cuda_visible_devices.split(",")
else:
    cuda_visible_devices = list(range(8))

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: No module named 'lm_eval'` | lm-eval not installed | `pip install lm_eval==0.3.0` |
| OOM with HuggingFace transformers >= 4.36.2 | KV cache consumes too much memory | Set `config.use_cache = False` (done automatically in `entry.py`) |
| `CUDA out of memory` during model loading | Insufficient GPU VRAM | Use `--max_memory 0:10GiB cpu:30GiB` for device mapping |
| Version conflict with transformers | Mismatched transformers version | Pin to `transformers==4.46.0` |
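The `--max_memory` flag accepts `device:limit` pairs that map onto the per-device memory budget dict used by HuggingFace's `device_map` machinery. The helper below is an illustrative sketch of that conversion, not code from the repository:

```python
def parse_max_memory(entries):
    """Convert CLI-style entries like ["0:10GiB", "cpu:30GiB"] into the
    {device: limit} dict shape accepted by transformers' device_map
    machinery. Numeric keys are GPU indices; "cpu" caps host memory.
    (Hypothetical helper for illustration.)"""
    budget = {}
    for entry in entries:
        device, _, limit = entry.partition(":")
        key = int(device) if device.isdigit() else device
        budget[key] = limit
    return budget

print(parse_max_memory(["0:10GiB", "cpu:30GiB"]))
# → {0: '10GiB', 'cpu': '30GiB'}
```

Capping GPU 0 at 10GiB while allowing 30GiB of CPU offload lets a model that exceeds VRAM still load, at the cost of slower CPU-resident layers.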

Compatibility Notes

  • Jetson (Edge): Must remove `torch==2.3.0` pin from `pyproject.toml` and install NVIDIA prebuilt PyTorch >= 2.0.0. Use Python 3.8 for JetPack 5.
  • Multi-GPU: Auto-parallel in `parallel.py` infers GPU count from model size: <20GB=1 GPU, 20-50GB=4 GPUs, >50GB=8 GPUs.
  • lm-eval Version: Pinned to 0.3.0 which uses the `BaseLM` adapter interface. Newer versions (0.4+) have a different API.
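The size-based GPU heuristic in the multi-GPU note can be sketched as a small helper. The thresholds follow the note above; the behavior at exactly 20GB and 50GB is an assumption, so check `parallel.py` for the precise boundaries.

```python
def infer_gpu_count(model_size_gb: float) -> int:
    """Approximate the auto-parallel heuristic described above:
    <20GB -> 1 GPU, 20-50GB -> 4 GPUs, >50GB -> 8 GPUs.
    Boundary handling is an assumption, not taken from parallel.py."""
    if model_size_gb < 20:
        return 1
    if model_size_gb <= 50:
        return 4
    return 8
```

For example, a 13B model in fp16 (~26GB of weights) would be spread over 4 GPUs, while a 7B model (~14GB) stays on one.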
