Environment:Mit han lab Llm awq Python Runtime Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning |
| Last Updated | 2026-02-15 01:00 GMT |
Overview
Python 3.8+ runtime with PyTorch 2.3.0, Transformers 4.46.0, and pinned dependencies for AWQ quantization and evaluation.
Description
This environment provides the core Python runtime and library stack required for all AWQ operations: model loading, quantization, evaluation, and export. The project pins exact versions of its most critical dependencies (PyTorch, Transformers, Accelerate, lm-eval) to ensure reproducible quantization results. The environment includes HuggingFace ecosystem libraries for model management, lm-eval-harness for benchmark evaluation, and Gradio for the TinyChat serving UI.
Usage
Use this environment for all AWQ operations including model quantization (`awq/entry.py`), perplexity evaluation, lm-eval-harness benchmarks, HuggingFace model export, and TinyChat inference. This is the base prerequisite for every Implementation in the repository.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+ recommended) | Jetson requires JetPack-compatible Python |
| Hardware | NVIDIA GPU with CUDA support | CPU-only not supported for quantization or inference |
| RAM | 32GB+ recommended | Large models require significant host memory for loading |
| Disk | 50GB+ | Model checkpoints and calibration data |
Dependencies
Python Runtime
- `python` >= 3.8 (3.10 recommended; 3.8 for Jetson JetPack 5)
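As a quick sanity check, the `requires-python` floor from `pyproject.toml` can be asserted at startup (a minimal sketch; this guard is not part of the repository itself):

```python
import sys

# Fail fast if the interpreter is older than the pyproject.toml floor.
assert sys.version_info >= (3, 8), "AWQ requires Python >= 3.8"
print(f"Python {sys.version_info.major}.{sys.version_info.minor} OK")
```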
Core Packages (Pinned Versions)
- `torch` == 2.3.0
- `torchvision` == 0.18.0
- `transformers` == 4.46.0
- `accelerate` == 0.34.2
- `lm_eval` == 0.3.0
- `gradio` == 3.35.2
- `gradio_client` == 0.2.9
- `pydantic` == 1.10.19
Core Packages (Flexible Versions)
- `tokenizers` >= 0.12.1
- `sentencepiece`
- `texttable`
- `toml`
- `attributedict`
- `protobuf`
- `fastapi`
- `uvicorn`
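Because the pins above are strict, a quick way to confirm the environment resolved correctly is to compare installed versions against a subset of the pins (an illustrative check, not a repository script):

```python
from importlib.metadata import PackageNotFoundError, version

# A subset of the pinned packages from the dependency lists above.
pins = {"torch": "2.3.0", "transformers": "4.46.0", "accelerate": "0.34.2"}

for pkg, pinned in pins.items():
    try:
        installed = version(pkg)
        status = "OK" if installed == pinned else f"MISMATCH ({installed})"
    except PackageNotFoundError:
        status = "NOT INSTALLED"
    print(f"{pkg}=={pinned}: {status}")
```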
Credentials
The following environment variables may be needed depending on usage:
- `CUDA_VISIBLE_DEVICES`: Controls which GPUs are visible; if unset, auto-parallel assumes 8 GPUs are available
- `PYTORCH_CUDA_ALLOC_CONF`: Set to `expandable_segments:True` for InternVL3 inference
- `OPENAI_API_KEY`: Only required if using content moderation via `log_utils.violates_moderation()`
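For example, the first two variables can be set from Python before any CUDA initialization (values here are illustrative; both must be set before torch touches CUDA to take effect):

```python
import os

# Restrict the run to two GPUs and enable expandable segments.
# The specific values are examples, not requirements.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(f"{len(visible)} GPU(s) visible")  # 2 GPU(s) visible
```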
Quick Install
# Install AWQ and all dependencies
pip install -e .
# For Jetson devices: comment out torch==2.3.0 in pyproject.toml first,
# then install NVIDIA prebuilt PyTorch >= 2.0.0
Code Evidence
Version pinning from `pyproject.toml:10-24`:
requires-python = ">=3.8"
dependencies = [
"accelerate==0.34.2", "sentencepiece", "tokenizers>=0.12.1",
"torch==2.3.0", "torchvision==0.18.0",
"transformers==4.46.0",
"lm_eval==0.3.0", "texttable",
"toml", "attributedict",
"protobuf",
"gradio==3.35.2", "gradio_client==0.2.9",
"fastapi", "uvicorn",
"pydantic==1.10.19"
]
KV-cache disabling to prevent OOM with newer transformers, from `awq/entry.py:142`:
# Note (Haotian): To avoid OOM after huggingface transformers 4.36.2
config.use_cache = False
Multi-GPU auto-parallel from `awq/utils/parallel.py:19-27`:
cuda_visible_devices = os.environ.get("CUDA_VISIBLE_DEVICES", None)
if isinstance(cuda_visible_devices, str):
cuda_visible_devices = cuda_visible_devices.split(",")
else:
cuda_visible_devices = list(range(8))
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: No module named 'lm_eval'` | lm-eval not installed | `pip install lm_eval==0.3.0` |
| OOM with HuggingFace transformers >= 4.36.2 | KV cache consumes too much memory | Set `config.use_cache = False` (done automatically in entry.py) |
| `CUDA out of memory` during model loading | GPU VRAM insufficient | Use `--max_memory 0:10GiB cpu:30GiB` for device mapping |
| Version conflict with transformers | Mismatched transformers version | Pin to `transformers==4.46.0` |
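The `--max_memory` values in the table above follow a `device:size` convention. A sketch of how such entries map to the per-device dict that accelerate-style device mapping consumes (the helper name is hypothetical, not the repository's actual parser):

```python
def parse_max_memory(entries):
    """Turn CLI-style 'DEVICE:SIZE' entries (e.g. '0:10GiB', 'cpu:30GiB')
    into the {device: size} dict expected for device mapping.
    Integer device ids refer to CUDA devices."""
    max_memory = {}
    for entry in entries:
        device, _, size = entry.partition(":")
        key = int(device) if device.isdigit() else device
        max_memory[key] = size
    return max_memory

print(parse_max_memory(["0:10GiB", "cpu:30GiB"]))
# {0: '10GiB', 'cpu': '30GiB'}
```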
Compatibility Notes
- Jetson (Edge): Must remove `torch==2.3.0` pin from `pyproject.toml` and install NVIDIA prebuilt PyTorch >= 2.0.0. Use Python 3.8 for JetPack 5.
- Multi-GPU: Auto-parallel in `parallel.py` infers GPU count from model size: <20GB=1 GPU, 20-50GB=4 GPUs, >50GB=8 GPUs.
- lm-eval Version: Pinned to 0.3.0 which uses the `BaseLM` adapter interface. Newer versions (0.4+) have a different API.
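The size-to-GPU-count heuristic above can be sketched as follows (boundary handling at exactly 20 GB and 50 GB is an assumption here; check `awq/utils/parallel.py` for the authoritative thresholds):

```python
def infer_n_gpus(model_size_gb: float) -> int:
    # Mirrors the auto-parallel heuristic described above:
    # <20GB -> 1 GPU, 20-50GB -> 4 GPUs, >50GB -> 8 GPUs.
    if model_size_gb < 20:
        return 1
    if model_size_gb <= 50:
        return 4
    return 8

for size in (7, 33, 70):
    print(f"{size}GB model -> {infer_n_gpus(size)} GPU(s)")
```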
Related Pages
- Implementation:Mit_han_lab_Llm_awq_Get_calib_dataset
- Implementation:Mit_han_lab_Llm_awq_Run_awq
- Implementation:Mit_han_lab_Llm_awq_Auto_scale_block
- Implementation:Mit_han_lab_Llm_awq_Auto_clip_block
- Implementation:Mit_han_lab_Llm_awq_Apply_awq
- Implementation:Mit_han_lab_Llm_awq_Real_quantize_model_weight
- Implementation:Mit_han_lab_Llm_awq_Pseudo_quantize_model_weight
- Implementation:Mit_han_lab_Llm_awq_LMEvalAdaptor
- Implementation:Mit_han_lab_Llm_awq_Awq_config_export
- Implementation:Mit_han_lab_Llm_awq_Wikitext_eval_loop