# Environment: SPCL Graph of Thoughts Local LLaMA GPU Inference
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, LLM_Reasoning, GPU_Computing |
| Last Updated | 2026-02-14 03:30 GMT |
## Overview
GPU-accelerated local inference environment for LLaMA-2 models using HuggingFace Transformers with 4-bit quantization.
## Description
This environment provides the hardware and software stack required to run LLaMA-2 models locally through the `Llama2HF` language model backend. It uses HuggingFace Transformers with BitsAndBytes 4-bit NF4 quantization and bfloat16 compute dtype to reduce memory requirements. The model is loaded with `device_map="auto"` which automatically distributes layers across available GPUs. The `TRANSFORMERS_CACHE` environment variable is set programmatically before importing transformers to control the model download location.
## Usage

Use this environment when running Graph of Thoughts workflows with local LLM inference instead of the OpenAI API. It is a prerequisite for the `Llama2HF` implementation and is required for any custom workflow that uses LLaMA-2 models (7B, 13B, or 70B variants).
## System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (recommended) | CUDA and bitsandbytes have best Linux support |
| Hardware | NVIDIA GPU with CUDA support | Required for 4-bit quantized inference |
| VRAM | >= 6GB (7B model), >= 10GB (13B), multiple GPUs (70B) | 4-bit quantization reduces memory significantly |
| Disk | >= 15GB (7B), >= 30GB (13B), >= 100GB (70B) | For model download and cache |
| Network | Internet access (first run only) | To download model weights from HuggingFace |
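The VRAM figures above can be sanity-checked with a back-of-the-envelope estimate: 4-bit NF4 stores roughly 0.5 bytes per parameter, plus some headroom for activations, the KV cache, and the CUDA context. The sketch below uses a flat 2 GB overhead allowance, which is an assumption for illustration, not a measured figure:

```python
def estimate_4bit_vram_gb(n_params_billion: float, overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate for 4-bit quantized inference.

    NF4 stores ~0.5 bytes per parameter; `overhead_gb` is a crude allowance
    for activations, KV cache, and CUDA context (an assumption, not measured).
    """
    weight_gb = n_params_billion * 1e9 * 0.5 / 1024**3
    return weight_gb + overhead_gb

for size in (7, 13, 70):
    print(f"{size}B: ~{estimate_4bit_vram_gb(size):.1f} GB")
```

The 7B result (~5.3 GB) is consistent with the >= 6GB row, and the 70B result (~34.6 GB) shows why that variant needs multiple GPUs.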
## Dependencies

### System Packages

- NVIDIA CUDA toolkit (compatible with the installed PyTorch version)
- NVIDIA GPU drivers

### Python Packages

- `torch` >= 2.0.1, < 3.0.0 (with CUDA support)
- `transformers` >= 4.31.0, < 5.0.0
- `accelerate` >= 0.21.0, < 1.0.0
- `bitsandbytes` >= 0.41.0, < 1.0.0
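For reproducible installs, the version bounds above can be captured in a `requirements.txt` such as the following (the CUDA-enabled `torch` wheel must still come from the appropriate PyTorch index URL rather than PyPI):

```text
torch>=2.0.1,<3.0.0
transformers>=4.31.0,<5.0.0
accelerate>=0.21.0,<1.0.0
bitsandbytes>=0.41.0,<1.0.0
```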
## Credentials
The following credentials must be configured:
- HuggingFace access token: Required to download LLaMA-2 model weights. Must be authenticated via `huggingface-cli login --token <your token>`.
- Meta LLaMA-2 access: Must be requested via Meta form using the same email as the HuggingFace account. After Meta approval, accept the license on the HuggingFace model card.
## Quick Install

```bash
# Install the framework
pip install graph_of_thoughts

# Ensure CUDA-compatible PyTorch is installed
pip install torch --index-url https://download.pytorch.org/whl/cu118

# Log in to HuggingFace (required for LLaMA-2 access)
huggingface-cli login --token <your-hf-token>

# Create and configure config.json
cp graph_of_thoughts/language_models/config_template.json config.json

# Set cache_dir in config.json to your desired model storage location
```
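The last step can also be scripted. The helper below is a minimal sketch, assuming `config.json` follows the `config_template.json` structure (top-level model names mapping to objects with a `cache_dir` field); the function name and the example path are illustrative, not part of the framework:

```python
import json
from pathlib import Path

def set_cache_dir(config_path: str, cache_dir: str) -> None:
    """Rewrite every model entry's cache_dir in a Graph of Thoughts config file.

    Assumes the config_template.json layout: top-level model names mapping
    to dicts that may contain a "cache_dir" key.
    """
    path = Path(config_path)
    cfg = json.loads(path.read_text())
    for entry in cfg.values():
        if isinstance(entry, dict) and "cache_dir" in entry:
            entry["cache_dir"] = cache_dir
    path.write_text(json.dumps(cfg, indent=4))

# Example: point the cache at a user-writable directory instead of /llama
# set_cache_dir("config.json", str(Path.home() / ".cache" / "llama"))
```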
## Code Evidence

TRANSFORMERS_CACHE override from `graph_of_thoughts/language_models/llamachat_hf.py:48-50`:

```python
# Important: must be done before importing transformers
os.environ["TRANSFORMERS_CACHE"] = self.config["cache_dir"]
import transformers
```
4-bit quantization configuration from `graph_of_thoughts/language_models/llamachat_hf.py:54-59`:

```python
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```
Model loading with auto device mapping from `graph_of_thoughts/language_models/llamachat_hf.py:62-68`:

```python
self.model = transformers.AutoModelForCausalLM.from_pretrained(
    hf_model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map="auto",
)
```
Inference mode setup from `graph_of_thoughts/language_models/llamachat_hf.py:69-70`:

```python
self.model.eval()
torch.no_grad()
```

Note that the bare `torch.no_grad()` call on line 70 constructs a context manager and immediately discards it, so it does not actually disable gradient tracking; to take effect it would need to wrap the generation call as `with torch.no_grad():` (or be used as a decorator).
Config template for LLaMA models from `config_template.json:22-30`:

```json
"llama7b-hf" : {
    "model_id": "Llama-2-7b-chat-hf",
    "cache_dir": "/llama",
    "prompt_token_cost": 0.0,
    "response_token_cost": 0.0,
    "temperature": 0.6,
    "top_k": 10,
    "max_tokens": 4096
}
```
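A custom `config.json` can be sanity-checked before a long model download. The snippet below embeds the template entry as a string so it is self-contained; the `REQUIRED` set is an assumption about which fields matter, not taken from the `Llama2HF` source:

```python
import json

# The llama7b-hf entry from config_template.json, reproduced inline
# so this check runs without any files present.
entry_json = """
{
    "model_id": "Llama-2-7b-chat-hf",
    "cache_dir": "/llama",
    "prompt_token_cost": 0.0,
    "response_token_cost": 0.0,
    "temperature": 0.6,
    "top_k": 10,
    "max_tokens": 4096
}
"""

# Assumed-required fields for a model entry (an illustration, not
# derived from the Llama2HF implementation).
REQUIRED = {"model_id", "cache_dir", "temperature", "top_k", "max_tokens"}

entry = json.loads(entry_json)
missing = REQUIRED - entry.keys()
print("missing keys:", missing or "none")
```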
## Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `CUDA out of memory` | Insufficient GPU VRAM for the model size | Use a smaller model variant (7B instead of 13B) or ensure 4-bit quantization is active |
| `OSError: You are trying to access a gated repo` | HuggingFace token not authenticated for LLaMA-2 | Run `huggingface-cli login` and accept the LLaMA-2 license on HuggingFace |
| `ImportError: No module named 'bitsandbytes'` | bitsandbytes not installed or incompatible with OS | Install with `pip install bitsandbytes>=0.41.0`; note Windows support is limited |
| `RuntimeError: No CUDA GPUs are available` | No NVIDIA GPU detected or CUDA drivers not installed | Install NVIDIA CUDA drivers and verify with `nvidia-smi` |
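Several of these errors can be caught up front with a quick dependency check. The sketch below uses `importlib.util.find_spec` so it reports presence without importing the packages (importing `torch`/`bitsandbytes` can itself fail or trigger CUDA initialization); the function name is illustrative:

```python
import importlib.util

def check_packages(names=("torch", "transformers", "accelerate", "bitsandbytes")):
    """Return {package: installed?} without importing the packages.

    find_spec only consults the import machinery, so a broken CUDA setup
    does not crash the check itself.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}

for pkg, ok in check_packages().items():
    print(f"{pkg}: {'OK' if ok else 'MISSING -- pip install ' + pkg}")
```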
## Compatibility Notes
- Multi-GPU: The `device_map="auto"` setting automatically splits larger models (13B, 70B) across multiple GPUs when available. This is handled by the `accelerate` library.
- Windows: BitsAndBytes has limited Windows support. Use WSL2 or Linux for best results.
- Model Variants: Three pre-configured model sizes: `llama7b-hf` (7B), `llama13b-hf` (13B), `llama70b-hf` (70B). The 70B model requires multiple GPUs.
- Cache Directory: The `cache_dir` config field controls where model weights are stored. The default `/llama` path requires root access; change to a user-writable directory.
- trust_remote_code: The code sets `trust_remote_code=True` when loading the model. This is required for some model architectures but means you trust the model author's code.