
Environment: SPCL Graph of Thoughts Local LLaMA GPU Inference

From Leeroopedia
Knowledge Sources
Domains Infrastructure, LLM_Reasoning, GPU_Computing
Last Updated 2026-02-14 03:30 GMT

Overview

GPU-accelerated local inference environment for LLaMA-2 models using HuggingFace Transformers with 4-bit quantization.

Description

This environment provides the hardware and software stack required to run LLaMA-2 models locally through the `Llama2HF` language model backend. It uses HuggingFace Transformers with BitsAndBytes 4-bit NF4 quantization and bfloat16 compute dtype to reduce memory requirements. The model is loaded with `device_map="auto"` which automatically distributes layers across available GPUs. The `TRANSFORMERS_CACHE` environment variable is set programmatically before importing transformers to control the model download location.

Usage

Use this environment when running Graph of Thoughts workflows with local LLM inference instead of the OpenAI API. This is the prerequisite for the `Llama2HF` implementation and is required for any custom workflow that uses LLaMA-2 models (7B, 13B, or 70B variants).

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| OS | Linux (recommended) | CUDA and bitsandbytes have the best Linux support |
| Hardware | NVIDIA GPU with CUDA support | Required for 4-bit quantized inference |
| VRAM | >= 6 GB (7B), >= 10 GB (13B), multiple GPUs (70B) | 4-bit quantization reduces memory significantly |
| Disk | >= 15 GB (7B), >= 30 GB (13B), >= 100 GB (70B) | For model download and cache |
| Network | Internet access (first run only) | To download model weights from HuggingFace |
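The VRAM rows above can be sanity-checked with back-of-envelope arithmetic: NF4 stores weights at roughly 4.5 bits per parameter (4-bit values plus per-block quantization constants), and runtime adds overhead for activations, the KV cache, and the CUDA context. The bits-per-parameter and overhead figures below are rough assumptions, not measurements:

```python
# Back-of-envelope VRAM estimate for 4-bit NF4 weights (illustrative only).
# Real usage varies with sequence length, batch size, and driver overhead.

def estimate_vram_gb(n_params_billion: float, bits_per_param: float = 4.5,
                     overhead_gb: float = 1.5) -> float:
    """Weights at ~4.5 bits/param (NF4 + quantization constants) plus a
    flat allowance for activations, KV cache, and the CUDA context."""
    weight_gb = n_params_billion * 1e9 * bits_per_param / 8 / 1e9
    return weight_gb + overhead_gb

for size in (7, 13, 70):
    print(f"{size}B: ~{estimate_vram_gb(size):.1f} GB")
```

The 7B estimate lands around 5.4 GB, consistent with the >= 6 GB row; the 70B estimate of roughly 41 GB explains why that variant needs multiple GPUs.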

Dependencies

System Packages

  • NVIDIA CUDA toolkit (compatible with PyTorch version)
  • NVIDIA GPU drivers

Python Packages

  • `torch` >= 2.0.1, < 3.0.0 (with CUDA support)
  • `transformers` >= 4.31.0, < 5.0.0
  • `accelerate` >= 0.21.0, < 1.0.0
  • `bitsandbytes` >= 0.41.0, < 1.0.0
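The version pins above can be checked programmatically. This is a minimal sketch using plain `X.Y.Z` tuple comparison (no `packaging` dependency, no pre-release handling); pair it with `importlib.metadata.version(name)` to read installed versions:

```python
# Minimal version-range check for the pins listed above (sketch only;
# plain "X.Y.Z" strings -- pre-release tags are out of scope).

def vtuple(v: str) -> tuple:
    return tuple(int(p) for p in v.split("."))

def in_range(installed: str, lower: str, upper: str) -> bool:
    """True if lower <= installed < upper."""
    return vtuple(lower) <= vtuple(installed) < vtuple(upper)

PINS = {
    "torch": ("2.0.1", "3.0.0"),
    "transformers": ("4.31.0", "5.0.0"),
    "accelerate": ("0.21.0", "1.0.0"),
    "bitsandbytes": ("0.41.0", "1.0.0"),
}
```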

Credentials

The following credentials must be configured:

  • HuggingFace access token: Required to download LLaMA-2 model weights. Must be authenticated via `huggingface-cli login --token <your-hf-token>`.
  • Meta LLaMA-2 access: Must be requested via Meta form using the same email as the HuggingFace account. After Meta approval, accept the license on the HuggingFace model card.
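A quick way to confirm the first credential is in place is to check for a stored token. The path below is the hub's conventional token location, but treat it as an assumption; the `huggingface_hub` library's own API is the authoritative check:

```python
# Sketch: check whether a HuggingFace token is already stored locally.
# The default path is an assumption about the hub's cache layout.
from pathlib import Path

def has_hf_token(token_file: Path = Path.home() / ".cache/huggingface/token") -> bool:
    """True if a non-empty token file exists at the conventional location."""
    return token_file.is_file() and token_file.read_text().strip() != ""
```

Note this only confirms a token is stored, not that Meta has approved LLaMA-2 access for the associated account.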

Quick Install

# Install the framework
pip install graph_of_thoughts

# Ensure CUDA-compatible PyTorch is installed
pip install torch --index-url https://download.pytorch.org/whl/cu118

# Log in to HuggingFace (required for LLaMA-2 access)
huggingface-cli login --token <your-hf-token>

# Create and configure config.json
cp graph_of_thoughts/language_models/config_template.json config.json
# Set cache_dir in config.json to your desired model storage location
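The final step above, editing `cache_dir`, can be scripted with the standard library. This sketch assumes the template's structure (a top-level key per model entry, as shown in the Code Evidence section below); the helper name is illustrative:

```python
# Sketch: point cache_dir at a user-writable location in config.json.
import json
from pathlib import Path

def set_cache_dir(config_path: str, model_key: str, cache_dir: str) -> dict:
    """Rewrite the cache_dir field of one model entry in config.json."""
    cfg = json.loads(Path(config_path).read_text())
    cfg[model_key]["cache_dir"] = cache_dir
    Path(config_path).write_text(json.dumps(cfg, indent=4))
    return cfg

# Example: set_cache_dir("config.json", "llama7b-hf", str(Path.home() / "llama"))
```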

Code Evidence

TRANSFORMERS_CACHE override from `graph_of_thoughts/language_models/llamachat_hf.py:48-50`:

# Important: must be done before importing transformers
os.environ["TRANSFORMERS_CACHE"] = self.config["cache_dir"]
import transformers

4-bit quantization configuration from `graph_of_thoughts/language_models/llamachat_hf.py:54-59`:

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
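The `bnb_4bit_use_double_quant=True` flag quantizes the per-block scaling constants themselves. With the block sizes used in the QLoRA setup (64 weights per quantization block, second-level constants over blocks of 256), the effective footprint works out as follows; the arithmetic is a sketch of the published scheme, not a measurement of this codebase:

```python
# Back-of-envelope: bits per parameter under NF4, with and without
# double quantization (QLoRA-style block sizes assumed).

BLOCK = 64            # weights per NF4 quantization block
META_BLOCK = 256      # absmax constants per second-level block

plain = 4 + 32 / BLOCK                              # fp32 absmax per block
double = 4 + 8 / BLOCK + 32 / (BLOCK * META_BLOCK)  # 8-bit absmax + fp32 meta

print(f"NF4: {plain:.3f} bits/param; with double quant: {double:.3f}")
# double quantization saves roughly 0.37 bits per parameter
```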

Model loading with auto device mapping from `graph_of_thoughts/language_models/llamachat_hf.py:62-68`:

self.model = transformers.AutoModelForCausalLM.from_pretrained(
    hf_model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map="auto",
)

Inference mode setup from `graph_of_thoughts/language_models/llamachat_hf.py:69-70`:

self.model.eval()
torch.no_grad()

Note: as written in the source, `torch.no_grad()` is called as a bare statement, which constructs and immediately discards a context manager without disabling gradient tracking. The effective pattern is wrapping the generation call in `with torch.no_grad():` (or using it as a decorator). Inference still works because generation never calls `backward()`.

Config template for LLaMA models from `config_template.json:22-30`:

"llama7b-hf" : {
    "model_id": "Llama-2-7b-chat-hf",
    "cache_dir": "/llama",
    "prompt_token_cost": 0.0,
    "response_token_cost": 0.0,
    "temperature": 0.6,
    "top_k": 10,
    "max_tokens": 4096
}
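A backend like `Llama2HF` resolves one such entry by key and derives the full HuggingFace repo id from `model_id`. The sketch below assumes the `meta-llama/` organization prefix and an illustrative helper name; the actual resolution logic lives in `llamachat_hf.py`:

```python
# Sketch: resolve one model entry from config.json (keys as in the template).
# The "meta-llama/" prefix is an assumption for illustration.
import json

def load_model_config(path: str, model_name: str = "llama7b-hf") -> dict:
    """Return a model entry augmented with its full HuggingFace repo id."""
    with open(path) as f:
        config = json.load(f)
    entry = config[model_name]
    hf_model_id = f"meta-llama/{entry['model_id']}"
    return {**entry, "hf_model_id": hf_model_id}
```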

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `CUDA out of memory` | Insufficient GPU VRAM for the model size | Use a smaller model variant (7B instead of 13B) or ensure 4-bit quantization is active |
| `OSError: You are trying to access a gated repo` | HuggingFace token not authenticated for LLaMA-2 | Run `huggingface-cli login` and accept the LLaMA-2 license on HuggingFace |
| `ImportError: No module named 'bitsandbytes'` | bitsandbytes not installed or incompatible with the OS | Install with `pip install "bitsandbytes>=0.41.0"`; note Windows support is limited |
| `RuntimeError: No CUDA GPUs are available` | No NVIDIA GPU detected or CUDA drivers not installed | Install NVIDIA CUDA drivers and verify with `nvidia-smi` |
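For scripted runs, the table above can be encoded as a substring lookup so a wrapper can print a remedy alongside the traceback. The helper and its message patterns are hypothetical, matching only the errors documented here:

```python
# Sketch: map the documented error messages to suggested fixes
# (hypothetical helper; substring matching only).
FIXES = {
    "CUDA out of memory": "Use a smaller variant or confirm 4-bit quantization is active.",
    "gated repo": "Run `huggingface-cli login` and accept the LLaMA-2 license.",
    "No module named 'bitsandbytes'": "pip install 'bitsandbytes>=0.41.0' (Linux/WSL2).",
    "No CUDA GPUs are available": "Install NVIDIA drivers; verify with `nvidia-smi`.",
}

def suggest_fix(error_message: str) -> str:
    """Return the first documented remedy whose pattern appears in the message."""
    for pattern, fix in FIXES.items():
        if pattern in error_message:
            return fix
    return "No known fix; check the traceback."
```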

Compatibility Notes

  • Multi-GPU: The `device_map="auto"` setting automatically splits larger models (13B, 70B) across multiple GPUs when available. This is handled by the `accelerate` library.
  • Windows: BitsAndBytes has limited Windows support. Use WSL2 or Linux for best results.
  • Model Variants: Three pre-configured model sizes: `llama7b-hf` (7B), `llama13b-hf` (13B), `llama70b-hf` (70B). The 70B model requires multiple GPUs.
  • Cache Directory: The `cache_dir` config field controls where model weights are stored. The default `/llama` path requires root access; change it to a user-writable directory. Note that newer `transformers` releases deprecate the `TRANSFORMERS_CACHE` variable in favor of `HF_HOME`/`HF_HUB_CACHE`, so stay within the pinned version range if you rely on this override.
  • trust_remote_code: The code sets `trust_remote_code=True` when loading the model. This is required for some model architectures but means you trust the model author's code.
