
Implementation:Spcl Graph of thoughts Llama2HF

From Leeroopedia
Knowledge Sources
Domains LLM_Integration, Quantization
Principles Principle:Spcl_Graph_of_thoughts_Local_LLM_Inference
Environments Environment:Spcl_Graph_of_thoughts_Local_LLaMA_GPU_Inference
Source File graph_of_thoughts/language_models/llamachat_hf.py, Lines 15-120
Last Updated 2026-02-14

Overview

The Llama2HF class provides a concrete implementation of AbstractLanguageModel that runs LLaMA-2 models locally via HuggingFace Transformers with BitsAndBytes 4-bit quantization. It supports configurable model variants, top-k sampling, response caching, and formats all queries using the LLaMA-2 chat template.

Import

from graph_of_thoughts.language_models import Llama2HF

Class Signature

class Llama2HF(AbstractLanguageModel):
    def __init__(
        self, config_path: str = "", model_name: str = "llama7b-hf", cache: bool = False
    ) -> None: ...

    def query(
        self, query: str, num_responses: int = 1
    ) -> List[Dict]: ...

    def get_response_texts(
        self, query_responses: List[Dict]
    ) -> List[str]: ...

External Dependencies

  • torch -- PyTorch, required for tensor operations and bfloat16 compute dtype
  • transformers -- HuggingFace Transformers library (provides AutoConfig, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline)
  • bitsandbytes -- 4-bit and 8-bit quantization library for CUDA

Configuration Parameters

Parameter Type Description
model_id str HuggingFace model identifier (e.g., Llama-2-7b-chat-hf); prefixed with meta-llama/ at load time
prompt_token_cost float Cost per 1000 prompt tokens (typically 0.0 for local models)
response_token_cost float Cost per 1000 completion tokens (typically 0.0 for local models)
temperature float Randomness of model output
top_k int Top-K sampling parameter limiting token selection
max_tokens int Maximum sequence length for generation
cache_dir str Local directory for caching downloaded model weights (set as TRANSFORMERS_CACHE)
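Putting the table together, a configuration entry might look like the following. This is a minimal sketch: the field names come from the table above, but the exact file schema (nesting the fields under the model name, and the specific values shown) is an assumption, not taken from the source file.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical config.json entry for the "llama7b-hf" model name; field names
# mirror the parameter table above, values are illustrative assumptions.
config = {
    "llama7b-hf": {
        "model_id": "Llama-2-7b-chat-hf",
        "prompt_token_cost": 0.0,
        "response_token_cost": 0.0,
        "temperature": 0.6,
        "top_k": 10,
        "max_tokens": 1024,
        "cache_dir": "/tmp/llama_cache",
    }
}

path = Path(tempfile.mkdtemp()) / "config.json"
path.write_text(json.dumps(config, indent=4))

# The constructor would look up its model_name key and read these fields.
loaded = json.loads(path.read_text())["llama7b-hf"]
print(loaded["model_id"])  # Llama-2-7b-chat-hf
```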

Quantization Configuration

The model is loaded with BitsAndBytes 4-bit quantization using the following settings:

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
Setting Value Purpose
load_in_4bit True Enable 4-bit weight quantization
bnb_4bit_quant_type "nf4" Use Normal Float 4-bit, optimal for normally distributed weights
bnb_4bit_use_double_quant True Quantize the quantization constants for additional memory savings
bnb_4bit_compute_dtype torch.bfloat16 Perform computation in bfloat16 for numerical stability
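To see why these settings matter, a back-of-the-envelope estimate of the weight memory for a roughly 7-billion-parameter model. The ~0.4 bits/parameter overhead reduction for double quantization is the figure reported for NF4 in the QLoRA paper; the parameter count is approximate.

```python
params = 7e9  # approximate parameter count of LLaMA-2 7B

fp16_gib = params * 2 / 2**30   # 2 bytes per weight in fp16/bf16
nf4_gib = params * 0.5 / 2**30  # 4 bits per weight under NF4

# Double quantization compresses the per-block quantization constants,
# saving roughly 0.4 bits per parameter (QLoRA paper figure for NF4).
double_quant_saving_gib = params * 0.4 / 8 / 2**30

print(f"bf16 weights:        ~{fp16_gib:.1f} GiB")   # ~13.0 GiB
print(f"nf4 weights:         ~{nf4_gib:.1f} GiB")    # ~3.3 GiB
print(f"double-quant saving: ~{double_quant_saving_gib:.2f} GiB")
```

The roughly 4x reduction in weight memory is what makes a 7B model fit comfortably on a single consumer GPU, while `bnb_4bit_compute_dtype=torch.bfloat16` keeps the actual matrix multiplies in a higher-precision dtype.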

I/O Behavior

Input: The path to a JSON configuration file containing the model settings and cache directory.

Output: An initialized Llama2HF instance with a loaded, quantized model ready for text generation.

query()

  • Input: A string query and optional num_responses count.
  • Output: A list of dictionaries, each containing a generated_text key with the model's response.
  • Template formatting: The query is wrapped in the LLaMA-2 chat template before being sent to the model:
<s><<SYS>>You are a helpful assistant. Always follow the instructions precisely and output the response exactly in the requested format.<</SYS>>

[INST] {query} [/INST]
  • Response processing: The prompt prefix is stripped from each generated sequence, and only the model's response text is retained.
  • Multi-response handling: Generates responses one at a time in a loop (unlike the API-based approach that uses the n parameter).
  • Caching: If caching is enabled, returns cached responses for previously seen queries.
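The template wrapping and prompt stripping described above reduce to plain string manipulation. A minimal sketch, where format_query is a hypothetical free function (not a method of the real class) and the system prompt text is copied from the template shown:

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. Always follow the instructions precisely "
    "and output the response exactly in the requested format."
)

def format_query(query: str) -> str:
    """Wrap a raw query in the LLaMA-2 chat template shown above."""
    return f"<s><<SYS>>{SYSTEM_PROMPT}<</SYS>>\n\n[INST] {query} [/INST]"

prompt = format_query("Sort this list: [3, 1, 2]")

# Response processing: HF text-generation echoes the prompt, so the prefix
# is stripped and only the model's continuation is kept.
generated = prompt + " [1, 2, 3]"
reply = generated[len(prompt):].strip()
print(reply)  # [1, 2, 3]
```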

get_response_texts()

  • Input: A list of response dictionaries from query().
  • Output: A flat list of response text strings, extracted from the generated_text field of each dictionary.
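Given the response shape described above (one dictionary per response with a generated_text key), the extraction reduces to a single comprehension. A minimal sketch of the behavior:

```python
from typing import Dict, List

def get_response_texts(query_responses: List[Dict]) -> List[str]:
    """Flatten query() output into the raw response strings."""
    return [response["generated_text"] for response in query_responses]

texts = get_response_texts([
    {"generated_text": "[1, 2, 3]"},
    {"generated_text": "[1, 2, 3] again"},
])
print(texts)  # ['[1, 2, 3]', '[1, 2, 3] again']
```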

Key Implementation Details

  • The TRANSFORMERS_CACHE environment variable is set before importing the transformers module to ensure the cache directory takes effect.
  • The model is loaded with trust_remote_code=True and device_map="auto" for automatic GPU placement across available devices.
  • After loading, model.eval() and torch.no_grad() are called to disable dropout and gradient computation for inference efficiency.
  • The HuggingFace pipeline with task="text-generation" wraps the model and tokenizer for convenient inference.
  • Generation uses do_sample=True with top_k sampling and terminates at the EOS token.
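The caching and one-response-at-a-time behavior noted above can be sketched together. This is a simplified model of the control flow, not the actual class: the generate callable stands in for the HuggingFace pipeline, and the cache key and structure are assumptions.

```python
from typing import Callable, Dict, List

class CachingQueryLoop:
    """Sketch: query-level response cache plus a per-response generation loop."""

    def __init__(self, generate: Callable[[str], Dict], cache: bool = False):
        self.generate = generate
        self.cache = cache
        self.response_cache: Dict[str, List[Dict]] = {}

    def query(self, query: str, num_responses: int = 1) -> List[Dict]:
        # Cache hit: return previously generated responses for this query.
        if self.cache and query in self.response_cache:
            return self.response_cache[query]
        # Local HF generation has no `n` parameter, so loop num_responses times.
        responses = [self.generate(query) for _ in range(num_responses)]
        if self.cache:
            self.response_cache[query] = responses
        return responses

calls = []
def fake_generate(q: str) -> Dict:
    calls.append(q)
    return {"generated_text": f"reply to {q}"}

lm = CachingQueryLoop(fake_generate, cache=True)
lm.query("sort [3, 1, 2]", num_responses=3)
lm.query("sort [3, 1, 2]", num_responses=3)  # served from cache
print(len(calls))  # 3
```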

Usage Example

from graph_of_thoughts.language_models import Llama2HF

# Initialize with config for LLaMA-2 7B
lm = Llama2HF(config_path="config.json", model_name="llama7b-hf", cache=False)

# Single query
responses = lm.query("Sort this list: [3, 1, 2]")
texts = lm.get_response_texts(responses)
print(texts)  # ["[1, 2, 3]"]

# Multiple responses
responses = lm.query("Sort this list: [3, 1, 2]", num_responses=3)
all_texts = lm.get_response_texts(responses)
print(len(all_texts))  # 3

GitHub URL

graph_of_thoughts/language_models/llamachat_hf.py (Lines 15-120)
