Implementation:Spcl Graph of thoughts Llama2HF
| Knowledge Sources | |
|---|---|
| Domains | LLM_Integration, Quantization |
| Principles | Principle:Spcl_Graph_of_thoughts_Local_LLM_Inference |
| Environments | Environment:Spcl_Graph_of_thoughts_Local_LLaMA_GPU_Inference |
| Source File | graph_of_thoughts/language_models/llamachat_hf.py, Lines 15-120 |
| Last Updated | 2026-02-14 |
Overview
The Llama2HF class provides a concrete implementation of AbstractLanguageModel that runs LLaMA-2 models locally via HuggingFace Transformers with BitsAndBytes 4-bit quantization. It supports configurable model variants, top-k sampling, response caching, and formats all queries using the LLaMA-2 chat template.
Import
from graph_of_thoughts.language_models import Llama2HF
Class Signature
class Llama2HF(AbstractLanguageModel):
def __init__(
self, config_path: str = "", model_name: str = "llama7b-hf", cache: bool = False
) -> None: ...
def query(
self, query: str, num_responses: int = 1
) -> List[Dict]: ...
def get_response_texts(
self, query_responses: List[Dict]
) -> List[str]: ...
External Dependencies
- torch -- PyTorch, required for tensor operations and the bfloat16 compute dtype
- transformers -- HuggingFace Transformers library (provides AutoConfig, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, and pipeline)
- bitsandbytes -- 4-bit and 8-bit quantization library for CUDA
Configuration Parameters
| Parameter | Type | Description |
|---|---|---|
| model_id | str | HuggingFace model identifier (e.g., Llama-2-7b-chat-hf); prefixed with meta-llama/ at load time |
| prompt_token_cost | float | Cost per 1000 prompt tokens (typically 0.0 for local models) |
| response_token_cost | float | Cost per 1000 completion tokens (typically 0.0 for local models) |
| temperature | float | Randomness of model output |
| top_k | int | Top-K sampling parameter limiting token selection |
| max_tokens | int | Maximum sequence length for generation |
| cache_dir | str | Local directory for caching downloaded model weights (set as TRANSFORMERS_CACHE) |
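For illustration, a configuration file covering these parameters might look like the following. The exact file layout (a top-level key matching the model_name argument) is an assumption; field names are taken from the table above, and all values are placeholders.

```json
{
    "llama7b-hf": {
        "model_id": "Llama-2-7b-chat-hf",
        "prompt_token_cost": 0.0,
        "response_token_cost": 0.0,
        "temperature": 1.0,
        "top_k": 10,
        "max_tokens": 1024,
        "cache_dir": "/path/to/hf_cache"
    }
}
```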
Quantization Configuration
The model is loaded with BitsAndBytes 4-bit quantization using the following settings:
bnb_config = transformers.BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.bfloat16,
)
| Setting | Value | Purpose |
|---|---|---|
| load_in_4bit | True | Enable 4-bit weight quantization |
| bnb_4bit_quant_type | "nf4" | Use Normal Float 4-bit, optimal for normally distributed weights |
| bnb_4bit_use_double_quant | True | Quantize the quantization constants for additional memory savings |
| bnb_4bit_compute_dtype | torch.bfloat16 | Perform computation in bfloat16 for numerical stability |
I/O Behavior
__init__()
- Input: A JSON configuration file path containing model settings and the cache directory.
- Output: An initialized Llama2HF instance with a loaded, quantized model ready for text generation.
query()
- Input: A string query and an optional num_responses count.
- Output: A list of dictionaries, each containing a generated_text key with the model's response.
- Template formatting: The query is wrapped in the LLaMA-2 chat template before being sent to the model:
<s><<SYS>>You are a helpful assistant. Always follow the intstructions precisely and output the response exactly in the requested format.<</SYS>>
[INST] {query} [/INST]
- Response processing: The prompt prefix is stripped from each generated sequence, and only the model's response text is retained.
- Multi-response handling: Generates responses one at a time in a loop (unlike the API-based approach that uses the n parameter).
- Caching: If caching is enabled, returns cached responses for previously seen queries.
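The template wrapping and caching behavior described above can be sketched in isolation. This is a hypothetical reconstruction, not the class's actual code: the helper names are invented, and the system prompt is copied verbatim (including its "intstructions" typo) from the template shown above.

```python
# System prompt reproduced verbatim from the documented template,
# including the original "intstructions" typo.
SYS_PROMPT = (
    "You are a helpful assistant. Always follow the intstructions precisely "
    "and output the response exactly in the requested format."
)

def format_llama2_chat(query: str) -> str:
    """Wrap a raw query in the LLaMA-2 chat template."""
    return f"<s><<SYS>>{SYS_PROMPT}<</SYS>>\n[INST] {query} [/INST]"

class TinyResponseCache:
    """Minimal query -> responses cache, mirroring the cache=True behavior:
    a repeated query returns the stored responses without re-generating."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, query, compute):
        if query not in self._store:
            self._store[query] = compute(query)
        return self._store[query]
```

With caching enabled, only the first occurrence of a query pays the generation cost; every later identical query is a dictionary lookup.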
get_response_texts()
- Input: A list of response dictionaries from query().
- Output: A flat list of response text strings, extracted from the generated_text field of each dictionary.
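Based on the input/output description above, the extraction amounts to a one-line list comprehension; this standalone sketch assumes only the documented generated_text field.

```python
from typing import Dict, List

def get_response_texts(query_responses: List[Dict]) -> List[str]:
    """Flatten the generated_text field out of each response dictionary."""
    return [response["generated_text"] for response in query_responses]
```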
Key Implementation Details
- The TRANSFORMERS_CACHE environment variable is set before importing the transformers module to ensure the cache directory takes effect.
- The model is loaded with trust_remote_code=True and device_map="auto" for automatic GPU placement across available devices.
- After loading, model.eval() and torch.no_grad() are called to disable dropout and gradient computation for inference efficiency.
- The HuggingFace pipeline with task="text-generation" wraps the model and tokenizer for convenient inference.
- Generation uses do_sample=True with top_k sampling and terminates at the EOS token.
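The ordering constraint in the first point above is easy to get wrong: transformers reads TRANSFORMERS_CACHE from the environment at import time, so the variable must be set first. A minimal sketch (the path is a placeholder; the commented-out lines indicate where the real model loading would follow):

```python
import os

# Must happen BEFORE `import transformers`, or the cache
# directory setting is silently ignored for this process.
os.environ["TRANSFORMERS_CACHE"] = "/path/to/hf_cache"

# Only now import the library and build the quantized pipeline, e.g.:
# import transformers
# model = transformers.AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Llama-2-7b-chat-hf",
#     quantization_config=bnb_config,
#     device_map="auto",
#     trust_remote_code=True,
# )
```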
Usage Example
from graph_of_thoughts.language_models import Llama2HF
# Initialize with config for LLaMA-2 7B
lm = Llama2HF(config_path="config.json", model_name="llama7b-hf", cache=False)
# Single query
responses = lm.query("Sort this list: [3, 1, 2]")
texts = lm.get_response_texts(responses)
print(texts)  # e.g., ["[1, 2, 3]"] -- sampled output, not guaranteed verbatim
# Multiple responses
responses = lm.query("Sort this list: [3, 1, 2]", num_responses=3)
all_texts = lm.get_response_texts(responses)
print(len(all_texts)) # 3
Related Pages
- Principle:Spcl_Graph_of_thoughts_Local_LLM_Inference
- Environment:Spcl_Graph_of_thoughts_Python_3_8_Runtime
- Environment:Spcl_Graph_of_thoughts_Local_LLaMA_GPU_Inference
- Heuristic:Spcl_Graph_of_thoughts_Four_Bit_Quantization_For_Local_LLMs
GitHub URL
graph_of_thoughts/language_models/llamachat_hf.py (Lines 15-120)