Implementation:Spcl Graph of thoughts Llama2HF
| Knowledge Sources | |
|---|---|
| Domains | LLM_Integration, Quantization |
| Principles | Principle:Spcl_Graph_of_thoughts_Local_LLM_Inference |
| Environments | Environment:Spcl_Graph_of_thoughts_Local_LLaMA_GPU_Inference |
| Source File | graph_of_thoughts/language_models/llamachat_hf.py, Lines 15-120 |
| Last Updated | 2026-02-14 |
Overview
The Llama2HF class provides a concrete implementation of AbstractLanguageModel that runs LLaMA-2 models locally via HuggingFace Transformers with BitsAndBytes 4-bit quantization. It supports configurable model variants, top-k sampling, response caching, and formats all queries using the LLaMA-2 chat template.
Import
from graph_of_thoughts.language_models import Llama2HF
Class Signature
class Llama2HF(AbstractLanguageModel):
def __init__(
self, config_path: str = "", model_name: str = "llama7b-hf", cache: bool = False
) -> None: ...
def query(
self, query: str, num_responses: int = 1
) -> List[Dict]: ...
def get_response_texts(
self, query_responses: List[Dict]
) -> List[str]: ...
External Dependencies
- torch -- PyTorch, required for tensor operations and the bfloat16 compute dtype
- transformers -- HuggingFace Transformers library (provides AutoConfig, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, and pipeline)
- bitsandbytes -- 4-bit and 8-bit quantization library for CUDA
Configuration Parameters
| Parameter | Type | Description |
|---|---|---|
| model_id | str | HuggingFace model identifier (e.g., Llama-2-7b-chat-hf); prefixed with meta-llama/ at load time |
| prompt_token_cost | float | Cost per 1000 prompt tokens (typically 0.0 for local models) |
| response_token_cost | float | Cost per 1000 completion tokens (typically 0.0 for local models) |
| temperature | float | Randomness of model output |
| top_k | int | Top-K sampling parameter limiting token selection |
| max_tokens | int | Maximum sequence length for generation |
| cache_dir | str | Local directory for caching downloaded model weights (set as TRANSFORMERS_CACHE) |
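For illustration, a configuration file covering these parameters might look like the following. The exact file layout (a top-level key matching the model_name argument) is an assumption; field names are taken from the table above, and all values are placeholders.

```json
{
    "llama7b-hf": {
        "model_id": "Llama-2-7b-chat-hf",
        "prompt_token_cost": 0.0,
        "response_token_cost": 0.0,
        "temperature": 1.0,
        "top_k": 10,
        "max_tokens": 1024,
        "cache_dir": "/path/to/hf_cache"
    }
}
```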
Quantization Configuration
The model is loaded with BitsAndBytes 4-bit quantization using the following settings:
bnb_config = transformers.BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.bfloat16,
)
| Setting | Value | Purpose |
|---|---|---|
| load_in_4bit | True | Enable 4-bit weight quantization |
| bnb_4bit_quant_type | "nf4" | Use Normal Float 4-bit, optimal for normally distributed weights |
| bnb_4bit_use_double_quant | True | Quantize the quantization constants for additional memory savings |
| bnb_4bit_compute_dtype | torch.bfloat16 | Perform computation in bfloat16 for numerical stability |
I/O Behavior
__init__()
- Input: A JSON configuration file path containing model settings and the cache directory.
- Output: An initialized Llama2HF instance with a loaded, quantized model ready for text generation.
query()
- Input: A string query and an optional num_responses count.
- Output: A list of dictionaries, each containing a generated_text key with the model's response.
- Template formatting: The query is wrapped in the LLaMA-2 chat template before being sent to the model:
<s><<SYS>>You are a helpful assistant. Always follow the intstructions precisely and output the response exactly in the requested format.<</SYS>>
[INST] {query} [/INST]
- Response processing: The prompt prefix is stripped from each generated sequence, and only the model's response text is retained.
- Multi-response handling: Generates responses one at a time in a loop (unlike the API-based approach that uses the n parameter).
- Caching: If caching is enabled, returns cached responses for previously seen queries.
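The template wrapping and caching behavior described above can be sketched in isolation. This is a hypothetical reconstruction, not the class's actual code: the helper names are invented, and the system prompt is copied verbatim (including its "intstructions" typo) from the template shown above.

```python
# System prompt reproduced verbatim from the documented template,
# including the original "intstructions" typo.
SYS_PROMPT = (
    "You are a helpful assistant. Always follow the intstructions precisely "
    "and output the response exactly in the requested format."
)

def format_llama2_chat(query: str) -> str:
    """Wrap a raw query in the LLaMA-2 chat template."""
    return f"<s><<SYS>>{SYS_PROMPT}<</SYS>>\n[INST] {query} [/INST]"

class TinyResponseCache:
    """Minimal query -> responses cache, mirroring the cache=True behavior:
    a repeated query returns the stored responses without re-generating."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, query, compute):
        if query not in self._store:
            self._store[query] = compute(query)
        return self._store[query]
```

With caching enabled, only the first occurrence of a query pays the generation cost; every later identical query is a dictionary lookup.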
get_response_texts()
- Input: A list of response dictionaries from query().
- Output: A flat list of response text strings, extracted from the generated_text field of each dictionary.
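Based on the input/output description above, the extraction amounts to a one-line list comprehension; this standalone sketch assumes only the documented generated_text field.

```python
from typing import Dict, List

def get_response_texts(query_responses: List[Dict]) -> List[str]:
    """Flatten the generated_text field out of each response dictionary."""
    return [response["generated_text"] for response in query_responses]
```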
Key Implementation Details
- The TRANSFORMERS_CACHE environment variable is set before importing the transformers module to ensure the cache directory takes effect.
- The model is loaded with trust_remote_code=True and device_map="auto" for automatic GPU placement across available devices.
- After loading, model.eval() and torch.no_grad() are called to disable dropout and gradient computation for inference efficiency.
- The HuggingFace pipeline with task="text-generation" wraps the model and tokenizer for convenient inference.
- Generation uses do_sample=True with top_k sampling and terminates at the EOS token.
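The ordering constraint in the first point above is easy to get wrong: transformers reads TRANSFORMERS_CACHE from the environment at import time, so the variable must be set first. A minimal sketch (the path is a placeholder; the commented-out lines indicate where the real model loading would follow):

```python
import os

# Must happen BEFORE `import transformers`, or the cache
# directory setting is silently ignored for this process.
os.environ["TRANSFORMERS_CACHE"] = "/path/to/hf_cache"

# Only now import the library and build the quantized pipeline, e.g.:
# import transformers
# model = transformers.AutoModelForCausalLM.from_pretrained(
#     "meta-llama/Llama-2-7b-chat-hf",
#     quantization_config=bnb_config,
#     device_map="auto",
#     trust_remote_code=True,
# )
```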
Usage Example
from graph_of_thoughts.language_models import Llama2HF
# Initialize with config for LLaMA-2 7B
lm = Llama2HF(config_path="config.json", model_name="llama7b-hf", cache=False)
# Single query
responses = lm.query("Sort this list: [3, 1, 2]")
texts = lm.get_response_texts(responses)
print(texts)  # e.g., ["[1, 2, 3]"] -- sampled output, not guaranteed verbatim
# Multiple responses
responses = lm.query("Sort this list: [3, 1, 2]", num_responses=3)
all_texts = lm.get_response_texts(responses)
print(len(all_texts)) # 3
Related Pages
- Principle:Spcl_Graph_of_thoughts_Local_LLM_Inference
- Environment:Spcl_Graph_of_thoughts_Python_3_8_Runtime
- Environment:Spcl_Graph_of_thoughts_Local_LLaMA_GPU_Inference
- Heuristic:Spcl_Graph_of_thoughts_Four_Bit_Quantization_For_Local_LLMs
GitHub URL
graph_of_thoughts/language_models/llamachat_hf.py (Lines 15-120)