Principle:Spcl Graph of thoughts Local LLM Inference
| Knowledge Sources | |
|---|---|
| Domains | LLM_Integration, Quantization |
| Implementations | Implementation:Spcl_Graph_of_thoughts_Llama2HF |
| Last Updated | 2026-02-14 |
Overview
Integration pattern for running local language models via HuggingFace Transformers with 4-bit quantization for efficient inference.
The Graph of Thoughts framework supports local LLM inference as an alternative to cloud-based APIs. This principle describes the pattern for loading and running large language models locally using the HuggingFace Transformers library, with BitsAndBytes 4-bit quantization to reduce GPU memory requirements while maintaining acceptable output quality.
Core Concepts
BitsAndBytes 4-Bit Quantization (NF4)
Running large language models (e.g., LLaMA-2 7B with ~14GB in FP16) on consumer GPUs requires aggressive memory reduction. The integration uses BitsAndBytes quantization with the following configuration:
- load_in_4bit=True -- loads model weights in 4-bit precision instead of 16-bit or 32-bit
- bnb_4bit_quant_type="nf4" -- uses Normal Float 4-bit (NF4) quantization, which is information-theoretically optimal for normally distributed weights
- bnb_4bit_use_double_quant=True -- applies a second round of quantization to the quantization constants themselves, further reducing memory
- bnb_4bit_compute_dtype=torch.bfloat16 -- performs actual computation in bfloat16 for numerical stability
This combination typically reduces a 7B parameter model from ~14GB to ~4GB of GPU memory, making it feasible to run on a single consumer GPU.
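The four settings above map directly onto a `BitsAndBytesConfig` object. A minimal sketch, assuming the `transformers` and `bitsandbytes` packages are installed (actually loading a model with it additionally requires a CUDA-capable GPU):

```python
import torch
from transformers import BitsAndBytesConfig

# NF4 quantization with double quantization of the quantization
# constants, computing in bfloat16 -- the configuration described above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```

This object is later passed as `quantization_config` to `AutoModelForCausalLM.from_pretrained`.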
Local Model Loading
The model loading pattern follows these steps:
- Load the model configuration from HuggingFace Hub using `AutoConfig.from_pretrained`
- Construct a `BitsAndBytesConfig` with the 4-bit quantization settings
- Load the tokenizer using `AutoTokenizer.from_pretrained`
- Load the model using `AutoModelForCausalLM.from_pretrained` with the quantization config and `device_map="auto"` for automatic GPU placement
- Set the model to evaluation mode (`model.eval()`) and disable gradient computation (`torch.no_grad()`)
- Wrap the model and tokenizer in a HuggingFace `pipeline` for convenient text generation
The TRANSFORMERS_CACHE environment variable is set before importing transformers to control where model weights are cached on disk.
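The loading steps can be sketched end to end as follows. This is an illustrative sketch, not the framework's exact code: the model id `meta-llama/Llama-2-7b-chat-hf` and the cache path are assumptions, and running it requires GPU hardware plus access to the model weights.

```python
import os

# Set the cache location BEFORE transformers is imported, as noted above.
os.environ["TRANSFORMERS_CACHE"] = "/path/to/model/cache"  # hypothetical path

import torch
from transformers import (
    AutoConfig,
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed model id

config = AutoConfig.from_pretrained(model_id)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # automatic GPU placement
)
model.eval()  # inference calls are then wrapped in torch.no_grad()

generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
```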
LLaMA-2 Chat Template Formatting
LLaMA-2 chat models require a specific prompt format to produce coherent responses:
```
<s><<SYS>>You are a helpful assistant. Always follow the instructions precisely and output the response exactly in the requested format.<</SYS>>
[INST] {query} [/INST]
```
The template components are:
- `<s>` -- beginning of sequence token
- `<<SYS>>...<</SYS>>` -- system prompt block defining assistant behavior
- `[INST]...[/INST]` -- instruction block wrapping the actual user query
Without this formatting, the model may produce incoherent or off-topic output.
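Applying the template is plain string formatting. A minimal sketch (the function name is illustrative, not from the implementation):

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. Always follow the instructions "
    "precisely and output the response exactly in the requested format."
)


def format_llama2_prompt(query: str) -> str:
    """Wrap a user query in the LLaMA-2 chat template shown above."""
    return f"<s><<SYS>>{SYSTEM_PROMPT}<</SYS>>\n[INST] {query} [/INST]"


prompt = format_llama2_prompt("List three prime numbers.")
```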
Inference Configuration
Local inference uses different generation parameters than API-based models:
- temperature -- controls output randomness (same concept as API models)
- top_k -- limits sampling to the top K most probable tokens at each step
- max_tokens -- maximum sequence length for generation
- do_sample=True -- enables stochastic sampling (as opposed to greedy decoding)
- num_return_sequences=1 -- generates one response per call (multiple responses achieved via looping)
Unlike API-based models that support the n parameter for batch responses, local inference generates one sequence at a time in a loop.
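The loop, together with stripping the echoed prompt from each output, can be sketched as a standalone helper. Here `generate_fn` is a hypothetical stand-in for a call to the HuggingFace pipeline with `do_sample=True` and `num_return_sequences=1`:

```python
from typing import Callable, List


def generate_responses(
    generate_fn: Callable[[str], str],
    prompt: str,
    num_responses: int,
) -> List[str]:
    """Collect num_responses completions one at a time.

    Local inference has no batch `n` parameter, so multiple samples
    come from repeated single-sequence generation calls.
    """
    responses = []
    for _ in range(num_responses):
        full_text = generate_fn(prompt)
        # Text-generation pipelines return the prompt followed by the
        # completion, so strip the prompt prefix from the output.
        if full_text.startswith(prompt):
            full_text = full_text[len(prompt):]
        responses.append(full_text.strip())
    return responses
```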
Interaction with the Framework
The local LLM integration implements the same AbstractLanguageModel interface as the API-based backends:
- `query(query, num_responses)` -- formats the query with the chat template, generates responses via the pipeline, and strips the prompt prefix from each output
- `get_response_texts(query_responses)` -- extracts the `generated_text` field from each response dictionary
This ensures that all framework operations (Generate, Score, Improve, etc.) work identically regardless of whether the backend is a cloud API or a local model.
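The extraction side of the interface can be sketched as a pure function. The nested-list layout (one list of `{"generated_text": ...}` dicts per query) follows the Transformers text-generation pipeline output format; the exact signature in the implementation may differ:

```python
from typing import Dict, List


def get_response_texts(query_responses: List[List[Dict[str, str]]]) -> List[str]:
    """Flatten pipeline outputs into plain strings by pulling the
    generated_text field out of each response dictionary."""
    return [
        item["generated_text"]
        for response in query_responses
        for item in response
    ]
```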
Design Rationale
Local inference with 4-bit quantization addresses two practical concerns:
- Cost -- local inference has zero marginal API cost, making it suitable for large-scale experiments and development iteration
- Privacy -- sensitive data never leaves the local machine
The tradeoff is that local models may produce lower-quality output compared to larger API-based models, and inference speed depends on available GPU hardware.
Related Pages
- Implementation:Spcl_Graph_of_thoughts_Llama2HF
- Heuristic:Spcl_Graph_of_thoughts_Four_Bit_Quantization_For_Local_LLMs