
Principle:Spcl Graph of thoughts Local LLM Inference

From Leeroopedia
Knowledge Sources
Domains LLM_Integration, Quantization
Implementations Implementation:Spcl_Graph_of_thoughts_Llama2HF
Last Updated 2026-02-14

Overview

Integration pattern for running local language models via HuggingFace Transformers with 4-bit quantization for efficient inference.

The Graph of Thoughts framework supports local LLM inference as an alternative to cloud-based APIs. This principle describes the pattern for loading and running large language models locally using the HuggingFace Transformers library, with BitsAndBytes 4-bit quantization to reduce GPU memory requirements while maintaining acceptable output quality.

Core Concepts

BitsAndBytes 4-Bit Quantization (NF4)

Running large language models (e.g., LLaMA-2 7B, whose weights occupy ~14GB in FP16) on consumer GPUs requires aggressive memory reduction. The integration uses BitsAndBytes quantization with the following configuration:

  • load_in_4bit=True -- loads model weights in 4-bit precision instead of 16-bit or 32-bit
  • bnb_4bit_quant_type="nf4" -- uses Normal Float 4-bit (NF4) quantization, which is information-theoretically optimal for normally distributed weights
  • bnb_4bit_use_double_quant=True -- applies a second round of quantization to the quantization constants themselves, further reducing memory
  • bnb_4bit_compute_dtype=torch.bfloat16 -- performs actual computation in bfloat16 for numerical stability

This combination typically reduces a 7B parameter model from ~14GB to ~4GB of GPU memory, making it feasible to run on a single consumer GPU.
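The memory figures above can be checked with back-of-envelope arithmetic (pure Python, no GPU or libraries required; the helper function name is illustrative):

```python
# Back-of-envelope check of the ~14GB -> ~4GB reduction for a 7B model.

def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed for the model weights alone, in gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 7e9

fp16_gb = weight_memory_gb(n_params, 16)  # 14.0 GB
nf4_gb = weight_memory_gb(n_params, 4)    # 3.5 GB for the quantized weights

# NF4 also stores per-block quantization constants; double quantization
# compresses those constants, so the observed total lands near ~4GB once
# constants and any non-quantized layers are included.
print(f"FP16: {fp16_gb:.1f} GB, NF4 weights: {nf4_gb:.1f} GB")
```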

Local Model Loading

The model loading pattern follows these steps:

  1. Load the model configuration from HuggingFace Hub using AutoConfig.from_pretrained
  2. Construct a BitsAndBytesConfig with the 4-bit quantization settings
  3. Load the tokenizer using AutoTokenizer.from_pretrained
  4. Load the model using AutoModelForCausalLM.from_pretrained with the quantization config and device_map="auto" for automatic GPU placement
  5. Set the model to evaluation mode (model.eval()) and disable gradient computation (torch.no_grad())
  6. Wrap the model and tokenizer in a HuggingFace pipeline for convenient text generation

The TRANSFORMERS_CACHE environment variable is set before importing transformers to control where model weights are cached on disk.
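The steps above can be sketched as follows. This is a minimal sketch, not the framework's exact code: the model ID and cache path are illustrative, and running it requires a GPU plus the transformers, bitsandbytes, and accelerate packages and access to the model weights.

```python
import os

# Step 0: set the cache location BEFORE importing transformers.
os.environ["TRANSFORMERS_CACHE"] = "/path/to/cache"  # illustrative path

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)

model_id = "meta-llama/Llama-2-7b-chat-hf"  # illustrative model choice

# Step 2: 4-bit quantization settings described in Core Concepts.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Steps 3-4: tokenizer, then the quantized model with automatic placement.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Step 5: inference only; wrap generation calls in torch.no_grad().
model.eval()

# Step 6: convenience pipeline for text generation.
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
```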

LLaMA-2 Chat Template Formatting

LLaMA-2 chat models require a specific prompt format to produce coherent responses:

<s><<SYS>>You are a helpful assistant. Always follow the instructions precisely and output the response exactly in the requested format.<</SYS>>

[INST] {query} [/INST]

The template components are:

  • <s> -- beginning of sequence token
  • <<SYS>>...<</SYS>> -- system prompt block defining assistant behavior
  • [INST]...[/INST] -- instruction block wrapping the actual user query

Without this formatting, the model may produce incoherent or off-topic output.
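The template can be applied with a small helper. The system prompt is the one quoted above; the function name is illustrative:

```python
# System prompt as quoted in the template above.
SYSTEM_PROMPT = (
    "You are a helpful assistant. Always follow the instructions precisely "
    "and output the response exactly in the requested format."
)

def format_llama2_prompt(query: str) -> str:
    """Wrap a user query in the LLaMA-2 chat template shown above."""
    return f"<s><<SYS>>{SYSTEM_PROMPT}<</SYS>>\n\n[INST] {query} [/INST]"

prompt = format_llama2_prompt("List three prime numbers.")
print(prompt)
```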

Inference Configuration

Local inference uses different generation parameters than API-based models:

  • temperature -- controls output randomness (same concept as API models)
  • top_k -- limits sampling to the top K most probable tokens at each step
  • max_tokens -- maximum sequence length for generation
  • do_sample=True -- enables stochastic sampling (as opposed to greedy decoding)
  • num_return_sequences=1 -- generates one response per call (multiple responses achieved via looping)

Unlike API-based models that support the n parameter for batch responses, local inference generates one sequence at a time in a loop.
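The looping pattern can be sketched as below. A stub stands in for the HuggingFace pipeline so the control flow is runnable without a GPU; a real text-generation pipeline likewise returns a list of dicts with a "generated_text" key, and its output echoes the prompt, which is why the prefix is stripped.

```python
def stub_pipeline(prompt, **generate_kwargs):
    # Stand-in for transformers.pipeline("text-generation", ...):
    # echoes the prompt plus a canned completion.
    return [{"generated_text": prompt + " 2, 3, 5"}]

def generate_n(pipe, prompt, num_responses, **generate_kwargs):
    """Call the pipeline once per response and strip the prompt prefix."""
    responses = []
    for _ in range(num_responses):
        out = pipe(prompt, do_sample=True, num_return_sequences=1,
                   **generate_kwargs)
        text = out[0]["generated_text"]
        responses.append(text[len(prompt):].strip())  # drop the echoed prompt
    return responses

texts = generate_n(stub_pipeline, "[INST] List three primes. [/INST]", 3)
# three independent samples, one pipeline call each
```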

Interaction with the Framework

The local LLM integration implements the same AbstractLanguageModel interface as the API-based backends:

  1. query(query, num_responses) -- formats the query with the chat template, generates responses via the pipeline, strips the prompt prefix from outputs
  2. get_response_texts(query_responses) -- extracts the generated_text field from each response dictionary

This ensures that all framework operations (Generate, Score, Improve, etc.) work identically regardless of whether the backend is a cloud API or a local model.
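A minimal sketch of the two interface methods, with the backend stubbed out. The method names follow the text; the class name, template default, and stub are illustrative, and the real implementation subclasses AbstractLanguageModel:

```python
class LocalChatModel:
    """Illustrative stand-in for the local-LLM backend described above."""

    def __init__(self, pipe, template="<s><<SYS>>Be helpful.<</SYS>>\n\n[INST] {query} [/INST]"):
        self.pipe = pipe              # a text-generation callable
        self.prompt_template = template

    def query(self, query, num_responses=1):
        prompt = self.prompt_template.format(query=query)
        results = []
        for _ in range(num_responses):
            out = self.pipe(prompt)[0]
            # Strip the echoed prompt prefix from the generated text.
            out["generated_text"] = out["generated_text"][len(prompt):]
            results.append(out)
        return results

    def get_response_texts(self, query_responses):
        # Each response dict holds the generated text under "generated_text".
        return [r["generated_text"] for r in query_responses]

# Stub backend standing in for a HuggingFace pipeline.
model = LocalChatModel(lambda p: [{"generated_text": p + "OK"}])
texts = model.get_response_texts(model.query("ping", num_responses=2))
```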

Design Rationale

Local inference with 4-bit quantization addresses two practical concerns:

  • Cost -- local inference has zero marginal API cost, making it suitable for large-scale experiments and development iteration
  • Privacy -- sensitive data never leaves the local machine

The tradeoff is that local models may produce lower-quality output compared to larger API-based models, and inference speed depends on available GPU hardware.

Related Pages

GitHub URL

graph_of_thoughts/language_models/llamachat_hf.py
