Workflow:LLMBook zh LLMBook zh github io Inference and Quantization

From Leeroopedia
Revision as of 11:02, 16 February 2026 by Admin (talk | contribs) (Auto-imported from workflows/LLMBook_zh_LLMBook_zh_github_io_Inference_and_Quantization.md)


Knowledge Sources
Domains: LLMs, Inference, Model_Compression
Last Updated: 2026-02-08 04:30 GMT

Overview

End-to-end inference and model compression workflow covering high-throughput serving with vLLM, basic quantization principles, and practical weight quantization using bitsandbytes and GPTQ methods.

Description

This workflow addresses the deployment phase of the LLM lifecycle, focusing on efficient inference and model compression. It covers four approaches: (1) high-throughput batch inference using the vLLM engine with PagedAttention and continuous batching, (2) basic 8-bit quantization from first principles, showing the mathematical mapping between floating-point and integer representations through a scale factor and a zero-point, (3) practical 8-bit and 4-bit weight quantization using the bitsandbytes library integrated with HuggingFace, and (4) GPTQ post-training quantization with calibration data for accuracy-preserving 4-bit compression. Together, these techniques enable deployment of large models on resource-constrained hardware while maintaining acceptable generation quality.

Usage

Execute this workflow when you have a trained language model and need to deploy it for inference, especially when GPU memory is limited or high throughput is required. Use vLLM for production serving scenarios requiring maximum throughput. Use quantization when the model does not fit in available GPU memory at full precision, or when you want to reduce serving costs by fitting the model on fewer or smaller GPUs.

Execution Steps

Step 1: High Throughput Inference with vLLM

Set up the vLLM inference engine for efficient batch generation. Initialize the vLLM LLM class with the target model, then configure sampling parameters including temperature, maximum new tokens, and penalty coefficients. Submit multiple prompts simultaneously for batch processing. vLLM automatically handles PagedAttention for efficient KV-cache memory management and continuous batching for maximum GPU utilization.

Key considerations:

  • Temperature of 0 enables deterministic greedy decoding
  • Prompts should follow the model's expected chat format (e.g., LLaMA-2 uses [INST] markers)
  • vLLM handles dynamic batching and memory management automatically
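  • The presence and frequency penalty parameters control repetition in generated text: the presence penalty applies to any token that has already appeared, while the frequency penalty grows with how often the token has appeared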
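
The batch-generation setup described above might look roughly like the following sketch. The model name `meta-llama/Llama-2-7b-chat-hf`, the numeric parameter values, and the helper names `build_llama2_prompt` and `run_batch` are illustrative assumptions, not taken from the source. The vLLM import is kept inside the function so that the prompt helper can be used on a machine without vLLM installed.

```python
def build_llama2_prompt(user_message: str) -> str:
    """Wrap a single user turn in LLaMA-2 chat [INST] markers (no system prompt)."""
    return f"[INST] {user_message.strip()} [/INST]"


def run_batch(questions, model_name="meta-llama/Llama-2-7b-chat-hf"):
    """Generate one completion per question in a single vLLM batch.

    model_name is an assumed example; substitute the model you deploy.
    """
    # Imported lazily: vLLM needs a supported GPU environment.
    from vllm import LLM, SamplingParams

    sampling_params = SamplingParams(
        temperature=0.0,        # greedy decoding
        max_tokens=256,         # cap on newly generated tokens (assumed value)
        presence_penalty=0.0,   # assumed value; > 0 discourages reusing tokens
        frequency_penalty=0.0,  # assumed value; > 0 discourages frequent tokens
    )
    llm = LLM(model=model_name)
    prompts = [build_llama2_prompt(q) for q in questions]
    # All prompts are submitted together; vLLM schedules them internally.
    outputs = llm.generate(prompts, sampling_params)
    return [out.outputs[0].text for out in outputs]
```

Calling `run_batch(["What is PagedAttention?", "Explain continuous batching."])` would return one generated string per question, in the same order as the input.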

Step 2: Basic Quantization Principles

Understand the fundamental quantization and dequantization operations. Quantization maps floating-point values to a lower-bit integer representation using a scale factor S and zero-point Z. The scale is computed from the input range and the target integer range. Quantized values are clipped to the valid integer range to prevent overflow. Dequantization reverses the mapping to recover approximate floating-point values, with some precision loss due to rounding.
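
In symbols, one common way to write these two operations is shown below. Here q_min and q_max denote the bounds of the target integer range and x̂ the recovered value; this notation is introduced for illustration, since the source names only S and Z.

```latex
q = \operatorname{clip}\!\left(\operatorname{round}\!\left(\frac{x}{S}\right) + Z,\; q_{\min},\; q_{\max}\right),
\qquad
\hat{x} = S\,(q - Z)
```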

Key considerations:

  • The scale S = (max - min) / (2^bits - 1) maps the float range to the integer range
  • The zero-point Z is the integer that represents the real value 0, so zero is recovered exactly after dequantization
  • Clipping prevents out-of-range values from wrapping around
  • For values inside the representable range, the quantization error is bounded by half the scale factor; clipped values can have a larger error
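
A minimal sketch of these operations follows, written in plain Python so it runs without dependencies. It assumes an unsigned integer range [0, 2^bits - 1] and widens the float range to include 0 so that zero is exactly representable; the function names `compute_qparams`, `quantize`, and `dequantize` are hypothetical and are not taken from the source.

```python
def compute_qparams(x_min: float, x_max: float, bits: int = 8):
    """Return (scale, zero_point) for unsigned affine quantization."""
    q_min, q_max = 0, 2**bits - 1
    # Assumption: widen the range to contain 0.0 so zero maps to an integer.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (q_max - q_min)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero input
    zero_point = round(q_min - x_min / scale)
    zero_point = max(q_min, min(q_max, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, bits: int = 8):
    """Map floats to integers, clipping to the valid integer range."""
    q_min, q_max = 0, 2**bits - 1
    return [max(q_min, min(q_max, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Recover approximate floats from quantized integers."""
    return [scale * (q - zero_point) for q in q_values]


if __name__ == "__main__":
    weights = [-1.0, -0.5, 0.0, 0.3, 2.0]
    scale, zero_point = compute_qparams(min(weights), max(weights))
    q = quantize(weights, scale, zero_point)
    recovered = dequantize(q, scale, zero_point)
    for w, qi, r in zip(weights, q, recovered):
        print(f"{w:+.4f} -> {qi:3d} -> {r:+.4f} (error {abs(w - r):.5f})")
```

Each in-range value comes back within half a scale step of its original, while a value outside the calibrated range is clipped to the nearest integer bound and loses more.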

Step 3: Bitsandbytes Quantization

Apply practical weight quantization using the bitsandbytes library through HuggingFace's integration. Load a pre-trained model with either 8-bit or 4-bit quantization enabled via a single flag. The library handles blockwise quantization, dynamic range computation, and mixed-precision storage automatically. Weight memory shrinks roughly in proportion to the bit width: relative to 16-bit weights, 8-bit storage needs about half the memory and 4-bit storage about a quarter.

Key considerations:

  • 8-bit quantization (load_in_8bit) typically reduces weight memory by approximately 50% relative to 16-bit weights, with minimal quality loss
  • 4-bit quantization (load_in_4bit) reduces weight memory by approximately 75% relative to 16-bit weights, with moderate quality impact
  • device_map="auto" distributes model layers across available GPUs automatically
  • No calibration data is required; quantization is applied directly to the weights
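
The memory arithmetic and the loading call can be sketched as follows. The helper names `estimate_weight_gib` and `load_quantized` are hypothetical, and the estimate covers weight storage only (no quantization metadata, activations, or KV cache). The sketch passes the flag through `BitsAndBytesConfig`, which is an assumption about how the flag is supplied rather than something the source specifies.

```python
def estimate_weight_gib(num_params: float, bits: int) -> float:
    """Approximate weight storage in GiB for a given parameter count and bit width."""
    return num_params * bits / 8 / 2**30


def load_quantized(model_name: str, bits: int = 8):
    """Load a causal LM with bitsandbytes 8-bit or 4-bit weights."""
    if bits not in (4, 8):
        raise ValueError("this sketch supports only 4-bit or 8-bit loading")
    # Imported lazily: requires transformers, accelerate, bitsandbytes, and a GPU.
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    config = BitsAndBytesConfig(
        load_in_8bit=(bits == 8),
        load_in_4bit=(bits == 4),
    )
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=config,
        device_map="auto",  # place layers on the available devices automatically
    )


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"7B parameters at {bits}-bit: {estimate_weight_gib(7e9, bits):.2f} GiB")
```

After loading, `model.get_memory_footprint()` reports the actual footprint in bytes, which can be compared against the estimate.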

Step 4: GPTQ Quantization

Apply GPTQ post-training quantization, which uses calibration data to minimize the quantization error layer by layer. Configure the GPTQConfig with the target bit width, calibration dataset, and tokenizer. Load the model with the quantization config applied. For each layer, GPTQ approximately solves a reconstruction problem: it chooses quantized weights that minimize the difference between the layer's original and quantized outputs on the calibration set. This produces higher-quality 4-bit models than naive round-to-nearest quantization.
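
The layer-wise objective described above is commonly written as follows, where W is the layer's original weight matrix, X the layer inputs collected from the calibration data, and Ŵ the quantized weights. The symbols are notation introduced here for illustration.

```latex
\hat{W}^{*} = \arg\min_{\hat{W}} \; \left\lVert W X - \hat{W} X \right\rVert_2^2
```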

Key considerations:

  • GPTQ requires a calibration dataset (e.g., "c4") to compute optimal quantization parameters
  • The tokenizer must be provided for processing the calibration data
  • GPTQ generally produces better quality at 4-bit than simple round-to-nearest methods
  • The quantization process is one-time; the resulting model can be saved and reloaded
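
A hedged sketch of this configure, quantize, and save flow using the HuggingFace integration is given below. The function name `quantize_with_gptq` and its `save_dir` argument are hypothetical, and the sketch assumes that `transformers`, `optimum`, and a GPTQ backend are installed alongside a GPU.

```python
SUPPORTED_BITS = (2, 3, 4, 8)  # bit widths accepted by transformers' GPTQConfig


def quantize_with_gptq(model_id: str, save_dir: str, bits: int = 4, dataset: str = "c4"):
    """Quantize a causal LM with GPTQ once and save the result for later reuse."""
    if bits not in SUPPORTED_BITS:
        raise ValueError(f"bits must be one of {SUPPORTED_BITS}, got {bits}")
    # Imported lazily: the heavy dependencies are only needed when quantizing.
    from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

    # The tokenizer is needed to tokenize the calibration dataset.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    gptq_config = GPTQConfig(bits=bits, dataset=dataset, tokenizer=tokenizer)

    # Quantization runs during loading and can take a long time for large models.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=gptq_config,
        device_map="auto",
    )

    # One-time cost: persist the quantized weights and the tokenizer.
    model.save_pretrained(save_dir)
    tokenizer.save_pretrained(save_dir)
    return model
```

A later session can then call `AutoModelForCausalLM.from_pretrained(save_dir, device_map="auto")` to reload the saved model without repeating the calibration pass.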
