Overview
IpexLLMHandler is an Intel IPEX-optimized TorchServe handler for serving large language models with INT8 quantization. It extends BaseHandler and supports Weight-Only Quantization (WoQ) and SmoothQuant (SQ) techniques via intel_extension_for_pytorch. The handler supports diverse model architectures including T5, Qwen, ChatGLM, and MPT, and includes an inner Evaluator class for model quality assessment.
Description
The IpexLLMHandler class (lines 48-659) provides a specialized handler for running LLMs on Intel CPUs with IPEX quantization optimizations. It handles model loading with quantization configuration, tokenization for multiple model architectures, and text generation inference.
Key Responsibilities
- IPEX Quantization: Configures and applies Weight-Only Quantization (WoQ) or SmoothQuant (SQ) via intel_extension_for_pytorch
- Multi-Architecture Support: Handles diverse model architectures (T5, Qwen, ChatGLM, MPT, LLaMA, GPT-NeoX, OPT, Bloom, Falcon) with architecture-specific tokenization and generation logic
- Model Initialization: Loads HuggingFace models and applies IPEX optimization passes with quantization during initialize()
- Evaluation: Includes an inner Evaluator class (lines 232-409) for assessing model quality post-quantization using perplexity and accuracy metrics
- Text Generation: Implements the preprocess/inference/postprocess pipeline for text generation tasks
Inner Class: Evaluator
The Evaluator class (lines 232-409) provides methods for evaluating quantized model quality. It measures perplexity on standard benchmarks and compares quantized model outputs against baseline reference outputs.
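The sliding-window perplexity measurement mentioned above can be sketched in plain Python. This is a minimal illustration, not the handler's actual code: the helper names `sliding_windows` and `perplexity` are hypothetical, and a uniform toy "model" over an assumed vocabulary of 50 tokens stands in for real per-token log-probabilities.

```python
import math


def sliding_windows(n_tokens, max_length=512, stride=256):
    """Yield (begin, end, n_scored) windows that cover n_tokens positions.

    Each window spans at most max_length tokens. Only the last n_scored
    tokens of a window are scored; the earlier ones serve as context,
    so no token is counted twice.
    """
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + max_length, n_tokens)
        yield begin, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break


def perplexity(token_log_probs):
    """Perplexity is exp of the mean negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# Toy stand-in for a model: every token gets probability 1 / vocab_size.
vocab_size = 50
n_tokens = 1000
log_probs = []
for begin, end, n_scored in sliding_windows(n_tokens):
    log_probs.extend([math.log(1.0 / vocab_size)] * n_scored)

ppl = perplexity(log_probs)
print(f"tokens scored: {len(log_probs)}, perplexity: {ppl:.2f}")
```

A uniform model is a convenient sanity check because its perplexity equals the vocabulary size; a real evaluator would replace the toy log-probabilities with the model's outputs for each window.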
Usage
# The handler is configured in model-config.yaml:
# handler:
#     model_name: "meta-llama/Llama-2-7b-hf"
#     quantization: "woq"  # or "sq"
#     dtype: "int8"
#     batch_size: 1

# Creating a model archive with the IPEX LLM handler
# (torch-model-archiver requires --version)
torch-model-archiver --model-name ipex_llm \
    --version 1.0 \
    --handler examples/large_models/ipex_llm_int8/llm_handler.py \
    --config-file model-config.yaml \
    --archive-format no-archive
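As a rough sketch of how the `handler` section of model-config.yaml could be read and validated during initialize(), the snippet below works on the already-parsed YAML dict (the document notes that the TorchServe context carries `model_yaml_config`). The `HandlerSettings` dataclass, the `parse_handler_settings` helper, and the default values are assumptions made for illustration, not part of the handler.

```python
from dataclasses import dataclass

# Quantization keys listed in this document.
VALID_QUANTIZATION = {"woq", "sq"}


@dataclass
class HandlerSettings:
    """Hypothetical container for the handler section of model-config.yaml."""
    model_name: str
    quantization: str = "woq"
    dtype: str = "int8"
    batch_size: int = 1


def parse_handler_settings(model_yaml_config):
    """Build validated settings from the parsed model-config.yaml dict."""
    section = model_yaml_config.get("handler", {})
    if "model_name" not in section:
        raise ValueError("handler.model_name is required")
    settings = HandlerSettings(
        model_name=section["model_name"],
        quantization=str(section.get("quantization", "woq")).lower(),
        dtype=str(section.get("dtype", "int8")).lower(),
        batch_size=int(section.get("batch_size", 1)),
    )
    if settings.quantization not in VALID_QUANTIZATION:
        raise ValueError(f"unsupported quantization: {settings.quantization!r}")
    return settings


# Mirrors the configuration shown above.
example_config = {
    "handler": {
        "model_name": "meta-llama/Llama-2-7b-hf",
        "quantization": "woq",
        "dtype": "int8",
        "batch_size": 1,
    }
}
settings = parse_handler_settings(example_config)
print(settings)
```

Failing fast on an unknown quantization key keeps a misconfigured worker from loading a multi-gigabyte model before the error surfaces.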
Code Reference
Source Location
| File | Lines | Repository / Scope |
|---|---|---|
| examples/large_models/ipex_llm_int8/llm_handler.py | L1-659 | pytorch/serve |
| examples/large_models/ipex_llm_int8/llm_handler.py | L48-659 | IpexLLMHandler class definition |
| examples/large_models/ipex_llm_int8/llm_handler.py | L232-409 | Evaluator inner class |
Signature
class IpexLLMHandler(BaseHandler):
    """
    IPEX-optimized handler for serving LLMs with INT8 quantization.
    Supports Weight-Only Quantization (WoQ) and SmoothQuant (SQ) techniques
    via intel_extension_for_pytorch. Handles multiple model architectures
    including T5, Qwen, ChatGLM, MPT, LLaMA, and others.
    Attributes:
        model: The quantized and optimized LLM.
        tokenizer: HuggingFace tokenizer instance.
        device (torch.device): Target compute device (typically CPU for IPEX).
        quantization_type (str): 'woq' or 'sq' quantization method.
        dtype (str): Target data type (e.g., 'int8').
    """

    class Evaluator:
        """
        Inner class for evaluating quantized model quality.
        Measures perplexity and accuracy on standard benchmarks to
        assess the impact of quantization on model outputs.
        Attributes:
            model: The quantized model to evaluate.
            tokenizer: Tokenizer for preparing evaluation inputs.
            dataset: Evaluation dataset (e.g., WikiText-2).
        """

        def __init__(self, model, tokenizer, dataset_name="wikitext"):
            """
            Initialize the evaluator.
            Args:
                model: The quantized model.
                tokenizer: Tokenizer matching the model.
                dataset_name (str): Name of the evaluation dataset.
            """
            ...

        def evaluate(self, max_length=512, stride=256):
            """
            Run evaluation and compute perplexity.
            Args:
                max_length (int): Maximum sequence length.
                stride (int): Stride for sliding window evaluation.
            Returns:
                dict: Evaluation metrics including perplexity.
            """
            ...

    def initialize(self, context):
        """
        Load model with IPEX quantization.
        Loads HuggingFace model, applies IPEX optimization with WoQ or SQ
        quantization, and initializes the tokenizer.
        Args:
            context: TorchServe context with system_properties and model_yaml_config.
        """
        ...

    def preprocess(self, data):
        """
        Tokenize input text for the specific model architecture.
        Handles architecture-specific tokenization patterns for T5, Qwen,
        ChatGLM, MPT, and other supported models.
        Args:
            data (list): List of request dicts with text input.
        Returns:
            dict: Tokenized inputs with input_ids and attention_mask.
        """
        ...

    def inference(self, data, *args, **kwargs):
        """
        Generate text using the IPEX-optimized model.
        Args:
            data (dict): Tokenized inputs from preprocess().
        Returns:
            torch.Tensor: Generated token IDs.
        """
        ...

    def postprocess(self, data):
        """
        Decode generated tokens to text.
        Args:
            data (torch.Tensor): Generated token IDs.
        Returns:
            list[str]: Decoded text strings.
        """
        ...
Import
# Handler is loaded by TorchServe from the model archive.
# Internal imports used by the handler:
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer
from ts.torch_handler.base_handler import BaseHandler
I/O Contract
| Method | Input | Output | Notes |
|---|---|---|---|
| initialize(context) | context: Context with system_properties, model_yaml_config | None (sets self.model, self.tokenizer) | Applies IPEX WoQ or SQ quantization |
| preprocess(data) | data: list[dict] with text in "data" or "body" key | dict with input_ids and attention_mask tensors | Architecture-specific tokenization |
| inference(data) | data: dict with tokenized inputs | torch.Tensor of generated token IDs | Uses IPEX-optimized model forward pass |
| postprocess(data) | data: torch.Tensor of token IDs | list[str] decoded text strings | Tokenizer decode |
| Evaluator.__init__(model, tokenizer, dataset_name) | Model, tokenizer, dataset name | Evaluator instance | Lines 232-250 |
| Evaluator.evaluate(max_length, stride) | max_length (int), stride (int) | dict with perplexity and accuracy metrics | Sliding window evaluation |
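The data flow in the table above can be traced end to end with toy stand-ins. This is only a sketch of the contract's shape: `ToyTokenizer`, `extract_text`, and the "reverse the prompt" generation step are hypothetical placeholders for the HuggingFace tokenizer and the IPEX-optimized model, and plain lists are used instead of tensors so the example runs without torch.

```python
class ToyTokenizer:
    """Whitespace tokenizer standing in for a HuggingFace tokenizer."""

    def __init__(self):
        self.vocab = {}
        self.inverse = {}

    def encode(self, text):
        ids = []
        for word in text.split():
            if word not in self.vocab:
                idx = len(self.vocab)
                self.vocab[word] = idx
                self.inverse[idx] = word
            ids.append(self.vocab[word])
        return ids

    def decode(self, ids):
        return " ".join(self.inverse[i] for i in ids)


def extract_text(row):
    """Read the request text from the "data" key, falling back to "body"."""
    payload = row.get("data") or row.get("body")
    if isinstance(payload, (bytes, bytearray)):
        payload = payload.decode("utf-8")
    return payload


def preprocess(requests, tokenizer):
    """list[dict] -> dict with input_ids and attention_mask (no padding here)."""
    input_ids = [tokenizer.encode(extract_text(r)) for r in requests]
    attention_mask = [[1] * len(ids) for ids in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def inference(batch):
    """Placeholder for model generation: returns each prompt reversed."""
    return [list(reversed(ids)) for ids in batch["input_ids"]]


def postprocess(generated, tokenizer):
    """Token IDs -> one decoded string per request."""
    return [tokenizer.decode(ids) for ids in generated]


tokenizer = ToyTokenizer()
requests = [{"data": "hello quantized world"}, {"body": b"int8 on cpu"}]
outputs = postprocess(inference(preprocess(requests, tokenizer)), tokenizer)
print(outputs)
```

Note that postprocess returns exactly one entry per incoming request; TorchServe expects the response list to have the same length as the request batch.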
Supported Model Architectures
| Architecture | Model Examples | Special Handling |
|---|---|---|
| T5 | google/flan-t5-xl, google/flan-t5-xxl | Encoder-decoder; uses AutoModelForSeq2SeqLM |
| Qwen | Qwen/Qwen-7B-Chat, Qwen/Qwen-14B | Custom chat template tokenization |
| ChatGLM | THUDM/chatglm2-6b, THUDM/chatglm3-6b | Custom tokenization with build_chat_input() |
| MPT | mosaicml/mpt-7b, mosaicml/mpt-30b | Custom attention configuration |
| LLaMA | meta-llama/Llama-2-7b-hf | Standard causal LM tokenization |
| GPT-NeoX | EleutherAI/gpt-neox-20b | Standard causal LM tokenization |
| OPT | facebook/opt-6.7b, facebook/opt-30b | Standard causal LM tokenization |
| Bloom | bigscience/bloom-7b1 | Standard causal LM tokenization |
| Falcon | tiiuae/falcon-7b, tiiuae/falcon-40b | Standard causal LM tokenization |
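One way to express the table above in code is a small lookup that maps a model name to its special handling. The sketch below is an assumption about structure only: the `special_handling` helper and the substring matching on the model name are hypothetical, and a real handler would more likely inspect the model's configuration than its name.

```python
# (marker, handling) pairs taken from the table; checked in order.
SPECIAL_HANDLING = [
    ("t5", "encoder-decoder; uses AutoModelForSeq2SeqLM"),
    ("qwen", "custom chat template tokenization"),
    ("chatglm", "custom tokenization with build_chat_input()"),
    ("mpt", "custom attention configuration"),
]
DEFAULT_HANDLING = "standard causal LM tokenization"


def special_handling(model_name):
    """Return the handling note for a model, defaulting to causal LM."""
    name = model_name.lower()
    for marker, handling in SPECIAL_HANDLING:
        if marker in name:
            return handling
    return DEFAULT_HANDLING


for model in ("google/flan-t5-xl", "THUDM/chatglm3-6b", "tiiuae/falcon-7b"):
    print(f"{model}: {special_handling(model)}")
```

Keeping the special cases in a data table means that adding an architecture is a one-line change, and everything unlisted falls through to the standard causal LM path.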
Quantization Methods
| Method | Key | Description |
|---|---|---|
| Weight-Only Quantization (WoQ) | "woq" | Quantizes only model weights to INT8; activations remain in FP32. Lower memory footprint with minimal accuracy loss. |
| SmoothQuant (SQ) | "sq" | Applies channel-wise smoothing to activations before quantization. Both weights and activations are quantized for maximum throughput. |
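To make the weight-only idea concrete, the following sketch quantizes each weight row to INT8 with a symmetric per-row scale and dequantizes on the fly while the activations stay in floating point. It is a generic textbook scheme written in plain Python for illustration; it does not reproduce the kernels or the API of intel_extension_for_pytorch, and the helper names are hypothetical.

```python
def quantize_row_int8(row):
    """Symmetric INT8 quantization of one weight row; returns (q, scale)."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def woq_linear(x, q_weights, scales):
    """Compute y = W x with W dequantized per row; x is never quantized."""
    out = []
    for q_row, scale in zip(q_weights, scales):
        out.append(sum(xi * (qi * scale) for xi, qi in zip(x, q_row)))
    return out


# Two output channels, three input features (made-up values).
weights = [[0.50, -1.27, 0.03], [0.10, 0.20, -0.40]]
quantized = [quantize_row_int8(row) for row in weights]
q_weights = [q for q, _ in quantized]
scales = [s for _, s in quantized]

x = [1.0, 2.0, 3.0]
y_quant = woq_linear(x, q_weights, scales)
y_ref = [sum(xi * wi for xi, wi in zip(x, row)) for row in weights]
print("quantized:", y_quant)
print("reference:", y_ref)
```

Each INT8 weight occupies one byte instead of four for FP32, which is where the memory saving comes from; the per-row scale bounds the rounding error of every weight to half a quantization step.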
Usage Examples
Example 1: Serving a Llama-2 model with WoQ quantization
# model-config.yaml
minWorkers: 1
maxWorkers: 1
handler:
    model_name: "meta-llama/Llama-2-7b-hf"
    quantization: "woq"
    dtype: "int8"
    max_new_tokens: 256
    batch_size: 1
# Start TorchServe with the IPEX-optimized model.
# An archive built with --archive-format no-archive is a directory
# (model_store/ipex_llm), so it is registered by name rather than as a .mar file.
torchserve --start --ncs --model-store model_store \
    --models ipex_llm
# Send an inference request
curl -X POST http://localhost:8080/predictions/ipex_llm \
    -H "Content-Type: application/json" \
    -d '{"data": "Explain the benefits of model quantization:"}'
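The same request can be issued from Python with only the standard library. This is a minimal sketch that assumes the endpoint and payload shape shown in the curl command; `build_request` and `generate` are hypothetical helper names, and the 120-second timeout is an arbitrary choice.

```python
import json
import urllib.request


def build_request(prompt, host="http://localhost:8080", model="ipex_llm"):
    """Build a POST request for the TorchServe predictions endpoint."""
    body = json.dumps({"data": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/predictions/{model}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, timeout=120):
    """Send the prompt and return the response body as text.

    Requires a running TorchServe instance with the model registered.
    """
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Building the request does not touch the network.
request = build_request("Explain the benefits of model quantization:")
print(request.get_method(), request.full_url)
```

Calling `generate(...)` performs the actual round trip; separating request construction from sending makes the payload easy to inspect or unit test without a server.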
Example 2: SmoothQuant configuration for maximum throughput
# model-config.yaml for SmoothQuant
minWorkers: 1
maxWorkers: 1
handler:
    model_name: "meta-llama/Llama-2-13b-hf"
    quantization: "sq"
    dtype: "int8"
    max_new_tokens: 512
    batch_size: 4
    smooth_quant_alpha: 0.5
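The role of the smoothing strength can be illustrated with the per-channel scale from the SmoothQuant method, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha). The sketch below assumes that `smooth_quant_alpha` corresponds to this alpha, uses a single made-up activation row in place of calibration statistics, and introduces hypothetical helper names; it shows only the smoothing step, not the subsequent INT8 quantization.

```python
def smoothing_scales(act_absmax, weight_absmax, alpha=0.5):
    """Per-input-channel scales: s_j = a_j**alpha / w_j**(1 - alpha)."""
    return [
        (a ** alpha) / (w ** (1.0 - alpha))
        for a, w in zip(act_absmax, weight_absmax)
    ]


def matvec(vec, mat):
    """Compute vec @ mat, with mat stored as mat[in_channel][out_channel]."""
    n_out = len(mat[0])
    return [sum(vec[i] * mat[i][j] for i in range(len(vec))) for j in range(n_out)]


# One activation row with an outlier in channel 1 (made-up values).
x = [0.5, 40.0, -1.0]
weight = [[0.2, -0.1], [0.05, 0.02], [-0.3, 0.4]]

act_absmax = [abs(v) for v in x]
weight_absmax = [max(abs(w) for w in row) for row in weight]
scales = smoothing_scales(act_absmax, weight_absmax, alpha=0.5)

# Divide activations by s and multiply the matching weight rows by s.
x_smooth = [v / s for v, s in zip(x, scales)]
w_smooth = [[w * s for w in row] for row, s in zip(weight, scales)]

y_ref = matvec(x, weight)
y_smooth = matvec(x_smooth, w_smooth)
print("largest activation before:", max(abs(v) for v in x))
print("largest activation after: ", max(abs(v) for v in x_smooth))
```

The product is mathematically unchanged because each scale cancels between the activation and its weight row, while the outlier activation shrinks and becomes easier to represent in INT8. Larger alpha values shift more of the quantization difficulty from activations onto weights.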
Example 3: Using the Evaluator to assess quantization quality
# After loading the quantized model in initialize()
evaluator = IpexLLMHandler.Evaluator(
    model=self.model,
    tokenizer=self.tokenizer,
    dataset_name="wikitext",
)
metrics = evaluator.evaluate(max_length=512, stride=256)
print(f"Perplexity: {metrics['perplexity']:.2f}")
print(f"Accuracy: {metrics['accuracy']:.4f}")
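When comparing a quantized model against a baseline, two simple summary numbers are the fraction of positions where both models predict the same token and the relative increase in perplexity. The sketch below is an assumption about how such a comparison could be computed; the helper names are hypothetical, the input values are made up for illustration, and the document does not specify how the Evaluator defines its accuracy metric.

```python
def token_agreement(baseline_ids, quantized_ids):
    """Fraction of positions where both models produced the same token ID."""
    if len(baseline_ids) != len(quantized_ids):
        raise ValueError("sequences must have the same length")
    if not baseline_ids:
        raise ValueError("sequences must not be empty")
    matches = sum(b == q for b, q in zip(baseline_ids, quantized_ids))
    return matches / len(baseline_ids)


def relative_perplexity_increase(baseline_ppl, quantized_ppl):
    """Relative change in perplexity; positive means the quantized model is worse."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl


# Made-up token IDs and perplexities, for illustration only.
agreement = token_agreement([5, 9, 2, 7], [5, 9, 3, 7])
increase = relative_perplexity_increase(10.0, 10.5)
print(f"Token agreement: {agreement:.2f}")
print(f"Perplexity increase: {increase:.1%}")
```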