Implementation:Pytorch Serve IpexLLMHandler

Knowledge Sources	Pytorch_Serve
Domains	LLM_Serving, Quantization, Intel_Optimization
Last Updated	2026-02-13 18:52 GMT

Overview

IpexLLMHandler is a TorchServe handler that serves large language models with INT8 quantization through Intel Extension for PyTorch (IPEX). It extends BaseHandler and supports Weight-Only Quantization (WoQ) and SmoothQuant (SQ) via intel_extension_for_pytorch. The handler covers multiple model architectures, including T5, Qwen, ChatGLM, and MPT, and includes an inner Evaluator class for assessing model quality.

Description

The IpexLLMHandler class (lines 48-659) provides a specialized handler for running LLMs on Intel CPUs with IPEX quantization optimizations. It handles model loading with quantization configuration, tokenization for multiple model architectures, and text generation inference.

Key Responsibilities

IPEX Quantization: Configures and applies Weight-Only Quantization (WoQ) or SmoothQuant (SQ) via intel_extension_for_pytorch
Multi-Architecture Support: Handles diverse model architectures (T5, Qwen, ChatGLM, MPT, LLaMA, GPT-NeoX, OPT, Bloom, Falcon) with architecture-specific tokenization and generation logic
Model Initialization: Loads HuggingFace models and applies IPEX optimization passes with quantization during initialize()
Evaluation: Includes an inner Evaluator class (lines 232-409) for assessing model quality post-quantization using perplexity and accuracy metrics
Text Generation: Implements the preprocess/inference/postprocess pipeline for text generation tasks
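
The three-stage flow named in the last item can be illustrated with a toy stand-in. This is only a sketch: `ToyTextHandler` is a hypothetical class with no model, tokenizer, or TorchServe dependency, and its `handle()` method simply chains the stages in the order the responsibilities above describe.

```python
class ToyTextHandler:
    """Hypothetical stand-in that mirrors the preprocess/inference/postprocess flow."""

    def preprocess(self, data):
        # Real handlers tokenize here; this toy version just splits on whitespace.
        return [row["data"].split() for row in data]

    def inference(self, batch):
        # Real handlers call the model's generation routine; this appends a marker token.
        return [tokens + ["<eos>"] for tokens in batch]

    def postprocess(self, outputs):
        # Real handlers decode token IDs; this joins the toy tokens back into text.
        return [" ".join(tokens) for tokens in outputs]

    def handle(self, data):
        # Chain the three stages, one response per request in the batch.
        return self.postprocess(self.inference(self.preprocess(data)))


responses = ToyTextHandler().handle([{"data": "hello world"}])
```

The point of the sketch is the contract between stages: each stage consumes exactly what the previous one returns, and the final list has one entry per request.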

Inner Class: Evaluator

The Evaluator class (lines 232-409) provides methods for evaluating quantized model quality. It measures perplexity on standard benchmarks and compares quantized model outputs against baseline reference outputs.
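
As background for the perplexity metric mentioned above, the following minimal sketch computes perplexity from per-token log-probabilities using the standard definition (the exponential of the mean negative log-likelihood). The function name `perplexity` and its list-based input are assumptions for illustration; how the Evaluator obtains log-probabilities is not shown in this page.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 8)
```

Lower perplexity means the model assigns higher probability to the reference text, so comparing the value before and after quantization gives a rough measure of quality loss.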

Usage

# The handler is configured in model-config.yaml:
# handler:
#   model_name: "meta-llama/Llama-2-7b-hf"
#   quantization: "woq"  # or "sq"
#   dtype: "int8"
#   batch_size: 1

# Creating a model archive with the IPEX LLM handler.
# --version is a required argument of torch-model-archiver.
torch-model-archiver --model-name ipex_llm \
    --version 1.0 \
    --handler examples/large_models/ipex_llm_int8/llm_handler.py \
    --config-file model-config.yaml \
    --archive-format no-archive
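
Before archiving, it can help to sanity-check the `handler:` section. The sketch below is a hypothetical helper, not part of the handler or of TorchServe. It checks only the keys shown in the configuration above and assumes that `"woq"` and `"sq"` are the only accepted `quantization` values.

```python
ALLOWED_QUANTIZATION = {"woq", "sq"}


def validate_handler_config(handler_cfg):
    """Return a list of problems found in the handler section (empty if none)."""
    problems = []
    if not handler_cfg.get("model_name"):
        problems.append("model_name is required")
    if handler_cfg.get("quantization") not in ALLOWED_QUANTIZATION:
        problems.append("quantization must be 'woq' or 'sq'")
    batch_size = handler_cfg.get("batch_size", 1)
    if not isinstance(batch_size, int) or batch_size < 1:
        problems.append("batch_size must be a positive integer")
    return problems


# The handler section from the configuration above, written as a plain dict.
cfg = {
    "model_name": "meta-llama/Llama-2-7b-hf",
    "quantization": "woq",
    "dtype": "int8",
    "batch_size": 1,
}
problems = validate_handler_config(cfg)
```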

Code Reference

Source Location

File	Lines	Repository / Scope
`examples/large_models/ipex_llm_int8/llm_handler.py`	L1-659	pytorch/serve
`examples/large_models/ipex_llm_int8/llm_handler.py`	L48-659	`IpexLLMHandler` class definition
`examples/large_models/ipex_llm_int8/llm_handler.py`	L232-409	`Evaluator` inner class

Signature

class IpexLLMHandler(BaseHandler):
    """
    IPEX-optimized handler for serving LLMs with INT8 quantization.

    Supports Weight-Only Quantization (WoQ) and SmoothQuant (SQ) techniques
    via intel_extension_for_pytorch. Handles multiple model architectures
    including T5, Qwen, ChatGLM, MPT, LLaMA, and others.

    Attributes:
        model: The quantized and optimized LLM.
        tokenizer: HuggingFace tokenizer instance.
        device (torch.device): Target compute device (typically CPU for IPEX).
        quantization_type (str): 'woq' or 'sq' quantization method.
        dtype (str): Target data type (e.g., 'int8').
    """

    class Evaluator:
        """
        Inner class for evaluating quantized model quality.

        Measures perplexity and accuracy on standard benchmarks to
        assess the impact of quantization on model outputs.

        Attributes:
            model: The quantized model to evaluate.
            tokenizer: Tokenizer for preparing evaluation inputs.
            dataset: Evaluation dataset (e.g., WikiText-2).
        """

        def __init__(self, model, tokenizer, dataset_name="wikitext"):
            """
            Initialize the evaluator.

            Args:
                model: The quantized model.
                tokenizer: Tokenizer matching the model.
                dataset_name (str): Name of the evaluation dataset.
            """
            ...

        def evaluate(self, max_length=512, stride=256):
            """
            Run evaluation and compute perplexity.

            Args:
                max_length (int): Maximum sequence length.
                stride (int): Stride for sliding window evaluation.

            Returns:
                dict: Evaluation metrics including perplexity.
            """
            ...

    def initialize(self, context):
        """
        Load model with IPEX quantization.

        Loads HuggingFace model, applies IPEX optimization with WoQ or SQ
        quantization, and initializes the tokenizer.

        Args:
            context: TorchServe context with system_properties and model_yaml_config.
        """
        ...

    def preprocess(self, data):
        """
        Tokenize input text for the specific model architecture.

        Handles architecture-specific tokenization patterns for T5, Qwen,
        ChatGLM, MPT, and other supported models.

        Args:
            data (list): List of request dicts with text input.

        Returns:
            dict: Tokenized inputs with input_ids and attention_mask.
        """
        ...

    def inference(self, data, *args, **kwargs):
        """
        Generate text using the IPEX-optimized model.

        Args:
            data (dict): Tokenized inputs from preprocess().

        Returns:
            torch.Tensor: Generated token IDs.
        """
        ...

    def postprocess(self, data):
        """
        Decode generated tokens to text.

        Args:
            data (torch.Tensor): Generated token IDs.

        Returns:
            list[str]: Decoded text strings.
        """
        ...

Import

# Handler is loaded by TorchServe from the model archive.
# Internal imports used by the handler:
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer
from ts.torch_handler.base_handler import BaseHandler

I/O Contract

Method	Input	Output	Notes
`initialize(context)`	`context`: Context with `system_properties`, `model_yaml_config`	None (sets `self.model`, `self.tokenizer`)	Applies IPEX WoQ or SQ quantization
`preprocess(data)`	`data`: `list[dict]` with text in `"data"` or `"body"` key	`dict` with `input_ids` and `attention_mask` tensors	Architecture-specific tokenization
`inference(data)`	`data`: `dict` with tokenized inputs	`torch.Tensor` of generated token IDs	Runs text generation on the IPEX-optimized model
`postprocess(data)`	`data`: `torch.Tensor` of token IDs	`list[str]` decoded text strings	Tokenizer decode
`Evaluator.__init__(model, tokenizer, dataset_name)`	Model, tokenizer, dataset name	`Evaluator` instance	Lines 232-250
`Evaluator.evaluate(max_length, stride)`	max_length (`int`), stride (`int`)	`dict` with perplexity and accuracy metrics	Sliding window evaluation
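
The `preprocess` row says the text arrives under a `"data"` or `"body"` key. A minimal sketch of that lookup is shown below. `extract_prompts` is a hypothetical helper, and the byte-decoding step is an assumption based on the common TorchServe pattern of receiving raw request bytes; the handler's own code may differ.

```python
def extract_prompts(requests):
    """Pull the prompt text out of each request dict, preferring "data" over "body"."""
    prompts = []
    for row in requests:
        payload = row.get("data")
        if payload is None:
            payload = row.get("body")
        # Raw request payloads may arrive as bytes; decode them to text.
        if isinstance(payload, (bytes, bytearray)):
            payload = payload.decode("utf-8")
        prompts.append(payload)
    return prompts


prompts = extract_prompts([{"data": b"hi"}, {"body": "there"}])
```

The resulting list of strings is what a tokenizer call would then turn into `input_ids` and `attention_mask`.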

Supported Model Architectures

Architecture	Model Examples	Special Handling
T5	google/flan-t5-xl, google/flan-t5-xxl	Encoder-decoder; uses `AutoModelForSeq2SeqLM`
Qwen	Qwen/Qwen-7B-Chat, Qwen/Qwen-14B	Custom chat template tokenization
ChatGLM	THUDM/chatglm2-6b, THUDM/chatglm3-6b	Custom tokenization with `build_chat_input()`
MPT	mosaicml/mpt-7b, mosaicml/mpt-30b	Custom attention configuration
LLaMA	meta-llama/Llama-2-7b-hf	Standard causal LM tokenization
GPT-NeoX	EleutherAI/gpt-neox-20b	Standard causal LM tokenization
OPT	facebook/opt-6.7b, facebook/opt-30b	Standard causal LM tokenization
Bloom	bigscience/bloom-7b1	Standard causal LM tokenization
Falcon	tiiuae/falcon-7b, tiiuae/falcon-40b	Standard causal LM tokenization
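
One way to picture the per-architecture branching in the table is a lookup on the model identifier. The sketch below is purely illustrative: `FAMILY_MARKERS`, `detect_family`, and `auto_class_name` are hypothetical names, the substring matching is an assumption, and the only class choice taken from the table is that T5 uses `AutoModelForSeq2SeqLM`. Class names are returned as strings so that the example runs without `transformers` installed.

```python
# Substring markers checked in order; a hypothetical lookup, not the handler's own logic.
FAMILY_MARKERS = [
    ("t5", "T5"),
    ("qwen", "Qwen"),
    ("chatglm", "ChatGLM"),
    ("mpt", "MPT"),
    ("llama", "LLaMA"),
    ("gpt-neox", "GPT-NeoX"),
    ("opt", "OPT"),
    ("bloom", "Bloom"),
    ("falcon", "Falcon"),
]


def detect_family(model_name):
    """Return the architecture family for a model identifier, or None if unknown."""
    name = model_name.lower()
    for marker, family in FAMILY_MARKERS:
        if marker in name:
            return family
    return None


def auto_class_name(model_name):
    """T5 is encoder-decoder; every other family in the table is treated as causal."""
    if detect_family(model_name) == "T5":
        return "AutoModelForSeq2SeqLM"
    return "AutoModelForCausalLM"
```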

Quantization Methods

Method	Key	Description
Weight-Only Quantization (WoQ)	`"woq"`	Quantizes only the model weights to INT8; activations stay in floating point (for example FP32 or BF16). Reduces the memory footprint, typically with a small accuracy loss.
SmoothQuant (SQ)	`"sq"`	Rescales activations and weights per channel so that activation outliers are shifted into the weights, then quantizes both weights and activations to INT8 for higher throughput.
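
To make the weight-only row concrete, here is a minimal pure-Python sketch of symmetric INT8 quantization with a single scale for the whole tensor. This is a simplification chosen for illustration: the per-tensor scale, the `[-127, 127]` range, and the function names are assumptions, and IPEX's actual WoQ kernels are not reproduced here.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: returns (list of ints, float scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the symmetric INT8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integer values back to approximate floating-point weights."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
```

With round-to-nearest, each restored weight differs from the original by at most half a quantization step (`scale / 2`), which is why the accuracy impact tends to be small when the weight distribution has no extreme outliers.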

Usage Examples

Example 1: Serving a Llama-2 model with WoQ quantization

# model-config.yaml
minWorkers: 1
maxWorkers: 1
handler:
    model_name: "meta-llama/Llama-2-7b-hf"
    quantization: "woq"
    dtype: "int8"
    max_new_tokens: 256
    batch_size: 1

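# Start TorchServe with the IPEX-optimized model.
# With --archive-format no-archive the archiver writes a directory named
# ipex_llm (not a .mar file); move it into model_store and register it by name.
torchserve --start --ncs --model-store model_store \
    --models ipex_llm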

# Send an inference request
curl -X POST http://localhost:8080/predictions/ipex_llm \
    -H "Content-Type: application/json" \
    -d '{"data": "Explain the benefits of model quantization:"}'
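
The same request can be built from Python with only the standard library. This is a sketch under the same assumptions as the curl command (inference API on `localhost:8080`, model registered as `ipex_llm`); `build_prediction_request` is a hypothetical helper name, and the actual send is left commented out so the snippet runs without a server.

```python
import json
import urllib.request


def build_prediction_request(prompt, host="http://localhost:8080", model="ipex_llm"):
    """Build a POST request equivalent to the curl call above."""
    body = json.dumps({"data": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{host}/predictions/{model}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_prediction_request("Explain the benefits of model quantization:")

# To send it against a running TorchServe instance:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```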

Example 2: SmoothQuant configuration for maximum throughput

# model-config.yaml for SmoothQuant
minWorkers: 1
maxWorkers: 1
handler:
    model_name: "meta-llama/Llama-2-13b-hf"
    quantization: "sq"
    dtype: "int8"
    max_new_tokens: 512
    batch_size: 4
    smooth_quant_alpha: 0.5
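
The role of the alpha value can be sketched with the per-channel smoothing factor from the SmoothQuant method, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), where activations are divided by s_j and the matching weight rows are multiplied by it. The code below is an illustration only: it assumes `smooth_quant_alpha` plays the role of alpha, and the function name and sample values are hypothetical.

```python
def smoothing_scales(act_absmax, weight_absmax, alpha=0.5):
    """Per-input-channel factors s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    return [
        a ** alpha / w ** (1.0 - alpha)
        for a, w in zip(act_absmax, weight_absmax)
    ]


# Channel 0 has an activation outlier; channel 1 has large weights instead.
act_absmax = [16.0, 1.0]
weight_absmax = [1.0, 4.0]
scales = smoothing_scales(act_absmax, weight_absmax, alpha=0.5)

# After smoothing, activations are divided by s_j and weights multiplied by s_j.
smoothed_act = [a / s for a, s in zip(act_absmax, scales)]
smoothed_weight = [w * s for w, s in zip(weight_absmax, scales)]
```

With alpha = 0.5 the per-channel maxima of activations and weights become equal, which splits the quantization difficulty evenly between the two. Larger alpha moves more of the range into the weights, smaller alpha leaves more in the activations.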

Example 3: Using the Evaluator to assess quantization quality

# Inside the handler, after initialize() has loaded the quantized model
evaluator = IpexLLMHandler.Evaluator(
    model=self.model,
    tokenizer=self.tokenizer,
    dataset_name="wikitext"
)

metrics = evaluator.evaluate(max_length=512, stride=256)
print(f"Perplexity: {metrics['perplexity']:.2f}")
print(f"Accuracy: {metrics['accuracy']:.4f}")
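
The `max_length` and `stride` arguments describe a sliding-window evaluation. The sketch below shows one common way to lay out such windows, following the widely used fixed-length perplexity recipe: each window spans up to `max_length` tokens, and only the tokens not covered by the previous window are scored. `sliding_windows` is a hypothetical helper, and the Evaluator's actual windowing may differ.

```python
def sliding_windows(seq_len, max_length=512, stride=256):
    """Return (begin, end, newly_scored_tokens) for each evaluation window."""
    windows = []
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        # Tokens before prev_end serve only as context; the rest are scored.
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == seq_len:
            break
    return windows


windows = sliding_windows(1000, max_length=512, stride=256)
# windows == [(0, 512, 512), (256, 768, 256), (512, 1000, 232)]
```

Every token is scored exactly once, and all windows after the first give the model up to `max_length - stride` tokens of preceding context. A smaller stride gives more context per scored token at the cost of more forward passes.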

Related Pages

Principle:Pytorch_Serve_IPEX_Quantized_Inference -- The principle of IPEX-based quantized inference for efficient CPU serving
Implementation:Pytorch_Serve_BaseHandler -- Base handler class extended by IpexLLMHandler
Implementation:Pytorch_Serve_Accelerate_Handler -- Alternative handler using HuggingFace Accelerate for distributed inference
