Implementation:Pytorch Serve IpexLLMHandler

From Leeroopedia
Revision as of 13:46, 16 February 2026 by Admin (auto-imported from implementations/Pytorch_Serve_IpexLLMHandler.md)
Knowledge Sources
Domains: LLM_Serving, Quantization, Intel_Optimization
Last Updated: 2026-02-13 18:52 GMT

Overview

IpexLLMHandler is a TorchServe handler, optimized with Intel Extension for PyTorch (IPEX), for serving large language models on Intel CPUs with INT8 quantization. It extends BaseHandler and supports Weight-Only Quantization (WoQ) and SmoothQuant (SQ) via intel_extension_for_pytorch. It handles several model architectures, including T5, Qwen, ChatGLM, MPT, and LLaMA, and includes an inner Evaluator class for assessing model quality after quantization.

Description

The IpexLLMHandler class (lines 48-659) provides a specialized handler for running LLMs on Intel CPUs with IPEX quantization optimizations. It handles model loading with quantization configuration, tokenization for multiple model architectures, and text generation inference.

Key Responsibilities

  • IPEX Quantization: Configures and applies Weight-Only Quantization (WoQ) or SmoothQuant (SQ) via intel_extension_for_pytorch
  • Multi-Architecture Support: Handles diverse model architectures (T5, Qwen, ChatGLM, MPT, LLaMA, GPT-NeoX, OPT, Bloom, Falcon) with architecture-specific tokenization and generation logic
  • Model Initialization: Loads HuggingFace models and applies IPEX optimization passes with quantization during initialize()
  • Evaluation: Includes an inner Evaluator class (lines 232-409) for assessing model quality post-quantization using perplexity and accuracy metrics
  • Text Generation: Implements the preprocess/inference/postprocess pipeline for text generation tasks
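
The preprocess/inference/postprocess flow listed above can be sketched without TorchServe or a real model. The class below is a toy stand-in, not the handler's code: the name ToyHandler, the whitespace "tokenizer", and the reversing "model" are all assumptions made for illustration. It only shows the order in which the three stages run over a batch of requests, which matches how BaseHandler.handle() chains them.

```python
class ToyHandler:
    """Illustrative stand-in for a TorchServe-style handler (hypothetical)."""

    def __init__(self):
        self.vocab = {}          # token -> id, grown on demand
        self.inverse_vocab = {}  # id -> token

    def _token_id(self, token):
        # Assign the next free integer id to tokens not seen before.
        if token not in self.vocab:
            idx = len(self.vocab)
            self.vocab[token] = idx
            self.inverse_vocab[idx] = token
        return self.vocab[token]

    def preprocess(self, requests):
        # Each request is a dict; this sketch reads the text from "data" only.
        texts = [row["data"] for row in requests]
        return {"input_ids": [[self._token_id(t) for t in text.split()] for text in texts]}

    def inference(self, inputs):
        # A real handler would call the model's generate(); here each
        # sequence is simply reversed so the output differs from the input.
        return [list(reversed(ids)) for ids in inputs["input_ids"]]

    def postprocess(self, outputs):
        # Map ids back to tokens and join them into one string per request.
        return [" ".join(self.inverse_vocab[i] for i in ids) for ids in outputs]

    def handle(self, requests):
        # Same stage order as the handler pipeline: tokenize, generate, decode.
        return self.postprocess(self.inference(self.preprocess(requests)))
```

Calling `ToyHandler().handle([{"data": "a b c"}])` returns one decoded string per request, which is the shape the postprocess stage is expected to produce.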

Inner Class: Evaluator

The Evaluator class (lines 232-409) provides methods for evaluating quantized model quality. It measures perplexity on standard benchmarks and compares quantized model outputs against baseline reference outputs.
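
As a rough illustration of sliding-window perplexity, the sketch below shows how windows of max_length tokens advanced by stride could be laid out, and how perplexity follows from per-token negative log-likelihoods. The function names are hypothetical and this is not the Evaluator's code; it assumes the common scheme in which each window scores only the tokens that earlier windows have not already scored, and it counts tokens rather than prediction targets for simplicity.

```python
import math


def sliding_windows(seq_len, max_length=512, stride=256):
    """Yield (begin, end, n_new) for each evaluation window.

    n_new is the number of tokens at the end of the window that were not
    covered by any previous window and are therefore newly scored.
    """
    prev_end = 0
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        yield begin, end, end - prev_end
        prev_end = end
        if end == seq_len:
            break


def perplexity(token_nlls):
    """Return exp(mean negative log-likelihood), using natural logs."""
    if not token_nlls:
        raise ValueError("need at least one scored token")
    return math.exp(sum(token_nlls) / len(token_nlls))
```

With these definitions every token is scored exactly once, and a smaller stride gives each scored token more left context at the cost of more forward passes.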

Usage

# The handler is configured in model-config.yaml:
# handler:
#   model_name: "meta-llama/Llama-2-7b-hf"
#   quantization: "woq"  # or "sq"
#   dtype: "int8"
#   batch_size: 1

# Create a model archive with the IPEX LLM handler
# (torch-model-archiver requires --version)
torch-model-archiver --model-name ipex_llm \
    --version 1.0 \
    --handler examples/large_models/ipex_llm_int8/llm_handler.py \
    --config-file model-config.yaml \
    --archive-format no-archive
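
To make the configuration keys above concrete, the following sketch validates the handler section of a parsed model-config.yaml, in the dict form that context.model_yaml_config would carry. The HandlerConfig class, its defaults, and its error messages are assumptions for illustration; the actual handler may read and validate these keys differently.

```python
from dataclasses import dataclass

# Quantization keys documented on this page.
SUPPORTED_QUANTIZATION = {"woq", "sq"}


@dataclass(frozen=True)
class HandlerConfig:
    """Hypothetical typed view of the `handler:` section."""

    model_name: str
    quantization: str = "woq"  # assumed default, not confirmed by the source
    dtype: str = "int8"
    batch_size: int = 1

    @classmethod
    def from_yaml_dict(cls, model_yaml_config):
        # model_yaml_config is the whole parsed YAML; settings live under "handler".
        section = dict(model_yaml_config.get("handler", {}))
        if "model_name" not in section:
            raise ValueError("handler.model_name is required")
        quantization = str(section.get("quantization", "woq")).lower()
        if quantization not in SUPPORTED_QUANTIZATION:
            raise ValueError(f"unsupported quantization: {quantization!r}")
        batch_size = int(section.get("batch_size", 1))
        if batch_size < 1:
            raise ValueError("handler.batch_size must be at least 1")
        return cls(
            model_name=section["model_name"],
            quantization=quantization,
            dtype=str(section.get("dtype", "int8")),
            batch_size=batch_size,
        )
```

Failing early on an unknown quantization key keeps a misconfigured worker from loading a multi-gigabyte model before the error surfaces.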

Code Reference

Source Location

File | Lines | Repository / Scope
examples/large_models/ipex_llm_int8/llm_handler.py | L1-659 | pytorch/serve (full file)
examples/large_models/ipex_llm_int8/llm_handler.py | L48-659 | IpexLLMHandler class definition
examples/large_models/ipex_llm_int8/llm_handler.py | L232-409 | Evaluator inner class

Signature

class IpexLLMHandler(BaseHandler):
    """
    IPEX-optimized handler for serving LLMs with INT8 quantization.

    Supports Weight-Only Quantization (WoQ) and SmoothQuant (SQ) techniques
    via intel_extension_for_pytorch. Handles multiple model architectures
    including T5, Qwen, ChatGLM, MPT, LLaMA, and others.

    Attributes:
        model: The quantized and optimized LLM.
        tokenizer: HuggingFace tokenizer instance.
        device (torch.device): Target compute device (typically CPU for IPEX).
        quantization_type (str): 'woq' or 'sq' quantization method.
        dtype (str): Target data type (e.g., 'int8').
    """

    class Evaluator:
        """
        Inner class for evaluating quantized model quality.

        Measures perplexity and accuracy on standard benchmarks to
        assess the impact of quantization on model outputs.

        Attributes:
            model: The quantized model to evaluate.
            tokenizer: Tokenizer for preparing evaluation inputs.
            dataset: Evaluation dataset (e.g., WikiText-2).
        """

        def __init__(self, model, tokenizer, dataset_name="wikitext"):
            """
            Initialize the evaluator.

            Args:
                model: The quantized model.
                tokenizer: Tokenizer matching the model.
                dataset_name (str): Name of the evaluation dataset.
            """
            ...

        def evaluate(self, max_length=512, stride=256):
            """
            Run evaluation and compute perplexity.

            Args:
                max_length (int): Maximum sequence length.
                stride (int): Stride for sliding window evaluation.

            Returns:
                dict: Evaluation metrics including perplexity.
            """
            ...

    def initialize(self, context):
        """
        Load model with IPEX quantization.

        Loads HuggingFace model, applies IPEX optimization with WoQ or SQ
        quantization, and initializes the tokenizer.

        Args:
            context: TorchServe context with system_properties and model_yaml_config.
        """
        ...

    def preprocess(self, data):
        """
        Tokenize input text for the specific model architecture.

        Handles architecture-specific tokenization patterns for T5, Qwen,
        ChatGLM, MPT, and other supported models.

        Args:
            data (list): List of request dicts with text input.

        Returns:
            dict: Tokenized inputs with input_ids and attention_mask.
        """
        ...

    def inference(self, data, *args, **kwargs):
        """
        Generate text using the IPEX-optimized model.

        Args:
            data (dict): Tokenized inputs from preprocess().

        Returns:
            torch.Tensor: Generated token IDs.
        """
        ...

    def postprocess(self, data):
        """
        Decode generated tokens to text.

        Args:
            data (torch.Tensor): Generated token IDs.

        Returns:
            list[str]: Decoded text strings.
        """
        ...

Import

# Handler is loaded by TorchServe from the model archive.
# Internal imports used by the handler:
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer
from ts.torch_handler.base_handler import BaseHandler

I/O Contract

Method | Input | Output | Notes
initialize(context) | context: Context with system_properties, model_yaml_config | None (sets self.model, self.tokenizer) | Applies IPEX WoQ or SQ quantization
preprocess(data) | data: list[dict] with text in "data" or "body" key | dict with input_ids and attention_mask tensors | Architecture-specific tokenization
inference(data) | data: dict with tokenized inputs | torch.Tensor of generated token IDs | Uses IPEX-optimized model forward pass
postprocess(data) | data: torch.Tensor of token IDs | list[str] decoded text strings | Tokenizer decode
Evaluator.__init__(model, tokenizer, dataset_name) | Model, tokenizer, dataset name | Evaluator instance | Lines 232-250
Evaluator.evaluate(max_length, stride) | max_length (int), stride (int) | dict with perplexity and accuracy metrics | Sliding window evaluation
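
The preprocess row states that text arrives under a "data" or "body" key. A minimal sketch of normalizing such a request row into a prompt string is shown below. The helper name extract_text is hypothetical, and the handling of bytes and JSON-wrapped bodies is an assumption about how payloads may arrive rather than a description of the handler's code.

```python
import json


def extract_text(row):
    """Return the prompt text from one request row (hypothetical helper)."""
    payload = row.get("data")
    if payload is None:
        payload = row.get("body")
    if payload is None:
        return ""
    # Raw request bodies may be delivered as bytes.
    if isinstance(payload, (bytes, bytearray)):
        payload = payload.decode("utf-8")
    if isinstance(payload, str):
        # A JSON body such as '{"data": "..."}' arrives as text; try to unwrap it.
        try:
            payload = json.loads(payload)
        except json.JSONDecodeError:
            return payload  # plain text prompt
    if isinstance(payload, dict):
        return str(payload.get("data", ""))
    return str(payload)
```

The result for each row can then be passed to the tokenizer, so the rest of preprocess does not need to care how the client encoded the request.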

Supported Model Architectures

Architecture | Model Examples | Special Handling
T5 | google/flan-t5-xl, google/flan-t5-xxl | Encoder-decoder; uses AutoModelForSeq2SeqLM
Qwen | Qwen/Qwen-7B-Chat, Qwen/Qwen-14B | Custom chat template tokenization
ChatGLM | THUDM/chatglm2-6b, THUDM/chatglm3-6b | Custom tokenization with build_chat_input()
MPT | mosaicml/mpt-7b, mosaicml/mpt-30b | Custom attention configuration
LLaMA | meta-llama/Llama-2-7b-hf | Standard causal LM tokenization
GPT-NeoX | EleutherAI/gpt-neox-20b | Standard causal LM tokenization
OPT | facebook/opt-6.7b, facebook/opt-30b | Standard causal LM tokenization
Bloom | bigscience/bloom-7b1 | Standard causal LM tokenization
Falcon | tiiuae/falcon-7b, tiiuae/falcon-40b | Standard causal LM tokenization
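
One way to picture the architecture-specific branching in the table is a small lookup keyed on the model name. The sketch below is purely illustrative: select_strategy and the rule list are hypothetical, the "family" labels are an assumption layered on the table's notes, and substring matching on the name is a crude stand-in for inspecting the model's configuration (for example its model_type), which a real handler would more likely do.

```python
# (marker found in the lowercased model name, strategy taken from the table above)
ARCHITECTURE_RULES = [
    ("t5", {"family": "seq2seq", "note": "encoder-decoder; uses AutoModelForSeq2SeqLM"}),
    ("qwen", {"family": "causal", "note": "custom chat template tokenization"}),
    ("chatglm", {"family": "causal", "note": "custom tokenization with build_chat_input()"}),
    ("mpt", {"family": "causal", "note": "custom attention configuration"}),
]

# LLaMA, GPT-NeoX, OPT, Bloom, and Falcon fall through to the default.
DEFAULT_STRATEGY = {"family": "causal", "note": "standard causal LM tokenization"}


def select_strategy(model_name):
    """Return a copy of the first matching strategy, or the default."""
    name = model_name.lower()
    for marker, strategy in ARCHITECTURE_RULES:
        if marker in name:
            return dict(strategy)
    return dict(DEFAULT_STRATEGY)
```

Keeping the rules in a list makes the special cases visible in one place, and any model not listed is treated as a standard causal LM.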

Quantization Methods

Method | Key | Description
Weight-Only Quantization (WoQ) | "woq" | Quantizes only model weights to INT8; activations remain in FP32. Lower memory footprint with minimal accuracy loss.
SmoothQuant (SQ) | "sq" | Applies channel-wise smoothing to activations before quantization. Both weights and activations are quantized for maximum throughput.
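
To give a feel for what quantizing only the weights to INT8 means, here is a minimal pure-Python sketch of symmetric quantization with one scale per weight row. It illustrates the general idea only; it is not how intel_extension_for_pytorch implements WoQ, and the function names and the per-row granularity are assumptions.

```python
def quantize_row_int8(row):
    """Quantize one row of float weights to signed 8-bit integers.

    Returns (quantized_values, scale) where weight is approximately
    quantized_value * scale.
    """
    max_abs = max((abs(w) for w in row), default=0.0)
    # Map the largest magnitude to 127; an all-zero row keeps scale 1.0.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in row]
    return quantized, scale


def dequantize_row(quantized, scale):
    """Recover approximate float weights from INT8 values and their scale."""
    return [q * scale for q in quantized]
```

Each stored weight shrinks from four bytes to one plus a shared scale per row, and the rounding error per weight is bounded by half the scale, which is why accuracy loss tends to stay small when weight ranges are well behaved.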

Usage Examples

Example 1: Serving a Llama-2 model with WoQ quantization

# model-config.yaml
minWorkers: 1
maxWorkers: 1
handler:
    model_name: "meta-llama/Llama-2-7b-hf"
    quantization: "woq"
    dtype: "int8"
    max_new_tokens: 256
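    batch_size: 1

# Start TorchServe with the IPEX-optimized model.
# If the archive was built with --archive-format no-archive (as in Usage above),
# the archiver writes a directory named ipex_llm rather than ipex_llm.mar;
# that directory must be placed inside model_store.
torchserve --start --ncs --model-store model_store \
    --models ipex_llm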

# Send an inference request
curl -X POST http://localhost:8080/predictions/ipex_llm \
    -H "Content-Type: application/json" \
    -d '{"data": "Explain the benefits of model quantization:"}'
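
For clients written in Python, the curl request above can be expressed with the standard library. This is a sketch under the same assumptions as the curl example (host, port, and the model name ipex_llm); the helper names are hypothetical, and the response format depends on what the handler's postprocess returns.

```python
import json
import urllib.request


def build_prediction_request(prompt, host="http://localhost:8080", model="ipex_llm"):
    """Build a POST request equivalent to the curl example above."""
    body = json.dumps({"data": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{host}/predictions/{model}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request to a running TorchServe instance and return the body text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a server running, `send(build_prediction_request("Explain the benefits of model quantization:"))` would return the generated text. Generation on CPU can be slow, so a generous timeout is a reasonable default.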

Example 2: SmoothQuant configuration for maximum throughput

# model-config.yaml for SmoothQuant
minWorkers: 1
maxWorkers: 1
handler:
    model_name: "meta-llama/Llama-2-13b-hf"
    quantization: "sq"
    dtype: "int8"
    max_new_tokens: 512
    batch_size: 4
    smooth_quant_alpha: 0.5
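
The smooth_quant_alpha setting presumably corresponds to the migration strength α from the SmoothQuant method, where each input channel j gets a scale s_j = max|X_j|^α / max|W_j|^(1-α). Activations are divided by s_j and the matching weight rows are multiplied by it, so the layer output is unchanged while activation outliers shrink. The sketch below illustrates that arithmetic in plain Python; the function names are hypothetical and this is not the IPEX implementation.

```python
def smoothing_scales(act_absmax, weight_absmax, alpha=0.5, eps=1e-8):
    """Per-input-channel scales: act_absmax**alpha / weight_absmax**(1 - alpha)."""
    return [
        max(a, eps) ** alpha / max(w, eps) ** (1.0 - alpha)
        for a, w in zip(act_absmax, weight_absmax)
    ]


def apply_smoothing(activations, weights, scales):
    """Divide activation channels by the scales and multiply weight rows by them.

    activations: list of rows, one value per input channel.
    weights: indexed as weights[in_channel][out_channel].
    """
    smoothed_acts = [[x / s for x, s in zip(row, scales)] for row in activations]
    smoothed_weights = [[w * s for w in row] for row, s in zip(weights, scales)]
    return smoothed_acts, smoothed_weights


def matmul(a, b):
    """Plain matrix product, used here only to check the output is preserved."""
    return [
        [sum(x * b[k][j] for k, x in enumerate(row)) for j in range(len(b[0]))]
        for row in a
    ]
```

With α = 0.5 the quantization difficulty is split evenly between activations and weights; values closer to 1 move more of it onto the weights.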

Example 3: Using the Evaluator to assess quantization quality

# Inside IpexLLMHandler.initialize(), after the quantized model has been loaded
evaluator = IpexLLMHandler.Evaluator(
    model=self.model,
    tokenizer=self.tokenizer,
    dataset_name="wikitext"
)

metrics = evaluator.evaluate(max_length=512, stride=256)
print(f"Perplexity: {metrics['perplexity']:.2f}")
print(f"Accuracy: {metrics['accuracy']:.4f}")
