Workflow: Hugging Face Optimum Accelerated Inference Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Inference, Model_Optimization, MLOps |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
End-to-end process for running accelerated model inference using the Optimum pipeline API with hardware-specific backends (ONNX Runtime, OpenVINO, Intel IPEX).
Description
This workflow describes how to use the optimum.pipelines.pipeline() factory function to create an inference pipeline backed by an optimized runtime. The function mirrors the familiar transformers.pipeline() API but routes model loading and execution to a hardware-accelerated backend. It automatically detects which backend packages are installed (optimum-onnx for ONNX Runtime, optimum-intel for OpenVINO/IPEX) and delegates to the appropriate backend-specific pipeline implementation.
Key aspects:
- Drop-in replacement for transformers.pipeline() with an accelerator parameter
- Supports 30+ tasks (text classification, generation, QA, image classification, speech recognition, etc.)
- Three accelerator backends: ONNX Runtime ("ort"), OpenVINO ("ov"), Intel IPEX ("ipex")
- Automatic backend detection when accelerator is not explicitly specified
- Full compatibility with Hugging Face Hub model identifiers and local model paths
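The drop-in usage described above can be sketched as follows. The model ID is illustrative (any Hub ID or local path works), and the broad try/except exists only so the sketch degrades gracefully when no backend package is installed or the model cannot be fetched:

```python
# Minimal usage sketch: optimum.pipelines.pipeline() mirrors
# transformers.pipeline() and adds an `accelerator` argument
# ("ort", "ov", or "ipex").
try:
    from optimum.pipelines import pipeline

    classifier = pipeline(
        task="text-classification",
        model="distilbert-base-uncased-finetuned-sst-2-english",  # illustrative
        accelerator="ort",  # ONNX Runtime backend
    )
    result = classifier("Optimum makes inference fast.")
except Exception as err:  # broad catch so the sketch degrades gracefully
    # ImportError: no backend package installed (see Step 1);
    # OSError and friends: the model could not be downloaded.
    result = None
    print(f"Pipeline unavailable: {err}")
```

Migrating existing code is typically a one-line change: swap the `transformers.pipeline` import for `optimum.pipelines.pipeline` and pass `accelerator`.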
Usage
Execute this workflow when you want to run model inference with hardware acceleration without writing backend-specific code. This is the primary entry point for users who want faster inference on their existing Hugging Face models by leveraging ONNX Runtime (GPU/CPU), OpenVINO (Intel hardware), or IPEX (Intel Extension for PyTorch). The pipeline handles model loading, preprocessing, inference, and post-processing.
Execution Steps
Step 1: Backend Detection
Determine which inference backend to use. If the accelerator parameter is explicitly provided ("ort", "ov", or "ipex"), that backend is selected. Otherwise, the system checks which optimum subpackages are installed and selects the first available backend in priority order: OpenVINO, ONNX Runtime, IPEX.
Key considerations:
- OpenVINO requires optimum-intel[openvino] to be installed
- ONNX Runtime requires optimum-onnx[onnxruntime] to be installed
- IPEX requires optimum-intel[ipex] to be installed
- If no backend is available, an ImportError is raised with installation instructions
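The detection fallback can be sketched with the standard library. The priority order and the error message mirror this document; the module names used as availability probes (the runtime each backend needs) and the `detect_accelerator` helper are assumptions of this sketch, not optimum's actual implementation:

```python
import importlib.util

# Priority order from the document: OpenVINO, then ONNX Runtime, then IPEX.
# The probe modules are an assumption: the runtime package each backend needs.
_PRIORITY = [
    ("ov", "openvino"),
    ("ort", "onnxruntime"),
    ("ipex", "intel_extension_for_pytorch"),
]


def _installed(module: str) -> bool:
    """True if `module` can be found without importing it."""
    return importlib.util.find_spec(module) is not None


def detect_accelerator(explicit=None, installed=_installed):
    """Pick a backend: an explicit choice wins, otherwise the first installed."""
    if explicit is not None:
        return explicit
    for accel, module in _PRIORITY:
        if installed(module):
            return accel
    raise ImportError(
        "No Optimum backend found. Install optimum-intel[openvino], "
        "optimum-onnx[onnxruntime], or optimum-intel[ipex]."
    )
```

The injectable `installed` predicate is a sketch convenience that makes the fallback logic easy to exercise without any backend present.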
Step 2: Pipeline Configuration
Configure the pipeline parameters including the task, model identifier, tokenizer/processor, device placement, and dtype settings. The parameters mirror the standard transformers pipeline API, allowing direct migration from existing code.
Key considerations:
- The task parameter determines which pipeline class is instantiated
- Model can be a Hub ID string, local path, or pre-loaded optimized model object
- Tokenizer, feature extractor, image processor, and processor can be auto-resolved or explicitly provided
- Device and device_map control hardware placement; they should not be used simultaneously
- trust_remote_code enables models with custom code on the Hub
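The parameters and the one constraint listed above can be captured in a small container. The dataclass is illustrative, not optimum's internal representation:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PipelineConfig:
    """Illustrative container for the pipeline parameters described above."""

    task: str                      # determines which pipeline class is used
    model: str                     # Hub ID, local path, or loaded model object
    accelerator: str = "ort"       # "ort", "ov", or "ipex"
    device: str | None = None      # e.g. "cuda:0"
    device_map: str | None = None  # e.g. "auto"
    trust_remote_code: bool = False

    def __post_init__(self) -> None:
        # device and device_map must not be used simultaneously.
        if self.device is not None and self.device_map is not None:
            raise ValueError("Use either `device` or `device_map`, not both.")
```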
Step 3: Backend-specific Pipeline Instantiation
Delegate to the backend-specific pipeline factory. For ONNX Runtime, this imports and calls optimum.onnxruntime.pipeline(). For OpenVINO or IPEX, this imports and calls optimum.intel.pipeline(). The backend-specific factory handles model conversion (if needed), optimization, and runtime initialization.
Key considerations:
- ONNX Runtime pipeline loads or converts models to ONNX format
- OpenVINO pipeline loads or converts models to OpenVINO IR format
- IPEX pipeline applies Intel-specific PyTorch optimizations
- All backends return a standard transformers Pipeline object
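The delegation table implied by this step can be sketched as a simple mapping from accelerator name to the backend factory named above; the resolver function is illustrative, not optimum's internal code:

```python
# Each accelerator maps to the dotted path of its backend pipeline factory.
_BACKEND_FACTORY = {
    "ort": "optimum.onnxruntime.pipeline",  # ONNX Runtime
    "ov": "optimum.intel.pipeline",         # OpenVINO
    "ipex": "optimum.intel.pipeline",       # Intel IPEX
}


def resolve_factory(accelerator: str) -> str:
    """Return the dotted path of the backend-specific pipeline factory."""
    try:
        return _BACKEND_FACTORY[accelerator]
    except KeyError:
        raise ValueError(
            f"Unknown accelerator {accelerator!r}; "
            f"expected one of {sorted(_BACKEND_FACTORY)}"
        ) from None
```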
Step 4: Model Loading and Optimization
The backend loads the model in its optimized format. If the model is only available as a PyTorch checkpoint, the backend may perform on-the-fly conversion. The model is placed on the specified device and any backend-specific optimizations (graph optimization, operator fusion, quantization) are applied.
Key considerations:
- Pre-exported models (ONNX, OpenVINO IR) are loaded directly for fastest startup
- On-the-fly conversion adds initial latency but provides a seamless experience
- Backend-specific model classes (ORTModel, OVModel) wrap the optimized model
- Model kwargs allow passing additional parameters to the model loading process
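A hedged sketch of this loading step using an ORT backend class: `export=True` requests on-the-fly ONNX conversion of a PyTorch checkpoint, while a repository that already contains an exported ONNX model loads directly. The model ID is illustrative, and the broad try/except only keeps the sketch from failing hard when the backend or network is unavailable:

```python
try:
    from optimum.onnxruntime import ORTModelForSequenceClassification

    # Convert the PyTorch checkpoint to ONNX on the fly; pre-exported
    # ONNX repositories load directly (fastest startup).
    model = ORTModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased-finetuned-sst-2-english",  # illustrative
        export=True,
    )
except Exception as err:  # broad catch so the sketch degrades gracefully
    model = None
    print(f"ORT backend unavailable: {err}")
```

A model object loaded this way can be passed directly as the `model` argument of the pipeline factory, skipping conversion at pipeline-construction time.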
Step 5: Inference Execution
Run inference through the pipeline using the standard call interface. The pipeline handles tokenization/preprocessing, feeds inputs to the optimized model, and applies post-processing to produce the final results. The accelerated backend provides the performance benefit transparently.
Key considerations:
- The pipeline call interface is identical to transformers pipelines
- Batch processing is supported for throughput optimization
- The backend handles device transfer and memory management internally
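The call interface, including opt-in batching via `batch_size`, can be sketched as below. As before, the model ID is illustrative and the try/except only lets the sketch degrade gracefully when no backend is installed:

```python
try:
    from optimum.pipelines import pipeline

    clf = pipeline(
        task="text-classification",
        model="distilbert-base-uncased-finetuned-sst-2-english",  # illustrative
        accelerator="ort",
    )
    # A list of inputs is processed as a batch; the call signature is the
    # same as a transformers pipeline.
    outputs = clf(
        ["Optimum makes inference fast.", "The call interface is unchanged."],
        batch_size=2,
    )
except Exception as err:  # broad catch so the sketch degrades gracefully
    outputs = None
    print(f"Inference unavailable: {err}")
```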