Workflow: Hugging Face Optimum Accelerated Inference Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Inference, Model_Optimization, MLOps |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
End-to-end process for running accelerated model inference using the Optimum pipeline API with hardware-specific backends (ONNX Runtime, OpenVINO, Intel IPEX).
Description
This workflow describes how to use the optimum.pipelines.pipeline() factory function to create an inference pipeline backed by an optimized runtime. The function mirrors the familiar transformers.pipeline() API but routes model loading and execution to a hardware-accelerated backend. It automatically detects which backend packages are installed (optimum-onnx for ONNX Runtime, optimum-intel for OpenVINO/IPEX) and delegates to the appropriate backend-specific pipeline implementation.
Key aspects:
- Drop-in replacement for transformers.pipeline() with an accelerator parameter
- Supports 30+ tasks (text classification, generation, QA, image classification, speech recognition, etc.)
- Three accelerator backends: ONNX Runtime ("ort"), OpenVINO ("ov"), Intel IPEX ("ipex")
- Automatic backend detection when accelerator is not explicitly specified
- Full compatibility with Hugging Face Hub model identifiers and local model paths
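The drop-in usage described above can be sketched as follows. The model ID is illustrative (any Hub ID or local path works), and the broad try/except exists only so the sketch degrades gracefully when no backend package is installed or the model cannot be fetched:

```python
# Minimal usage sketch: optimum.pipelines.pipeline() mirrors
# transformers.pipeline() and adds an `accelerator` argument
# ("ort", "ov", or "ipex").
try:
    from optimum.pipelines import pipeline

    classifier = pipeline(
        task="text-classification",
        model="distilbert-base-uncased-finetuned-sst-2-english",  # illustrative
        accelerator="ort",  # ONNX Runtime backend
    )
    result = classifier("Optimum makes inference fast.")
except Exception as err:  # broad catch so the sketch degrades gracefully
    # ImportError: no backend package installed (see Step 1);
    # OSError and friends: the model could not be downloaded.
    result = None
    print(f"Pipeline unavailable: {err}")
```

Migrating existing code is typically a one-line change: swap the `transformers.pipeline` import for `optimum.pipelines.pipeline` and pass `accelerator`.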
Usage
Execute this workflow when you want to run model inference with hardware acceleration without writing backend-specific code. This is the primary entry point for users who want faster inference on their existing Hugging Face models by leveraging ONNX Runtime (GPU/CPU), OpenVINO (Intel hardware), or IPEX (Intel Extension for PyTorch). The pipeline handles model loading, preprocessing, inference, and post-processing.
Execution Steps
Step 1: Backend Detection
Determine which inference backend to use. If the accelerator parameter is explicitly provided ("ort", "ov", or "ipex"), that backend is selected. Otherwise, the system checks which optimum subpackages are installed and selects the first available backend in priority order: OpenVINO, ONNX Runtime, IPEX.
Key considerations:
- OpenVINO requires optimum-intel[openvino] to be installed
- ONNX Runtime requires optimum-onnx[onnxruntime] to be installed
- IPEX requires optimum-intel[ipex] to be installed
- If no backend is available, an ImportError is raised with installation instructions
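The detection fallback can be sketched with the standard library. The priority order and the error message mirror this document; the module names used as availability probes (the runtime each backend needs) and the `detect_accelerator` helper are assumptions of this sketch, not optimum's actual implementation:

```python
import importlib.util

# Priority order from the document: OpenVINO, then ONNX Runtime, then IPEX.
# The probe modules are an assumption: the runtime package each backend needs.
_PRIORITY = [
    ("ov", "openvino"),
    ("ort", "onnxruntime"),
    ("ipex", "intel_extension_for_pytorch"),
]


def _installed(module: str) -> bool:
    """True if `module` can be found without importing it."""
    return importlib.util.find_spec(module) is not None


def detect_accelerator(explicit=None, installed=_installed):
    """Pick a backend: an explicit choice wins, otherwise the first installed."""
    if explicit is not None:
        return explicit
    for accel, module in _PRIORITY:
        if installed(module):
            return accel
    raise ImportError(
        "No Optimum backend found. Install optimum-intel[openvino], "
        "optimum-onnx[onnxruntime], or optimum-intel[ipex]."
    )
```

The injectable `installed` predicate is a sketch convenience that makes the fallback logic easy to exercise without any backend present.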
Step 2: Pipeline Configuration
Configure the pipeline parameters including the task, model identifier, tokenizer/processor, device placement, and dtype settings. The parameters mirror the standard transformers pipeline API, allowing direct migration from existing code.
Key considerations:
- The task parameter determines which pipeline class is instantiated
- Model can be a Hub ID string, local path, or pre-loaded optimized model object
- Tokenizer, feature extractor, image processor, and processor can be auto-resolved or explicitly provided
- Device and device_map control hardware placement; they should not be used simultaneously
- trust_remote_code enables models with custom code on the Hub
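The parameters and the one constraint listed above can be captured in a small container. The dataclass is illustrative, not optimum's internal representation:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PipelineConfig:
    """Illustrative container for the pipeline parameters described above."""

    task: str                      # determines which pipeline class is used
    model: str                     # Hub ID, local path, or loaded model object
    accelerator: str = "ort"       # "ort", "ov", or "ipex"
    device: str | None = None      # e.g. "cuda:0"
    device_map: str | None = None  # e.g. "auto"
    trust_remote_code: bool = False

    def __post_init__(self) -> None:
        # device and device_map must not be used simultaneously.
        if self.device is not None and self.device_map is not None:
            raise ValueError("Use either `device` or `device_map`, not both.")
```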
Step 3: Backend-specific Pipeline Instantiation
Delegate to the backend-specific pipeline factory. For ONNX Runtime, this imports and calls optimum.onnxruntime.pipeline(). For OpenVINO or IPEX, this imports and calls optimum.intel.pipeline(). The backend-specific factory handles model conversion (if needed), optimization, and runtime initialization.
Key considerations:
- ONNX Runtime pipeline loads or converts models to ONNX format
- OpenVINO pipeline loads or converts models to OpenVINO IR format
- IPEX pipeline applies Intel-specific PyTorch optimizations
- All backends return a standard transformers Pipeline object
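The delegation table implied by this step can be sketched as a simple mapping from accelerator name to the backend factory named above; the resolver function is illustrative, not optimum's internal code:

```python
# Each accelerator maps to the dotted path of its backend pipeline factory.
_BACKEND_FACTORY = {
    "ort": "optimum.onnxruntime.pipeline",  # ONNX Runtime
    "ov": "optimum.intel.pipeline",         # OpenVINO
    "ipex": "optimum.intel.pipeline",       # Intel IPEX
}


def resolve_factory(accelerator: str) -> str:
    """Return the dotted path of the backend-specific pipeline factory."""
    try:
        return _BACKEND_FACTORY[accelerator]
    except KeyError:
        raise ValueError(
            f"Unknown accelerator {accelerator!r}; "
            f"expected one of {sorted(_BACKEND_FACTORY)}"
        ) from None
```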
Step 4: Model Loading and Optimization
The backend loads the model in its optimized format. If the model is only available as a PyTorch checkpoint, the backend may perform on-the-fly conversion. The model is placed on the specified device and any backend-specific optimizations (graph optimization, operator fusion, quantization) are applied.
Key considerations:
- Pre-exported models (ONNX, OpenVINO IR) are loaded directly for fastest startup
- On-the-fly conversion adds initial latency but provides a seamless experience
- Backend-specific model classes (ORTModel, OVModel) wrap the optimized model
- Model kwargs allow passing additional parameters to the model loading process
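A hedged sketch of this loading step using an ORT backend class: `export=True` requests on-the-fly ONNX conversion of a PyTorch checkpoint, while a repository that already contains an exported ONNX model loads directly. The model ID is illustrative, and the broad try/except only keeps the sketch from failing hard when the backend or network is unavailable:

```python
try:
    from optimum.onnxruntime import ORTModelForSequenceClassification

    # Convert the PyTorch checkpoint to ONNX on the fly; pre-exported
    # ONNX repositories load directly (fastest startup).
    model = ORTModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased-finetuned-sst-2-english",  # illustrative
        export=True,
    )
except Exception as err:  # broad catch so the sketch degrades gracefully
    model = None
    print(f"ORT backend unavailable: {err}")
```

A model object loaded this way can be passed directly as the `model` argument of the pipeline factory, skipping conversion at pipeline-construction time.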
Step 5: Inference Execution
Run inference through the pipeline using the standard call interface. The pipeline handles tokenization/preprocessing, feeds inputs to the optimized model, and applies post-processing to produce the final results. The accelerated backend provides the performance benefit transparently.
Key considerations:
- The pipeline call interface is identical to transformers pipelines
- Batch processing is supported for throughput optimization
- The backend handles device transfer and memory management internally
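The call interface, including opt-in batching via `batch_size`, can be sketched as below. As before, the model ID is illustrative and the try/except only lets the sketch degrade gracefully when no backend is installed:

```python
try:
    from optimum.pipelines import pipeline

    clf = pipeline(
        task="text-classification",
        model="distilbert-base-uncased-finetuned-sst-2-english",  # illustrative
        accelerator="ort",
    )
    # A list of inputs is processed as a batch; the call signature is the
    # same as a transformers pipeline.
    outputs = clf(
        ["Optimum makes inference fast.", "The call interface is unchanged."],
        batch_size=2,
    )
except Exception as err:  # broad catch so the sketch degrades gracefully
    outputs = None
    print(f"Inference unavailable: {err}")
```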