Principle: HuggingFace Optimum Pipeline Inference Execution
Overview
Standardized three-phase inference lifecycle (preprocess, forward, postprocess) for executing model predictions through accelerated backends.
Description
The inference execution follows HuggingFace transformers' Pipeline interface, which defines a three-phase lifecycle:
- preprocess: Converts raw inputs (text, images, audio) into model-ready tensors
- _forward: Runs the model inference through the accelerated backend
- postprocess: Converts raw model outputs into user-friendly predictions (labels, scores, spans, etc.)
This standardized interface means users interact with the same API regardless of the underlying backend. A pipeline created with accelerator="ort" is called the same way as one created with accelerator="ov" or accelerator="ipex".
Three-Phase Lifecycle
| Phase | Method | Input | Output | Description |
|---|---|---|---|---|
| 1. Preprocess | `preprocess(inputs)` | Raw user inputs (strings, images, audio arrays) | `model_inputs` (tensors, attention masks, etc.) | Tokenizes text, resizes images, extracts features, creates attention masks. Typically inherited from the transformers pipeline class. |
| 2. Forward | `_forward(model_inputs)` | Model-ready tensors | `model_outputs` (logits, hidden states, etc.) | Runs the actual model inference; this is the phase where the accelerated backend is invoked. Backend-specific pipelines may override this to use optimized inference paths. |
| 3. Postprocess | `postprocess(model_outputs)` | Raw model outputs | User-friendly predictions (dicts with labels, scores, etc.) | Applies softmax, decodes tokens, maps label IDs to names, formats results. Typically inherited from the transformers pipeline class. |
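The phase inputs and outputs in the table above can be illustrated with a minimal, framework-free sketch. All names here (`ToySentimentPipeline`, its vocabulary and "model") are invented stand-ins, not actual transformers or Optimum classes; only the preprocess → `_forward` → postprocess chaining mirrors the real lifecycle:

```python
import math

class ToySentimentPipeline:
    """Illustrative stand-in for the preprocess -> _forward -> postprocess flow."""

    VOCAB = {"wonderful": 2.0, "great": 1.5, "terrible": -2.0}  # toy word weights
    LABELS = ["NEGATIVE", "POSITIVE"]

    def preprocess(self, text):
        # Phase 1: raw string -> model-ready features (here, crude token scores)
        tokens = text.lower().strip("!.").split()
        return {"scores": [self.VOCAB.get(t, 0.0) for t in tokens]}

    def _forward(self, model_inputs):
        # Phase 2: model inference (here, a trivial linear "model" producing logits)
        s = sum(model_inputs["scores"])
        return {"logits": [-s, s]}  # [negative_logit, positive_logit]

    def postprocess(self, model_outputs):
        # Phase 3: logits -> user-friendly label/score via softmax
        exps = [math.exp(x) for x in model_outputs["logits"]]
        probs = [e / sum(exps) for e in exps]
        best = max(range(len(probs)), key=probs.__getitem__)
        return {"label": self.LABELS[best], "score": round(probs[best], 4)}

    def __call__(self, text):
        # Fixed skeleton: the same chaining Pipeline.__call__ performs
        return self.postprocess(self._forward(self.preprocess(text)))

pipe = ToySentimentPipeline()
result = pipe("This movie was absolutely wonderful!")
# result["label"] is "POSITIVE"
```

Each method consumes exactly what the previous phase produced, which is why backends can swap out `_forward` without touching the surrounding phases.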
Usage
Use when executing inference through any Optimum-accelerated pipeline. The call interface is identical to transformers.Pipeline:
```python
from optimum.pipelines import pipeline

# Create an accelerated pipeline (ONNX Runtime backend)
pipe = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    accelerator="ort",
)

# Execute inference -- triggers preprocess -> _forward -> postprocess
result = pipe("This movie was absolutely wonderful!")
# Output: [{'label': 'POSITIVE', 'score': 0.9998}]

# Batch inference
results = pipe(["Great film!", "Terrible movie."])
# Output: [{'label': 'POSITIVE', 'score': 0.9997}, {'label': 'NEGATIVE', 'score': 0.9994}]
```
Theoretical Basis
Template Method pattern from transformers.Pipeline. The three phases (preprocess, _forward, postprocess) form a fixed algorithm skeleton defined in the base Pipeline.__call__ method. Each backend may override _forward to use its optimized inference path, while pre/post processing typically remains the same as the transformers implementation.
This design provides:
- Consistency: All pipelines, regardless of backend, follow the same execution flow
- Extensibility: Backends only need to override the inference step, reusing the well-tested preprocessing and postprocessing logic from transformers
- Compatibility: The returned pipeline is a standard `transformers.Pipeline` instance, so it works with all existing transformers pipeline utilities (batching, streaming, device placement, etc.)
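The Template Method structure described above can be sketched in a few lines. This is a hedged, simplified model, not the real `transformers.Pipeline` source: the class names and toy phase bodies are invented, but the fixed `__call__` skeleton and the single `_forward` override point match the pattern the document describes:

```python
class BasePipeline:
    """Fixed algorithm skeleton: __call__ always runs the three phases in order."""

    def __call__(self, inputs):
        model_inputs = self.preprocess(inputs)
        model_outputs = self._forward(model_inputs)
        return self.postprocess(model_outputs)

    def preprocess(self, inputs):
        return {"input_ids": [len(w) for w in inputs.split()]}  # toy tokenization

    def _forward(self, model_inputs):
        return {"logits": sum(model_inputs["input_ids"])}  # default (eager) path

    def postprocess(self, model_outputs):
        return {"score": model_outputs["logits"]}


class ORTStylePipeline(BasePipeline):
    """Backend subclass: overrides only the inference step, inherits pre/post."""

    def _forward(self, model_inputs):
        # A real backend would delegate to ONNX Runtime here; the toy version
        # computes the same logits and merely tags which path ran.
        return {"logits": sum(model_inputs["input_ids"]), "backend": "ort"}


eager = BasePipeline()("hello world")
accel = ORTStylePipeline()("hello world")
# Both calls follow the identical preprocess -> _forward -> postprocess flow
```

Because `__call__` is defined once on the base class, every backend subclass is invoked exactly the same way by the user.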
Execution Flow
```
User calls: pipe("Some input text")
        |
        v
Pipeline.__call__(inputs)
        |
        +---> 1. preprocess("Some input text")
        |        --> {"input_ids": tensor(...), "attention_mask": tensor(...)}
        |
        +---> 2. _forward({"input_ids": ..., "attention_mask": ...})
        |        --> {"logits": tensor(...)}
        |        (This step uses the accelerated backend: ORT, OpenVINO, or IPEX)
        |
        +---> 3. postprocess({"logits": tensor(...)})
        |        --> [{"label": "POSITIVE", "score": 0.9998}]
        |
        v
Returns predictions to user
```
Backend Override Points
| Backend | Typical Override | Description |
|---|---|---|
| ONNX Runtime | `_forward` | Runs inference via an ONNX Runtime `InferenceSession` instead of PyTorch |
| OpenVINO | `_forward` | Runs inference via OpenVINO's compiled model inference engine |
| IPEX | `_forward` | Runs inference via Intel Extension for PyTorch optimized execution |
| All backends | `preprocess`, `postprocess` | Usually not overridden -- inherited from the task-specific transformers pipeline class |
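An ONNX Runtime-style override might look roughly like the sketch below. `FakeSession` is an explicit stand-in for `onnxruntime.InferenceSession` (it mimics the `run(output_names, input_feed)` call shape so the example stays self-contained), and `ORTTextPipeline` with its toy encoding is invented for illustration:

```python
class FakeSession:
    """Stand-in for onnxruntime.InferenceSession (same run() call shape)."""

    def run(self, output_names, input_feed):
        # A real session would execute the exported ONNX graph; here we just
        # sum the toy token ids to produce a single "logit".
        return [sum(input_feed["input_ids"])]


class ORTTextPipeline:
    def __init__(self, session):
        self.session = session

    def preprocess(self, text):
        return {"input_ids": [ord(c) % 10 for c in text]}  # toy encoding

    def _forward(self, model_inputs):
        # The backend override point: delegate inference to the session
        # instead of calling a PyTorch model.
        (logits,) = self.session.run(None, model_inputs)
        return {"logits": logits}

    def postprocess(self, model_outputs):
        return {"label": "POSITIVE" if model_outputs["logits"] > 0 else "NEGATIVE"}

    def __call__(self, text):
        return self.postprocess(self._forward(self.preprocess(text)))


out = ORTTextPipeline(FakeSession())("ok")
```

Note that `preprocess` and `postprocess` contain no backend-specific code at all, which is exactly why the real pipelines can inherit them unchanged from transformers.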
Related
- implemented_by → Implementation:Huggingface_Optimum_Pipeline_Call