Implementation: Hugging Face Optimum Pipeline Call
Overview
Wrapper Doc -- This page documents the Pipeline inference execution interface. The Pipeline class itself comes from the transformers library. Optimum's pipeline() function (in optimum/pipelines/__init__.py) returns a standard transformers.Pipeline instance backed by an optimized model.
Source
Primary: External -- transformers.Pipeline (from the transformers library)
Dispatch origin: optimum/pipelines/__init__.py
Repository: optimum
API
Pipeline.__call__
Pipeline.__call__(inputs, **kwargs) -> Any
Description: The main entry point for inference. Accepts raw inputs and returns processed predictions. Internally orchestrates the three-phase lifecycle.
Three-Phase Lifecycle
The Optimum pipeline() returns a standard transformers.Pipeline backed by an optimized model. The call interface follows the Template Method pattern used by transformers:
Phase 1: preprocess
```python
def preprocess(self, inputs, **kwargs) -> dict:
    """Convert raw inputs into model-ready tensors."""
    # Task-specific: tokenization, image processing, feature extraction
    # Typically inherited from the transformers pipeline class
    ...
```
Responsibility: Converts raw user inputs (strings, images, audio arrays, etc.) into model-ready tensor dictionaries containing input_ids, attention_mask, pixel_values, or other model-specific inputs.
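As an illustration of this phase for the text case, here is a minimal stand-in: the whitespace tokenizer and the tiny vocabulary are hypothetical simplifications of what a real transformers tokenizer does (subword splitting, special tokens, etc.).

```python
def preprocess(text: str, vocab: dict, max_length: int = 8) -> dict:
    """Toy preprocess: map words to ids, then pad to a fixed length."""
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in text.lower().split()]
    ids = ids[:max_length]
    attention_mask = [1] * len(ids) + [0] * (max_length - len(ids))
    input_ids = ids + [vocab["[PAD]"]] * (max_length - len(ids))
    return {"input_ids": input_ids, "attention_mask": attention_mask}

vocab = {"[PAD]": 0, "[UNK]": 1, "this": 2, "is": 3, "a": 4, "great": 5, "movie!": 6}
batch = preprocess("This is a great movie!", vocab)
# batch["input_ids"] == [2, 3, 4, 5, 6, 0, 0, 0]
```

The output dictionary has exactly the shape the next phase expects: named tensors keyed by model input names.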
Phase 2: _forward
```python
def _forward(self, model_inputs, **kwargs) -> dict:
    """Run model inference through the accelerated backend."""
    # This is where the accelerated backend is invoked
    # Backend-specific pipelines override this method
    ...
```
Responsibility: Runs the actual model inference. This is the phase where the acceleration happens -- the optimized model (ORTModel, OVModel, or IPEX-optimized model) executes the forward pass using its respective runtime.
Phase 3: postprocess
```python
def postprocess(self, model_outputs, **kwargs) -> Any:
    """Convert raw model outputs into user-friendly predictions."""
    # Task-specific: softmax, decoding, label mapping
    # Typically inherited from the transformers pipeline class
    ...
```
Responsibility: Converts raw model outputs (logits, hidden states) into user-friendly prediction formats (dictionaries with labels, scores, spans, generated text, etc.).
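A minimal sketch of this phase for text classification; the id2label mapping and the logits are illustrative, not taken from a real checkpoint:

```python
import math

def postprocess(model_outputs: dict, id2label: dict) -> dict:
    """Toy postprocess: softmax over logits, map the top id to a label."""
    logits = model_outputs["logits"]
    exps = [math.exp(x - max(logits)) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
pred = postprocess({"logits": [-1.2, 3.4]}, id2label)
# pred["label"] == "POSITIVE"
```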
Usage Example
```python
from optimum.pipelines import pipeline

# Create an accelerated pipeline (returns a transformers.Pipeline instance)
pipe = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    accelerator="ort",
)

# Single inference
result = pipe("This is a great movie!")
# Returns: [{'label': 'POSITIVE', 'score': 0.9998}]

# Batch inference (the SST-2 model is binary, so every input maps to
# POSITIVE or NEGATIVE; scores are illustrative)
results = pipe(["Great film!", "Terrible movie.", "It was okay."])
# Returns: [
#     {'label': 'POSITIVE', 'score': 0.9997},
#     {'label': 'NEGATIVE', 'score': 0.9994},
#     {'label': 'POSITIVE', 'score': 0.8412},
# ]

# Additional pipeline parameters are routed to the matching phase
# (truncation -> preprocess, top_k -> postprocess)
result = pipe("Some text", top_k=2, truncation=True)
```
Execution Flow
| Step | Method | Location | Description |
|---|---|---|---|
| 1 | `Pipeline.__call__(inputs)` | transformers | Entry point. Handles batching, chunking, and orchestrates the three phases. |
| 2 | `preprocess(inputs)` | transformers (task-specific) | Converts raw inputs to tensors, e.g. `TextClassificationPipeline.preprocess` calls the tokenizer. |
| 3 | `_forward(model_inputs)` | Backend-specific override | Runs inference through the accelerated model. `OptimizedModel.__call__` delegates to `forward()`. |
| 4 | `postprocess(model_outputs)` | transformers (task-specific) | Converts model outputs to user-friendly format, e.g. applies softmax and maps label IDs to names. |
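The flow above can be sketched as a minimal template method. Everything here (the class name, the length-based "tokenizer", the fake logit) is a stand-in for illustration, not the actual transformers implementation:

```python
class ToyPipeline:
    """Minimal template-method sketch of Pipeline.__call__ orchestration."""

    def __call__(self, inputs, **kwargs):
        model_inputs = self.preprocess(inputs, **kwargs)   # Phase 1
        model_outputs = self._forward(model_inputs)        # Phase 2
        return self.postprocess(model_outputs)             # Phase 3

    def preprocess(self, inputs, **kwargs):
        # Fake "tokenization": one id per whitespace token
        return {"input_ids": [len(tok) for tok in inputs.split()]}

    def _forward(self, model_inputs):
        # Backend-specific subclasses override this to invoke the
        # accelerated model; here we fabricate a single logit.
        return {"logits": [sum(model_inputs["input_ids"])]}

    def postprocess(self, model_outputs):
        return {"label": "POSITIVE" if model_outputs["logits"][0] > 10 else "NEGATIVE"}

pipe = ToyPipeline()
result = pipe("a great movie")
# result == {"label": "POSITIVE"}
```

The point of the pattern is that `__call__` never changes: backends only swap out the `_forward` step, while `preprocess` and `postprocess` stay task-specific.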
Backend Integration
The key integration point between Optimum and transformers is at the _forward phase. When the pipeline calls _forward, it invokes the model's __call__ method, which for OptimizedModel subclasses (defined in optimum/modeling_base.py at L108-109) delegates to the abstract forward() method:
```python
# In OptimizedModel (optimum/modeling_base.py L108-109)
def __call__(self, *args, **kwargs):
    return self.forward(*args, **kwargs)
```
Each backend subclass implements forward() to use its optimized runtime:
| Backend | `forward()` Implementation | Runtime Used |
|---|---|---|
| ONNX Runtime | `ORTModel.forward()` | `onnxruntime.InferenceSession.run()` |
| OpenVINO | `OVModel.forward()` | OpenVINO compiled model inference |
| IPEX | `IPEXModel.forward()` | Intel Extension for PyTorch optimized execution |
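The subclass pattern can be sketched as follows. `DummyRuntime` and `DummyORTModel` are illustrative stand-ins (a real `ORTModel` wraps an `onnxruntime.InferenceSession` and converts between torch tensors and the runtime's input format); only the `__call__`-to-`forward()` delegation mirrors the real base class.

```python
class OptimizedModel:
    """Base class: __call__ delegates to the backend-specific forward()."""

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError

class DummyRuntime:
    """Stand-in for a real runtime session (e.g. onnxruntime.InferenceSession)."""

    def run(self, inputs):
        # Fabricated logits, one row per batch element
        return {"logits": [[0.1, 0.9]] * len(inputs["input_ids"])}

class DummyORTModel(OptimizedModel):
    def __init__(self, session):
        self.session = session

    def forward(self, **inputs):
        # A real backend converts tensors to the runtime's format,
        # calls the runtime, and wraps the result back up.
        return self.session.run(inputs)

model = DummyORTModel(DummyRuntime())
out = model(input_ids=[[1, 2, 3]], attention_mask=[[1, 1, 1]])
# out["logits"] == [[0.1, 0.9]]
```

Because the pipeline only ever calls the model (never the runtime directly), swapping backends requires no change to the pipeline code itself.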