Implementation:Huggingface Optimum Pipeline Call

From Leeroopedia

Overview

Wrapper Doc -- This page documents the Pipeline inference execution interface. The Pipeline class itself comes from the transformers library. Optimum's pipeline() function (in optimum/pipelines/__init__.py) returns a standard transformers.Pipeline instance backed by an optimized model.

Source

Primary: External -- transformers.Pipeline (from the transformers library)

Dispatch origin: optimum/pipelines/__init__.py

Repository: optimum

API

Pipeline.__call__

Pipeline.__call__(inputs, **kwargs) -> Any

Description: The main entry point for inference. Accepts raw inputs and returns processed predictions. Internally orchestrates the three-phase lifecycle.

Three-Phase Lifecycle

The call interface follows the transformers Template Method pattern: Pipeline.__call__ fixes the order of three overridable phases.
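The pattern can be sketched with a toy pipeline (the class and its phase bodies below are illustrative, not the actual transformers implementation):

```python
class ToyPipeline:
    """Illustrative sketch of the Template Method pattern used by
    transformers pipelines: __call__ fixes the phase order; subclasses
    override the individual phases."""

    def __call__(self, inputs, **kwargs):
        model_inputs = self.preprocess(inputs)       # raw input -> tensors
        model_outputs = self._forward(model_inputs)  # accelerated inference
        return self.postprocess(model_outputs)       # tensors -> predictions

    def preprocess(self, inputs):
        return {"tokens": inputs.lower().split()}

    def _forward(self, model_inputs):
        # Stand-in for the optimized model's forward pass
        return {"logits": [len(t) for t in model_inputs["tokens"]]}

    def postprocess(self, model_outputs):
        return {"score": max(model_outputs["logits"])}

print(ToyPipeline()("Hello Optimum"))  # {'score': 7}
```

Because __call__ owns the orchestration, a backend only needs to override the phase it cares about (typically _forward) to plug in an accelerated runtime.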

Phase 1: preprocess

def preprocess(self, inputs, **kwargs) -> dict:
    """Convert raw inputs into model-ready tensors."""
    # Task-specific: tokenization, image processing, feature extraction
    # Typically inherited from the transformers pipeline class
    ...

Responsibility: Converts raw user inputs (strings, images, audio arrays, etc.) into model-ready tensor dictionaries containing input_ids, attention_mask, pixel_values, or other model-specific inputs.
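As an illustration, a text task's preprocess phase might look like the following sketch. The vocabulary and padding scheme here are invented for the example; real pipelines delegate this work to the model's tokenizer.

```python
# Hypothetical toy vocabulary (a real tokenizer ships its own)
VOCAB = {"[PAD]": 0, "[UNK]": 1, "great": 2, "movie": 3, "terrible": 4}

def toy_preprocess(text: str, max_length: int = 4) -> dict:
    """Map a raw string to a model-ready dict of input_ids and
    attention_mask, truncated/padded to a fixed length."""
    ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in text.lower().split()]
    ids = ids[:max_length]
    mask = [1] * len(ids)
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [VOCAB["[PAD]"]] * pad,
        "attention_mask": mask + [0] * pad,
    }

print(toy_preprocess("Great movie"))
# {'input_ids': [2, 3, 0, 0], 'attention_mask': [1, 1, 0, 0]}
```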

Phase 2: _forward

def _forward(self, model_inputs, **kwargs) -> dict:
    """Run model inference through the accelerated backend."""
    # This is where the accelerated backend is invoked
    # Backend-specific pipelines override this method
    ...

Responsibility: Runs the actual model inference. This is the phase where the acceleration happens -- the optimized model (ORTModel, OVModel, or IPEX-optimized model) executes the forward pass using its respective runtime.
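A minimal sketch of this phase, with a stand-in for the backend model (the class and its scoring rule are invented; the real ORTModel/OVModel forward passes run on their respective runtimes):

```python
class ToyBackendModel:
    """Stand-in for an optimized model (e.g. an ORTModel-like object)
    whose forward pass runs on an accelerated runtime."""
    def forward(self, input_ids, attention_mask):
        # Illustrative math: only attended positions contribute
        s = float(sum(i for i, m in zip(input_ids, attention_mask) if m))
        return {"logits": [-s, s]}

def toy_forward(model, model_inputs: dict) -> dict:
    """_forward sketch: hand the preprocessed tensors to the backend model."""
    return model.forward(**model_inputs)

out = toy_forward(ToyBackendModel(),
                  {"input_ids": [2, 3, 0, 0], "attention_mask": [1, 1, 0, 0]})
print(out)  # {'logits': [-5.0, 5.0]}
```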

Phase 3: postprocess

def postprocess(self, model_outputs, **kwargs) -> Any:
    """Convert raw model outputs into user-friendly predictions."""
    # Task-specific: softmax, decoding, label mapping
    # Typically inherited from the transformers pipeline class
    ...

Responsibility: Converts raw model outputs (logits, hidden states) into user-friendly prediction formats (dictionaries with labels, scores, spans, generated text, etc.).
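For a classification task, this typically means a softmax over the logits followed by an id-to-label lookup. A minimal sketch (the label map is illustrative; real pipelines read it from the model config):

```python
import math

ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}  # illustrative label map

def toy_postprocess(model_outputs: dict) -> dict:
    """Softmax over logits, then report the best label and its score."""
    logits = model_outputs["logits"]
    exps = [math.exp(x - max(logits)) for x in logits]  # numerically stable
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": round(probs[best], 4)}

print(toy_postprocess({"logits": [-5.0, 5.0]}))
# {'label': 'POSITIVE', 'score': 1.0}
```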

Usage Example

from optimum.pipelines import pipeline

# Create accelerated pipeline (returns a transformers.Pipeline instance)
pipe = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english", accelerator="ort")

# Single inference
result = pipe("This is a great movie!")
# Returns: [{'label': 'POSITIVE', 'score': 0.9998}]

# Batch inference
results = pipe(["Great film!", "Terrible movie.", "It was okay."])
# Returns one {'label': ..., 'score': ...} dict per input, e.g.:
# [
#     {'label': 'POSITIVE', 'score': 0.9997},
#     {'label': 'NEGATIVE', 'score': 0.9994},
#     {'label': 'POSITIVE', 'score': 0.6234},
# ]
# (scores are illustrative; SST-2 checkpoints emit only POSITIVE/NEGATIVE)

# With additional pipeline parameters
result = pipe("Some text", top_k=3, truncation=True)

Execution Flow

Step 1 -- Pipeline.__call__(inputs) [transformers]: Entry point. Handles batching, chunking, and orchestrates the three phases.
Step 2 -- preprocess(inputs) [transformers, task-specific]: Converts raw inputs to tensors, e.g. TextClassificationPipeline.preprocess calls the tokenizer.
Step 3 -- _forward(model_inputs) [backend-specific override]: Runs inference through the accelerated model; OptimizedModel.__call__ delegates to forward().
Step 4 -- postprocess(model_outputs) [transformers, task-specific]: Converts model outputs to a user-friendly format, e.g. applies softmax and maps label IDs to names.

Backend Integration

The key integration point between Optimum and transformers is at the _forward phase. When the pipeline calls _forward, it invokes the model's __call__ method, which for OptimizedModel subclasses (defined in optimum/modeling_base.py at L108-109) delegates to the abstract forward() method:

# In OptimizedModel (optimum/modeling_base.py L108-109)
def __call__(self, *args, **kwargs):
    return self.forward(*args, **kwargs)
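This delegation can be mimicked in a few lines (ToyOptimizedModel and ToyORTModel below are hypothetical stand-ins, not the real optimum classes):

```python
from abc import ABC, abstractmethod

class ToyOptimizedModel(ABC):
    """Mimics OptimizedModel: __call__ simply delegates to forward()."""
    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

    @abstractmethod
    def forward(self, *args, **kwargs): ...

class ToyORTModel(ToyOptimizedModel):
    """A backend subclass supplies the runtime-specific forward()."""
    def forward(self, *args, **kwargs):
        return {"backend": "onnxruntime", "inputs": kwargs}

out = ToyORTModel()(input_ids=[1, 2, 3])
print(out["backend"])  # onnxruntime
```

Because the base class delegates unconditionally, any transformers pipeline that calls the model like a function transparently runs the optimized forward pass.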

Each backend subclass implements forward() to use its optimized runtime:

ONNX Runtime -- ORTModel.forward(), executed via onnxruntime.InferenceSession.run()
OpenVINO -- OVModel.forward(), executed via OpenVINO compiled model inference
IPEX -- IPEXModel.forward(), executed via Intel Extension for PyTorch optimized execution
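The accelerator-to-backend selection can be pictured as a simple registry lookup. The mapping and classes below are illustrative only; optimum's actual dispatch in optimum/pipelines/__init__.py is more involved.

```python
# Hypothetical stand-ins for the backend model classes
class ToyORT:  backend = "onnxruntime"
class ToyOV:   backend = "openvino"
class ToyIPEX: backend = "ipex"

ACCELERATOR_REGISTRY = {"ort": ToyORT, "openvino": ToyOV, "ipex": ToyIPEX}

def pick_backend(accelerator: str):
    """Resolve an accelerator name to a model class, as a pipeline
    factory must do before wrapping the model in a transformers Pipeline."""
    try:
        return ACCELERATOR_REGISTRY[accelerator]
    except KeyError:
        raise ValueError(f"Unsupported accelerator: {accelerator!r}") from None

print(pick_backend("ort").backend)  # onnxruntime
```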
