Implementation:Huggingface Optimum Pipeline Call

From Leeroopedia

Overview

Wrapper Doc -- This page documents the Pipeline inference execution interface. The Pipeline class itself comes from the transformers library. Optimum's pipeline() function (in optimum/pipelines/__init__.py) returns a standard transformers.Pipeline instance backed by an optimized model.

Source

Primary: External -- transformers.Pipeline (from the transformers library)

Dispatch origin: optimum/pipelines/__init__.py

Repository: optimum

API

Pipeline.__call__

Pipeline.__call__(inputs, **kwargs) -> Any

Description: The main entry point for inference. Accepts raw inputs and returns processed predictions. Internally orchestrates the three-phase lifecycle.

Three-Phase Lifecycle

The call interface follows the transformers Template Method pattern: Pipeline.__call__ fixes the order of three overridable phases.
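The pattern can be sketched with a toy pipeline (the class and its phase bodies below are illustrative, not the actual transformers implementation):

```python
class ToyPipeline:
    """Illustrative sketch of the Template Method pattern used by
    transformers pipelines: __call__ fixes the phase order; subclasses
    override the individual phases."""

    def __call__(self, inputs, **kwargs):
        model_inputs = self.preprocess(inputs)       # raw input -> tensors
        model_outputs = self._forward(model_inputs)  # accelerated inference
        return self.postprocess(model_outputs)       # tensors -> predictions

    def preprocess(self, inputs):
        return {"tokens": inputs.lower().split()}

    def _forward(self, model_inputs):
        # Stand-in for the optimized model's forward pass
        return {"logits": [len(t) for t in model_inputs["tokens"]]}

    def postprocess(self, model_outputs):
        return {"score": max(model_outputs["logits"])}

print(ToyPipeline()("Hello Optimum"))  # {'score': 7}
```

Because __call__ owns the orchestration, a backend only needs to override the phase it cares about (typically _forward) to plug in an accelerated runtime.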

Phase 1: preprocess

def preprocess(self, inputs, **kwargs) -> dict:
    """Convert raw inputs into model-ready tensors."""
    # Task-specific: tokenization, image processing, feature extraction
    # Typically inherited from the transformers pipeline class
    ...

Responsibility: Converts raw user inputs (strings, images, audio arrays, etc.) into model-ready tensor dictionaries containing input_ids, attention_mask, pixel_values, or other model-specific inputs.
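As an illustration, a text task's preprocess phase might look like the following sketch. The vocabulary and padding scheme here are invented for the example; real pipelines delegate this work to the model's tokenizer.

```python
# Hypothetical toy vocabulary (a real tokenizer ships its own)
VOCAB = {"[PAD]": 0, "[UNK]": 1, "great": 2, "movie": 3, "terrible": 4}

def toy_preprocess(text: str, max_length: int = 4) -> dict:
    """Map a raw string to a model-ready dict of input_ids and
    attention_mask, truncated/padded to a fixed length."""
    ids = [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in text.lower().split()]
    ids = ids[:max_length]
    mask = [1] * len(ids)
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [VOCAB["[PAD]"]] * pad,
        "attention_mask": mask + [0] * pad,
    }

print(toy_preprocess("Great movie"))
# {'input_ids': [2, 3, 0, 0], 'attention_mask': [1, 1, 0, 0]}
```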

Phase 2: _forward

def _forward(self, model_inputs, **kwargs) -> dict:
    """Run model inference through the accelerated backend."""
    # This is where the accelerated backend is invoked
    # Backend-specific pipelines override this method
    ...

Responsibility: Runs the actual model inference. This is the phase where the acceleration happens -- the optimized model (ORTModel, OVModel, or IPEX-optimized model) executes the forward pass using its respective runtime.
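A minimal sketch of this phase, with a stand-in for the backend model (the class and its scoring rule are invented; the real ORTModel/OVModel forward passes run on their respective runtimes):

```python
class ToyBackendModel:
    """Stand-in for an optimized model (e.g. an ORTModel-like object)
    whose forward pass runs on an accelerated runtime."""
    def forward(self, input_ids, attention_mask):
        # Illustrative math: only attended positions contribute
        s = float(sum(i for i, m in zip(input_ids, attention_mask) if m))
        return {"logits": [-s, s]}

def toy_forward(model, model_inputs: dict) -> dict:
    """_forward sketch: hand the preprocessed tensors to the backend model."""
    return model.forward(**model_inputs)

out = toy_forward(ToyBackendModel(),
                  {"input_ids": [2, 3, 0, 0], "attention_mask": [1, 1, 0, 0]})
print(out)  # {'logits': [-5.0, 5.0]}
```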

Phase 3: postprocess

def postprocess(self, model_outputs, **kwargs) -> Any:
    """Convert raw model outputs into user-friendly predictions."""
    # Task-specific: softmax, decoding, label mapping
    # Typically inherited from the transformers pipeline class
    ...

Responsibility: Converts raw model outputs (logits, hidden states) into user-friendly prediction formats (dictionaries with labels, scores, spans, generated text, etc.).
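For a classification task, this typically means a softmax over the logits followed by an id-to-label lookup. A minimal sketch (the label map is illustrative; real pipelines read it from the model config):

```python
import math

ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}  # illustrative label map

def toy_postprocess(model_outputs: dict) -> dict:
    """Softmax over logits, then report the best label and its score."""
    logits = model_outputs["logits"]
    exps = [math.exp(x - max(logits)) for x in logits]  # numerically stable
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": round(probs[best], 4)}

print(toy_postprocess({"logits": [-5.0, 5.0]}))
# {'label': 'POSITIVE', 'score': 1.0}
```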

Usage Example

from optimum.pipelines import pipeline

# Create accelerated pipeline (returns a transformers.Pipeline instance)
pipe = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english", accelerator="ort")

# Single inference
result = pipe("This is a great movie!")
# Returns: [{'label': 'POSITIVE', 'score': 0.9998}]

# Batch inference
results = pipe(["Great film!", "Terrible movie.", "It was okay."])
# Returns one {'label': ..., 'score': ...} dict per input, e.g.:
# [
#     {'label': 'POSITIVE', 'score': 0.9997},
#     {'label': 'NEGATIVE', 'score': 0.9994},
#     {'label': 'POSITIVE', 'score': 0.6234},
# ]
# (scores are illustrative; SST-2 checkpoints emit only POSITIVE/NEGATIVE)

# With additional pipeline parameters
result = pipe("Some text", top_k=3, truncation=True)

Execution Flow

Step 1 -- Pipeline.__call__(inputs) [transformers]: Entry point. Handles batching, chunking, and orchestrates the three phases.
Step 2 -- preprocess(inputs) [transformers, task-specific]: Converts raw inputs to tensors, e.g. TextClassificationPipeline.preprocess calls the tokenizer.
Step 3 -- _forward(model_inputs) [backend-specific override]: Runs inference through the accelerated model; OptimizedModel.__call__ delegates to forward().
Step 4 -- postprocess(model_outputs) [transformers, task-specific]: Converts model outputs to a user-friendly format, e.g. applies softmax and maps label IDs to names.

Backend Integration

The key integration point between Optimum and transformers is at the _forward phase. When the pipeline calls _forward, it invokes the model's __call__ method, which for OptimizedModel subclasses (defined in optimum/modeling_base.py at L108-109) delegates to the abstract forward() method:

# In OptimizedModel (optimum/modeling_base.py L108-109)
def __call__(self, *args, **kwargs):
    return self.forward(*args, **kwargs)
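This delegation can be mimicked in a few lines (ToyOptimizedModel and ToyORTModel below are hypothetical stand-ins, not the real optimum classes):

```python
from abc import ABC, abstractmethod

class ToyOptimizedModel(ABC):
    """Mimics OptimizedModel: __call__ simply delegates to forward()."""
    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

    @abstractmethod
    def forward(self, *args, **kwargs): ...

class ToyORTModel(ToyOptimizedModel):
    """A backend subclass supplies the runtime-specific forward()."""
    def forward(self, *args, **kwargs):
        return {"backend": "onnxruntime", "inputs": kwargs}

out = ToyORTModel()(input_ids=[1, 2, 3])
print(out["backend"])  # onnxruntime
```

Because the base class delegates unconditionally, any transformers pipeline that calls the model like a function transparently runs the optimized forward pass.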

Each backend subclass implements forward() to use its optimized runtime:

ONNX Runtime -- ORTModel.forward(), executed via onnxruntime.InferenceSession.run()
OpenVINO -- OVModel.forward(), executed via OpenVINO compiled model inference
IPEX -- IPEXModel.forward(), executed via Intel Extension for PyTorch optimized execution
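The accelerator-to-backend selection can be pictured as a simple registry lookup. The mapping and classes below are illustrative only; optimum's actual dispatch in optimum/pipelines/__init__.py is more involved.

```python
# Hypothetical stand-ins for the backend model classes
class ToyORT:  backend = "onnxruntime"
class ToyOV:   backend = "openvino"
class ToyIPEX: backend = "ipex"

ACCELERATOR_REGISTRY = {"ort": ToyORT, "openvino": ToyOV, "ipex": ToyIPEX}

def pick_backend(accelerator: str):
    """Resolve an accelerator name to a model class, as a pipeline
    factory must do before wrapping the model in a transformers Pipeline."""
    try:
        return ACCELERATOR_REGISTRY[accelerator]
    except KeyError:
        raise ValueError(f"Unsupported accelerator: {accelerator!r}") from None

print(pick_backend("ort").backend)  # onnxruntime
```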
