Principle:Pytorch Serve Model Explainability

From Leeroopedia
Page Type: Principle
Title: Model Explainability
Short Description: Making model predictions interpretable through attribution methods - using integrated gradients to identify which input features (tokens/words) contribute most to a prediction
Domains: NLP, Explainability
Knowledge Sources: TorchServe
Workflow: HuggingFace_Transformer_Serving
Last Updated: 2026-02-13 00:00 GMT

Overview

Model Explainability in the context of Transformer serving refers to the ability to provide human-interpretable explanations alongside model predictions. Specifically, this principle covers the use of attribution methods - techniques that assign importance scores to each input feature (token or word) based on its contribution to the model's output. In TorchServe's HuggingFace integration, this is implemented through Captum's Layer Integrated Gradients algorithm applied to the model's embedding layer, producing per-token importance scores that reveal which words drove the prediction.

Description

Why Explainability Matters

Transformer models are inherently opaque - they process input through multiple attention layers and nonlinear transformations, making it difficult to understand why a particular prediction was made. In production serving scenarios, explainability is important for:

  • Trust - Users and stakeholders need to verify that the model is making decisions for the right reasons
  • Debugging - Developers need to diagnose incorrect predictions and identify data or model issues
  • Compliance - Regulated industries may require explanations for automated decisions
  • Improvement - Understanding model behavior guides data collection and model refinement

Integrated Gradients

Integrated Gradients is an axiomatic attribution method that computes the contribution of each input feature by integrating the gradient of the model's output with respect to the input along a straight-line path from a baseline to the actual input.

The method satisfies two key axioms:

  1. Sensitivity - If the input and the baseline differ in exactly one feature and produce different predictions, that feature receives a non-zero attribution
  2. Implementation Invariance - Two models that produce identical outputs for all inputs receive identical attributions, regardless of internal architecture differences

For Transformer models, Integrated Gradients is applied at the embedding layer rather than the raw input tokens, because the input space is discrete (token IDs) while the method requires continuous inputs. The embedding layer provides a continuous representation that can be meaningfully interpolated.

Layer Integrated Gradients

Captum's LayerIntegratedGradients extends the standard method by targeting a specific layer (the embedding layer) and computing attributions with respect to that layer's output. The process is:

  1. Construct input embeddings from the actual input token IDs
  2. Construct reference embeddings from a baseline input (padding tokens with CLS and SEP markers)
  3. Interpolate between reference and input embeddings along a straight path
  4. Compute gradients of the model output with respect to each interpolated point
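  5. Approximate the path integral by averaging the gradients over the interpolation steps, then scale by the difference between the input and reference embeddings
  6. Sum the attributions over the embedding dimension to obtain one score per token, and normalize the resulting attribution vector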
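The six steps above can be illustrated with a small, dependency-free sketch. It does not use Captum or a real Transformer: the embedding table, the classifier weights, the token IDs, and the function names are all hypothetical, and the toy model has an analytic gradient so that no autograd library is needed. L2 normalization of the per-token scores is one common choice and is assumed here.

```python
import math

# Hypothetical toy vocabulary with a 2-dimensional embedding table (illustrative values).
EMBEDDINGS = {
    0: [0.0, 0.0],    # [PAD]
    1: [0.1, -0.1],   # [CLS]
    2: [-0.1, 0.1],   # [SEP]
    3: [0.9, 0.2],    # "great"
    4: [0.1, 0.05],   # "movie"
}
WEIGHTS = [1.5, -0.5]  # toy classifier weights over the embedding dimensions


def embed(token_ids):
    """Look up one embedding row per token ID."""
    return [list(EMBEDDINGS[t]) for t in token_ids]


def model(emb):
    """Toy class score: sigmoid of a weighted sum over all token embeddings."""
    z = sum(w * v for row in emb for w, v in zip(WEIGHTS, row))
    return 1.0 / (1.0 + math.exp(-z))


def grad(emb):
    """Analytic gradient of model() with respect to every embedding entry."""
    s = model(emb)
    return [[s * (1.0 - s) * w for w in WEIGHTS] for _ in emb]


def layer_integrated_gradients(input_ids, ref_ids, n_steps=50):
    inp = embed(input_ids)                      # step 1: input embeddings
    ref = embed(ref_ids)                        # step 2: reference embeddings
    avg_grad = [[0.0] * len(WEIGHTS) for _ in inp]
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps             # midpoint of each sub-interval
        point = [[r + alpha * (x - r) for x, r in zip(x_row, r_row)]
                 for x_row, r_row in zip(inp, ref)]          # step 3: interpolate
        for t, g_row in enumerate(grad(point)):              # step 4: gradients
            for d, g in enumerate(g_row):
                avg_grad[t][d] += g / n_steps                # step 5: accumulate
    attributions = [[(x - r) * g for x, r, g in zip(x_row, r_row, g_row)]
                    for x_row, r_row, g_row in zip(inp, ref, avg_grad)]
    # Convergence delta: attribution sum minus the actual change in model output.
    delta = sum(map(sum, attributions)) - (model(inp) - model(ref))
    per_token = [sum(row) for row in attributions]           # step 6: summarize
    norm = math.sqrt(sum(v * v for v in per_token))
    return [v / norm for v in per_token], delta


# [CLS] great movie [SEP] versus the reference [CLS] [PAD] [PAD] [SEP]
importances, delta = layer_integrated_gradients([1, 3, 4, 2], [1, 0, 0, 2])
```

Because the special tokens are identical in the input and the reference, their attributions are exactly zero in this sketch; all importance is distributed over the content tokens.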

Task-Specific Explanations

The explanation format varies by NLP task:

Sequence Classification and Token Classification:

The method produces a single set of attributions for the target class, yielding:

  • words - the list of input tokens
  • importances - per-token attribution scores (higher magnitude = more important)
  • delta - the convergence delta, indicating the approximation quality

Question Answering:

QA models produce two outputs (answer start and answer end positions), so the method runs twice:

  • words - the list of input tokens
  • importances_answer_start - attributions for the answer start prediction
  • importances_answer_end - attributions for the answer end prediction
  • delta_start and delta_end - convergence deltas for each
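As a rough illustration of how a client might consume these fields, the sketch below builds two example payloads using the key names listed above and ranks tokens by attribution magnitude. The numeric values are invented for illustration, the helper `top_tokens` is hypothetical, and the exact response envelope returned by a running server may differ.

```python
# Illustrative payloads using the field names described above (values are made up).
classification_explanation = {
    "words": ["[CLS]", "great", "movie", "[SEP]"],
    "importances": [0.0, 0.91, -0.35, 0.0],
    "delta": 0.002,
}

qa_explanation = {
    "words": ["[CLS]", "who", "wrote", "it", "[SEP]", "alice", "wrote", "it", "[SEP]"],
    "importances_answer_start": [0.0, 0.2, 0.1, 0.0, 0.0, 0.9, -0.3, 0.1, 0.0],
    "importances_answer_end": [0.0, 0.1, 0.2, 0.0, 0.0, 0.8, 0.4, -0.1, 0.0],
    "delta_start": 0.01,
    "delta_end": -0.02,
}


def top_tokens(explanation, key="importances", k=2):
    """Return the k tokens with the largest attribution magnitude (hypothetical helper)."""
    ranked = sorted(
        zip(explanation["words"], explanation[key]),
        key=lambda pair: abs(pair[1]),
        reverse=True,
    )
    return [word for word, _ in ranked[:k]]


top_cls = top_tokens(classification_explanation)
top_start = top_tokens(qa_explanation, key="importances_answer_start", k=1)
```

Ranking by absolute value reflects the "higher magnitude = more important" reading: a strongly negative score is as informative as a strongly positive one.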

Baseline Construction

The choice of baseline is critical for meaningful attributions. In this implementation, the baseline is constructed by replacing all content tokens with the tokenizer's padding token ([PAD]) while preserving the special tokens ([CLS] and [SEP]). This represents a "content-free" input - the model processes the structural tokens but receives no semantic content. The attributions then represent each token's contribution relative to this neutral starting point.
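A minimal sketch of this baseline construction follows. The function name and the token IDs are assumptions for illustration; in a real handler the special-token IDs would be read from the tokenizer rather than hard-coded.

```python
def build_reference_ids(input_ids, cls_id, sep_id, pad_id):
    """Replace every content token with pad_id, keeping [CLS] and [SEP] in place."""
    special = {cls_id, sep_id}
    return [tok if tok in special else pad_id for tok in input_ids]


# Illustrative IDs in the style of a BERT vocabulary (assumed, not read from a tokenizer).
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

input_ids = [CLS_ID, 2023, 3185, 2001, 2307, SEP_ID]
reference_ids = build_reference_ids(input_ids, CLS_ID, SEP_ID, PAD_ID)
```

The reference sequence has the same length as the input, which is what allows the two embedding sequences to be interpolated position by position. For a question-answering input with a [SEP] between question and context, the same function would preserve that separator as well.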

Usage

To enable model explainability in TorchServe:

  1. Set captum_explanation: true in model-config.yaml
  2. Set embedding_name to the name of the model's embedding attribute (e.g., bert for BERT models)
  3. Send explanation requests to the TorchServe explanations endpoint (POST /explanations/{model_name} on the inference API) with the input text and target class

The input format for explanation requests is a JSON string containing:

  • "text" - the input text to explain
  • "target" - the target class index for which to compute attributions

For question answering, the text field carries both the question and the context, in the same form used for regular inference requests.
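The request body described above can be sketched with the Python standard library. The host, port, and model name below are hypothetical, and the `/explanations/{model_name}` path is an assumption about the deployment that should be checked against the TorchServe version in use. The sketch only builds the request object; it does not send it.

```python
import json
import urllib.request


def build_explanation_request(base_url, model_name, text, target):
    """Build (but do not send) a POST request carrying the explanation payload."""
    body = json.dumps({"text": text, "target": target}).encode("utf-8")
    url = f"{base_url}/explanations/{model_name}"  # assumed endpoint path
    return urllib.request.Request(url, data=body, method="POST")


# Hypothetical server address and model name.
request = build_explanation_request(
    "http://localhost:8080",
    "my_bert_model",
    "The movie was surprisingly good.",
    target=1,
)
payload = json.loads(request.data)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     explanation = json.loads(response.read())
```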

Explanation requests are more computationally expensive than standard inference because they require a forward and a backward pass for each interpolation step along the integration path, rather than a single forward pass. This tradeoff between interpretability and latency should be considered when designing the serving architecture.

Theoretical Basis

Integrated Gradients was introduced by Sundararajan et al. (2017) as an attribution method that satisfies desirable axiomatic properties. The key theoretical foundation is the fundamental theorem of calculus for line integrals (the gradient theorem) applied to neural networks: for a differentiable model, the difference in output between the input and the baseline equals the path integral of the gradients along the straight line connecting them.

Formally, for a model F, input x, and baseline x':

$$\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F\bigl(x' + \alpha (x - x')\bigr)}{\partial x_i} \, d\alpha$$

The completeness property guarantees that the sum of all attributions equals the difference between the model's output at the input and at the baseline:

$$\sum_i \mathrm{IG}_i(x) = F(x) - F(x')$$

The convergence delta returned by the implementation measures how well this completeness property is satisfied in practice (with finite integration steps), serving as a quality check on the explanation.
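The relationship between step count and convergence delta can be checked numerically on a toy function. The sketch below is an assumption-laden illustration, not the serving implementation: it uses F(x) = x_0^2 * x_1 with a hand-written gradient, a simple right-endpoint Riemann sum, and defines delta as the attribution sum minus F(x) - F(x').

```python
def f(x):
    """Toy differentiable model: F(x) = x0^2 * x1."""
    return x[0] ** 2 * x[1]


def grad_f(x):
    """Analytic gradient of f."""
    return [2.0 * x[0] * x[1], x[0] ** 2]


def integrated_gradients(x, baseline, n_steps):
    """Approximate IG with a right-endpoint Riemann sum over n_steps points."""
    avg_grad = [0.0] * len(x)
    for k in range(1, n_steps + 1):
        alpha = k / n_steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_f(point)):
            avg_grad[i] += g / n_steps
    attributions = [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]
    # Completeness says this should be zero when the integral is exact.
    delta = sum(attributions) - (f(x) - f(baseline))
    return attributions, delta


x, baseline = [1.0, 2.0], [0.0, 0.0]
deltas = {n: integrated_gradients(x, baseline, n)[1] for n in (2, 20, 200)}
attributions, _ = integrated_gradients(x, baseline, 200)
```

For this function the exact attributions are 4/3 and 2/3, which sum to F(x) - F(x') = 2. The absolute delta shrinks as the number of steps grows, which is the behavior a large delta in a served explanation would call into question.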

The choice to apply this at the embedding layer (rather than the token ID input) is a practical adaptation for discrete inputs, following the approach described in the Captum library's documentation for NLP models.
