Principle:Pytorch Serve Model Explainability
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Model Explainability |
| Short Description | Making model predictions interpretable through attribution methods - using integrated gradients to identify which input features (tokens/words) contribute most to a prediction |
| Domains | NLP, Explainability |
| Knowledge Sources | TorchServe |
| Workflow | HuggingFace_Transformer_Serving |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Model Explainability in the context of Transformer serving refers to the ability to provide human-interpretable explanations alongside model predictions. Specifically, this principle covers the use of attribution methods - techniques that assign importance scores to each input feature (token or word) based on its contribution to the model's output. In TorchServe's HuggingFace integration, this is implemented through Captum's Layer Integrated Gradients algorithm applied to the model's embedding layer, producing per-token importance scores that reveal which words drove the prediction.
Description
Why Explainability Matters
Transformer models are inherently opaque - they process input through multiple attention layers and nonlinear transformations, making it difficult to understand why a particular prediction was made. In production serving scenarios, explainability is important for:
- Trust - Users and stakeholders need to verify that the model is making decisions for the right reasons
- Debugging - Developers need to diagnose incorrect predictions and identify data or model issues
- Compliance - Regulated industries may require explanations for automated decisions
- Improvement - Understanding model behavior guides data collection and model refinement
Integrated Gradients
Integrated Gradients is an axiomatic attribution method that computes the contribution of each input feature by integrating the gradient of the model's output with respect to the input along a straight-line path from a baseline to the actual input.
The method satisfies two key axioms:
- Sensitivity - If the input and the baseline differ in a single feature and produce different predictions, that feature receives a non-zero attribution
- Implementation Invariance - Two models that produce identical outputs for all inputs receive identical attributions, regardless of internal architecture differences
For Transformer models, Integrated Gradients is applied at the embedding layer rather than the raw input tokens, because the input space is discrete (token IDs) while the method requires continuous inputs. The embedding layer provides a continuous representation that can be meaningfully interpolated.
Layer Integrated Gradients
Captum's LayerIntegratedGradients extends the standard method by targeting a specific layer (the embedding layer) and computing attributions with respect to that layer's output. The process is:
- Construct input embeddings from the actual input token IDs
- Construct reference embeddings from a baseline input (padding tokens with CLS and SEP markers)
- Interpolate between reference and input embeddings along a straight path
- Compute gradients of the model output with respect to each interpolated point
- Integrate (numerically sum) the gradients along the path and scale the result by the difference between the input and reference embeddings
- Normalize the resulting attribution vector
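The steps above can be sketched without Captum on a toy model. This is a minimal illustration, not TorchServe's handler code: the embedding table, the weights, the midpoint integration rule, and the L2 normalization are all assumptions chosen to keep the example self-contained.

```python
import math

# Hypothetical toy setup: a 5-token vocabulary with 2-dimensional embeddings.
EMBEDDINGS = {
    0: [0.0, 0.0],    # [PAD]
    1: [0.1, -0.1],   # [CLS]
    2: [-0.1, 0.1],   # [SEP]
    3: [1.0, 0.5],    # "great"
    4: [-0.2, 0.3],   # "movie"
}
WEIGHTS = [1.5, -0.5]  # toy classifier weights, shared across positions


def embed(token_ids):
    """Look up one embedding vector per token ID."""
    return [list(EMBEDDINGS[t]) for t in token_ids]


def model(embs):
    """Toy classifier: sigmoid of a weighted sum over all token embeddings."""
    z = sum(w * v for row in embs for w, v in zip(WEIGHTS, row))
    return 1.0 / (1.0 + math.exp(-z))


def grad(embs):
    """Analytic gradient of model() with respect to every embedding value."""
    p = model(embs)
    scale = p * (1.0 - p)
    return [[scale * w for w in WEIGHTS] for _ in embs]


def layer_integrated_gradients(input_ids, ref_ids, n_steps=50):
    # Input embeddings and reference (baseline) embeddings.
    inp, ref = embed(input_ids), embed(ref_ids)
    avg_grad = [[0.0] * len(WEIGHTS) for _ in inp]
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps  # midpoint rule on [0, 1]
        # Interpolated point on the straight path from reference to input.
        point = [[r + alpha * (x - r) for x, r in zip(xi, ri)]
                 for xi, ri in zip(inp, ref)]
        # Accumulate the average gradient along the path.
        for t, row in enumerate(grad(point)):
            for d, val in enumerate(row):
                avg_grad[t][d] += val / n_steps
    # Scale by (input - reference), then sum over the embedding dimension
    # to obtain one score per token.
    per_token = [sum((x - r) * g for x, r, g in zip(xi, ri, gi))
                 for xi, ri, gi in zip(inp, ref, avg_grad)]
    # Normalize by the L2 norm (assumed normalization scheme).
    norm = math.sqrt(sum(a * a for a in per_token)) or 1.0
    return [a / norm for a in per_token]


input_ids = [1, 3, 4, 2]  # [CLS] great movie [SEP]
ref_ids = [1, 0, 0, 2]    # [CLS] [PAD] [PAD] [SEP]
importances = layer_integrated_gradients(input_ids, ref_ids)
```

Because [CLS] and [SEP] are identical in the input and the reference, their scores are exactly zero in this sketch; only the content tokens receive attribution.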
Task-Specific Explanations
The explanation format varies by NLP task:
Sequence Classification and Token Classification:
The method produces a single set of attributions for the target class, yielding:
- words - the list of input tokens
- importances - per-token attribution scores (higher magnitude = more important)
- delta - the convergence delta, indicating the approximation quality
Question Answering:
QA models produce two outputs (answer start and answer end positions), so the method runs twice:
- words - the list of input tokens
- importances_answer_start - attributions for the answer start prediction
- importances_answer_end - attributions for the answer end prediction
- delta_start and delta_end - convergence deltas for each
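A client will often want to rank tokens by attribution magnitude once it has the fields listed above. The helper below is a hypothetical sketch (`rank_tokens` is not part of TorchServe), and the explanation values are made up purely to show the shape of the data.

```python
def rank_tokens(words, importances, top_k=3):
    """Return the top_k (word, score) pairs ordered by absolute attribution."""
    if len(words) != len(importances):
        raise ValueError("words and importances must have the same length")
    pairs = zip(words, importances)
    return sorted(pairs, key=lambda p: abs(p[1]), reverse=True)[:top_k]


# Illustrative, made-up values in the sequence-classification shape.
explanation = {
    "words": ["[CLS]", "great", "movie", "[SEP]"],
    "importances": [0.0, 0.9, -0.4, 0.1],
    "delta": 0.01,
}
top = rank_tokens(explanation["words"], explanation["importances"], top_k=2)
```

For question answering, the same helper could be called twice, once with importances_answer_start and once with importances_answer_end.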
Baseline Construction
The choice of baseline is critical for meaningful attributions. In this implementation, the baseline is constructed by replacing all content tokens with the tokenizer's padding token ([PAD]) while preserving the special tokens ([CLS] and [SEP]). This represents a "content-free" input - the model processes the structural tokens but receives no semantic content. The attributions then represent each token's contribution relative to this neutral starting point.
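A minimal sketch of this baseline construction follows. The function name is hypothetical, and the special-token IDs are assumptions matching the common bert-base-uncased vocabulary; in practice they would be read from the tokenizer rather than hard-coded.

```python
def build_reference_ids(input_ids, pad_id, cls_id, sep_id):
    """Replace every content token with pad_id, keeping [CLS] and [SEP] in place."""
    special = {cls_id, sep_id}
    return [tok if tok in special else pad_id for tok in input_ids]


# Assumed special-token IDs (bert-base-uncased convention).
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102

# Arbitrary placeholder IDs stand in for the content tokens.
input_ids = [CLS_ID, 2307, 3185, SEP_ID]
reference_ids = build_reference_ids(input_ids, PAD_ID, CLS_ID, SEP_ID)
```

The reference sequence has the same length as the input, which the straight-line interpolation between the two embedding tensors requires.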
Usage
To enable model explainability in TorchServe:
- Set `captum_explanation: true` in `model-config.yaml`
- Set `embedding_name` to the name of the model's embedding attribute (e.g., `bert` for BERT models)
- Send explanation requests to the TorchServe explain endpoint with the input text and target class
The input format for explanation requests is a JSON string containing:
- `"text"` - the input text to explain
- `"target"` - the target class index for which to compute attributions
For question answering, the text field should contain the question and context as usual.
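As a sketch, the request body can be built with the standard library. The helper name and the sample sentence are hypothetical; the endpoint path (commonly of the form `/explanations/<model_name>` on the inference port) is an assumption that should be checked against the actual deployment.

```python
import json


def build_explain_payload(text, target):
    """Serialize an explanation request with the "text" and "target" fields."""
    return json.dumps({"text": text, "target": int(target)})


# Hypothetical example input; target 1 is an arbitrary class index.
payload = build_explain_payload("This movie was great.", 1)

# The payload would then be sent as the POST body to the explain endpoint,
# e.g. http://localhost:8080/explanations/<model_name> (assumed URL).
```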
Explanation requests are more computationally expensive than standard inference because they require a forward and a backward pass for each interpolation step along the integration path, rather than a single forward pass. This tradeoff between interpretability and latency should be considered when designing the serving architecture.
Theoretical Basis
Integrated Gradients was introduced by Sundararajan et al. (2017) as an attribution method that satisfies desirable axiomatic properties. The key theoretical foundation is the fundamental theorem of calculus for line integrals (the gradient theorem) applied to the network's output: the difference in model output between the input and the baseline equals the integral of the gradients along a path connecting them.
Formally, for a model F, input x, and baseline x', the attribution for feature i is:
$$\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1 \frac{\partial F\big(x' + \alpha (x - x')\big)}{\partial x_i} \, d\alpha$$
The completeness property guarantees that the sum of all attributions equals the difference between the model's output at the input and at the baseline:
$$\sum_i \mathrm{IG}_i(x) = F(x) - F(x')$$
The convergence delta returned by the implementation measures how well this completeness property is satisfied in practice (with finite integration steps), serving as a quality check on the explanation.
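The effect of the step count on the convergence delta can be illustrated on a toy function. This sketch assumes a midpoint Riemann sum and defines the delta as the sum of attributions minus F(x) - F(x'); Captum's own integration rule and sign convention may differ, and the function and weights here are hypothetical.

```python
import math

COEFFS = (2.0, -1.0, 0.5)  # hypothetical weights of the toy model


def f(x):
    """Toy differentiable model: sigmoid of a fixed linear score."""
    z = sum(c * xi for c, xi in zip(COEFFS, x))
    return 1.0 / (1.0 + math.exp(-z))


def grad_f(x):
    """Analytic gradient of f with respect to each input feature."""
    p = f(x)
    return [p * (1.0 - p) * c for c in COEFFS]


def integrated_gradients(x, baseline, n_steps):
    """Approximate IG_i with a midpoint Riemann sum over n_steps points."""
    attributions = [0.0] * len(x)
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(len(x)):
            attributions[i] += (x[i] - baseline[i]) * g[i] / n_steps
    return attributions


def convergence_delta(x, baseline, attributions):
    """Gap between the attribution sum and F(x) - F(x')."""
    return sum(attributions) - (f(x) - f(baseline))


x, baseline = [1.0, 0.5, -1.0], [0.0, 0.0, 0.0]
delta_coarse = abs(convergence_delta(x, baseline, integrated_gradients(x, baseline, 2)))
delta_fine = abs(convergence_delta(x, baseline, integrated_gradients(x, baseline, 200)))
```

With more integration steps the delta shrinks toward zero, which is why a large delta in practice suggests increasing the number of steps before trusting the attributions.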
The choice to apply this at the embedding layer (rather than the token ID input) is a practical adaptation for discrete inputs, following the approach described in the Captum library's documentation for NLP models.
Related Pages
- Implementation:Pytorch_Serve_Captum_Explanations - The implementation of integrated gradients for Transformer models
- Principle:Pytorch_Serve_Generalized_NLP_Handler - The handler that invokes explainability through `get_insights()`
- Principle:Pytorch_Serve_Transformer_Configuration - Configuration that enables/disables Captum explanations