Workflow:SeldonIO Seldon Core HuggingFace Model Serving

From Leeroopedia
Knowledge Sources
Domains LLMOps, Model_Serving, NLP, Kubernetes
Last Updated 2026-02-13 14:00 GMT

Overview

End-to-end process for deploying HuggingFace transformer models (text generation, sentiment analysis, speech-to-text) on Seldon Core 2 via MLServer's HuggingFace runtime.

Description

This workflow covers deploying HuggingFace models through Seldon Core 2's MLServer integration. MLServer provides a dedicated HuggingFace runtime that supports various transformer pipelines including text generation, sentiment analysis, text classification, speech recognition, and other tasks available through the HuggingFace transformers library. Models can be served from pre-downloaded artifacts stored in cloud storage, or configured to download from the HuggingFace Model Hub at load time. The workflow also covers building multi-model pipelines that combine HuggingFace models with custom transformers for end-to-end applications like speech-to-sentiment analysis.

Usage

Execute this workflow when you need to serve HuggingFace transformer models as production inference endpoints. Common use cases include deploying text generation models (GPT-2, TinyStories), sentiment classifiers, speech-to-text systems (Whisper), and any other HuggingFace pipeline task. This is the recommended path for LLM and NLP model deployment within the Seldon Core 2 ecosystem.

Execution Steps

Step 1: Prepare HuggingFace Model Artifact

Download and package the HuggingFace model for deployment. The model artifacts should include the model weights, tokenizer files, and an MLServer model-settings.json that specifies the HuggingFace runtime implementation and the task type. Store the packaged artifacts in accessible cloud storage.

Key considerations:

  • The model-settings.json file must specify the MLServer HuggingFace runtime implementation class
  • The task parameter (e.g., text-generation, sentiment-analysis) determines the pipeline type
  • Pre-downloading models avoids network dependencies at load time
  • For custom models, fine-tune with the HuggingFace training scripts, then save the model locally before uploading
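As an illustration, a minimal model-settings.json for a text-generation task might look like the following. The model name and the distilgpt2 checkpoint are placeholders, not values from this workflow's repository:

```json
{
  "name": "text-generator",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "extra": {
      "task": "text-generation",
      "pretrained_model": "distilgpt2"
    }
  }
}
```

Setting pretrained_model causes a download from the HuggingFace Hub at load time; omit it when the weights are packaged alongside the settings file in cloud storage.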

Step 2: Define Model Resource

Create a Seldon Model manifest with the huggingface requirement and the storage URI pointing to the packaged model artifacts. The huggingface requirement ensures the model is scheduled onto a server with the HuggingFace runtime capability.

Key considerations:

  • The requirements field must include huggingface to match HuggingFace-capable servers
  • Memory requirements for transformer models are typically larger than those of traditional ML models
  • GPU-enabled servers may be needed for large language models
  • Multiple HuggingFace models can share the same MLServer instance via multi-model serving
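A sketch of the corresponding Model manifest is shown below; the model name and storage bucket are assumptions for illustration:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: text-generator
spec:
  storageUri: "gs://my-bucket/models/text-generator"
  requirements:
    - huggingface
```

The requirements entry is what the Seldon scheduler matches against server capabilities, so a model with huggingface listed will only land on a server whose runtime advertises that capability.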

Step 3: Deploy and Verify Model

Load the model onto the cluster and wait for it to reach the Available state. HuggingFace model loading may involve downloading pretrained weights and compiling the model graph, which can take longer than traditional model loading.

Key considerations:

  • Initial loading of large transformer models can take several minutes
  • Verify the model is fully loaded before sending inference requests
  • Check server logs if the model fails to load (common issues: missing dependencies, CUDA errors)
  • Memory usage should be monitored as transformer models can be memory-intensive
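Loading and verification can be driven from kubectl; the namespace and server name below are assumptions, so adjust them to your installation:

```shell
# Load the model and wait for it to become Available
kubectl apply -f text-generator.yaml -n seldon-mesh
kubectl wait --for condition=Ready model/text-generator -n seldon-mesh --timeout=600s

# If loading fails, inspect the serving runtime's logs
kubectl logs -n seldon-mesh sts/mlserver -c mlserver
```

The generous timeout reflects the note above: large transformer weights can take several minutes to download and initialize.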

Step 4: Test Single Model Inference

Send inference requests matching the model's expected input format. Text models accept string inputs, while speech models accept base64-encoded audio data. Verify that the model produces the expected output format for its task type.

Key considerations:

  • Text inputs use BYTES datatype with string content in the V2 protocol
  • Audio inputs must be base64-encoded binary data
  • Response format varies by task (text generation returns generated text, sentiment returns labels and scores)
  • Use the content_type field in model-settings.json to ensure correct data conversion
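To make the text input format concrete, here is a sketch of building a V2 (Open Inference Protocol) request body for a text model. The input tensor name args is an illustrative assumption:

```python
import json


def build_v2_text_request(text: str) -> dict:
    """Build an Open Inference Protocol (V2) request carrying one string.

    Text travels as a BYTES tensor; the content_type hint tells the
    runtime to decode the raw bytes back into a Python string.
    """
    return {
        "inputs": [
            {
                "name": "args",  # illustrative input name
                "shape": [1],
                "datatype": "BYTES",
                "data": [text],
                "parameters": {"content_type": "str"},
            }
        ]
    }


payload = build_v2_text_request("Once upon a time")
print(json.dumps(payload))
```

The resulting JSON is POSTed to the model's V2 infer endpoint; for audio models the same shape applies, but data carries base64-encoded bytes instead of plain text.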

Step 5: Build Multi-Model Pipeline

Optionally compose the HuggingFace model into a pipeline with other models. For example, chain a Whisper speech-to-text model with a sentiment analysis model to create a speech-to-sentiment pipeline. Custom input/output transform models can adapt data formats between incompatible model interfaces.
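Assuming models named whisper, speech-transform, and sentiment are already deployed, the chained pipeline might be sketched as:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: speech-to-sentiment
spec:
  steps:
    - name: whisper
    - name: speech-transform
      inputs:
        - whisper
    - name: sentiment
      inputs:
        - speech-transform
  output:
    steps:
      - sentiment
```

Each step's inputs reference the upstream step by name, and the pipeline's output exposes the final sentiment result to callers.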

Key considerations:

  • Custom transform models may be needed to convert between different model input/output formats
  • Transform models are standard MLServer Python models with custom predict logic
  • The pipeline handles data serialization/deserialization between steps automatically via Kafka
  • Output transforms can convert model outputs for downstream consumption (e.g., for explainers)
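In MLServer, such a transform is a custom Python model whose predict logic reshapes one model's output into the next model's input. As a dependency-free sketch of just that reshaping step (tensor and field names are assumptions for illustration):

```python
def whisper_to_sentiment_input(speech_response: dict) -> dict:
    """Adapt a speech model's V2 response into a sentiment model's V2 request.

    Extracts the transcribed text from the first output tensor and wraps
    it as a BYTES string input for the downstream classifier.
    """
    transcript = speech_response["outputs"][0]["data"][0]
    return {
        "inputs": [
            {
                "name": "args",  # illustrative input name
                "shape": [1],
                "datatype": "BYTES",
                "data": [transcript],
                "parameters": {"content_type": "str"},
            }
        ]
    }
```

In a deployed transform model this function body would live inside the custom predict method, with the pipeline delivering the upstream response via Kafka.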

Step 6: Add Explainability

Optionally add explainer models to provide interpretability for HuggingFace model predictions. Anchor text explainers can identify which input tokens most influence the model's prediction, enabling debugging and trust in NLP model outputs.

Key considerations:

  • Explainers require pre-trained explanation models (e.g., anchor text explainer artifacts)
  • The explainer model references the base HuggingFace model in its explainer.modelRef field
  • Explanation requests go through the explainer endpoint, not the base model endpoint
  • Explainer inference is computationally expensive and should be used selectively
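An explainer attached to a deployed sentiment model might be declared as follows; the names and storage bucket are placeholders:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: sentiment-explainer
spec:
  storageUri: "gs://my-bucket/explainers/sentiment-anchor-text"
  explainer:
    type: anchor_text
    modelRef: sentiment
```

The modelRef field points at the base HuggingFace model by name, and explanation requests are sent to the explainer's own endpoint rather than the base model's.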

Execution Diagram

GitHub URL

Workflow Repository