Workflow:SeldonIO Seldon Core HuggingFace Model Serving

From Leeroopedia
Knowledge Sources
Domains LLMOps, Model_Serving, NLP, Kubernetes
Last Updated 2026-02-13 14:00 GMT

Overview

End-to-end process for deploying HuggingFace transformer models (text generation, sentiment analysis, speech-to-text) on Seldon Core 2 via MLServer's HuggingFace runtime.

Description

This workflow covers deploying HuggingFace models through Seldon Core 2's MLServer integration. MLServer provides a dedicated HuggingFace runtime that supports various transformer pipelines including text generation, sentiment analysis, text classification, speech recognition, and other tasks available through the HuggingFace transformers library. Models can be served from pre-downloaded artifacts stored in cloud storage, or configured to download from the HuggingFace Model Hub at load time. The workflow also covers building multi-model pipelines that combine HuggingFace models with custom transformers for end-to-end applications like speech-to-sentiment analysis.

Usage

Execute this workflow when you need to serve HuggingFace transformer models as production inference endpoints. Common use cases include deploying text generation models (GPT-2, TinyStories), sentiment classifiers, speech-to-text systems (Whisper), and any other HuggingFace pipeline task. This is the recommended path for LLM and NLP model deployment within the Seldon Core 2 ecosystem.

Execution Steps

Step 1: Prepare HuggingFace Model Artifact

Download and package the HuggingFace model for deployment. The model artifacts should include the model weights, tokenizer files, and an MLServer model-settings.json that specifies the HuggingFace runtime implementation and the task type. Store the packaged artifacts in accessible cloud storage.

Key considerations:

  • The model-settings.json file must specify the MLServer HuggingFace runtime implementation class
  • The task parameter (e.g., text-generation, sentiment-analysis) determines the pipeline type
  • Pre-downloading models avoids network dependencies at load time
  • For custom models, fine-tune with the HuggingFace training scripts, then save the model locally before uploading
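As an illustration, a minimal model-settings.json for a text-generation task might look like the following. The model name and the distilgpt2 checkpoint are placeholders, not values from this workflow's repository:

```json
{
  "name": "text-generator",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "extra": {
      "task": "text-generation",
      "pretrained_model": "distilgpt2"
    }
  }
}
```

Setting pretrained_model causes a download from the HuggingFace Hub at load time; omit it when the weights are packaged alongside the settings file in cloud storage.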

Step 2: Define Model Resource

Create a Seldon Model manifest with the huggingface requirement and the storage URI pointing to the packaged model artifacts. The huggingface requirement ensures the model is scheduled onto a server with the HuggingFace runtime capability.

Key considerations:

  • The requirements field must include huggingface to match HuggingFace-capable servers
  • Memory requirements for transformer models are typically larger than those of traditional ML models
  • GPU-enabled servers may be needed for large language models
  • Multiple HuggingFace models can share the same MLServer instance via multi-model serving
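A sketch of the corresponding Model manifest is shown below; the model name and storage bucket are assumptions for illustration:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: text-generator
spec:
  storageUri: "gs://my-bucket/models/text-generator"
  requirements:
    - huggingface
```

The requirements entry is what the Seldon scheduler matches against server capabilities, so a model with huggingface listed will only land on a server whose runtime advertises that capability.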

Step 3: Deploy and Verify Model

Load the model onto the cluster and wait for it to reach the Available state. HuggingFace model loading may involve downloading pretrained weights and compiling the model graph, which can take longer than traditional model loading.

Key considerations:

  • Initial loading of large transformer models can take several minutes
  • Verify the model is fully loaded before sending inference requests
  • Check server logs if the model fails to load (common issues: missing dependencies, CUDA errors)
  • Memory usage should be monitored as transformer models can be memory-intensive
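Loading and verification can be driven from kubectl; the namespace and server name below are assumptions, so adjust them to your installation:

```shell
# Load the model and wait for it to become Available
kubectl apply -f text-generator.yaml -n seldon-mesh
kubectl wait --for condition=Ready model/text-generator -n seldon-mesh --timeout=600s

# If loading fails, inspect the serving runtime's logs
kubectl logs -n seldon-mesh sts/mlserver -c mlserver
```

The generous timeout reflects the note above: large transformer weights can take several minutes to download and initialize.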

Step 4: Test Single Model Inference

Send inference requests matching the model's expected input format. Text models accept string inputs, while speech models accept base64-encoded audio data. Verify that the model produces the expected output format for its task type.

Key considerations:

  • Text inputs use BYTES datatype with string content in the V2 protocol
  • Audio inputs must be base64-encoded binary data
  • Response format varies by task (text generation returns generated text, sentiment returns labels and scores)
  • Use the content_type field in model-settings.json to ensure correct data conversion
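To make the text input format concrete, here is a sketch of building a V2 (Open Inference Protocol) request body for a text model. The input tensor name args is an illustrative assumption:

```python
import json


def build_v2_text_request(text: str) -> dict:
    """Build an Open Inference Protocol (V2) request carrying one string.

    Text travels as a BYTES tensor; the content_type hint tells the
    runtime to decode the raw bytes back into a Python string.
    """
    return {
        "inputs": [
            {
                "name": "args",  # illustrative input name
                "shape": [1],
                "datatype": "BYTES",
                "data": [text],
                "parameters": {"content_type": "str"},
            }
        ]
    }


payload = build_v2_text_request("Once upon a time")
print(json.dumps(payload))
```

The resulting JSON is POSTed to the model's V2 infer endpoint; for audio models the same shape applies, but data carries base64-encoded bytes instead of plain text.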

Step 5: Build Multi-Model Pipeline

Optionally compose the HuggingFace model into a pipeline with other models. For example, chain a Whisper speech-to-text model with a sentiment analysis model to create a speech-to-sentiment pipeline. Custom input/output transform models can adapt data formats between incompatible model interfaces.
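Assuming models named whisper, speech-transform, and sentiment are already deployed, the chained pipeline might be sketched as:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: speech-to-sentiment
spec:
  steps:
    - name: whisper
    - name: speech-transform
      inputs:
        - whisper
    - name: sentiment
      inputs:
        - speech-transform
  output:
    steps:
      - sentiment
```

Each step's inputs reference the upstream step by name, and the pipeline's output exposes the final sentiment result to callers.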

Key considerations:

  • Custom transform models may be needed to convert between different model input/output formats
  • Transform models are standard MLServer Python models with custom predict logic
  • The pipeline handles data serialization/deserialization between steps automatically via Kafka
  • Output transforms can convert model outputs for downstream consumption (e.g., for explainers)
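In MLServer, such a transform is a custom Python model whose predict logic reshapes one model's output into the next model's input. As a dependency-free sketch of just that reshaping step (tensor and field names are assumptions for illustration):

```python
def whisper_to_sentiment_input(speech_response: dict) -> dict:
    """Adapt a speech model's V2 response into a sentiment model's V2 request.

    Extracts the transcribed text from the first output tensor and wraps
    it as a BYTES string input for the downstream classifier.
    """
    transcript = speech_response["outputs"][0]["data"][0]
    return {
        "inputs": [
            {
                "name": "args",  # illustrative input name
                "shape": [1],
                "datatype": "BYTES",
                "data": [transcript],
                "parameters": {"content_type": "str"},
            }
        ]
    }
```

In a deployed transform model this function body would live inside the custom predict method, with the pipeline delivering the upstream response via Kafka.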

Step 6: Add Explainability

Optionally add explainer models to provide interpretability for HuggingFace model predictions. Anchor text explainers can identify which input tokens most influence the model's prediction, enabling debugging and trust in NLP model outputs.

Key considerations:

  • Explainers require pre-trained explanation models (e.g., anchor text explainer artifacts)
  • The explainer model references the base HuggingFace model in its explainer.modelRef field
  • Explanation requests go through the explainer endpoint, not the base model endpoint
  • Explainer inference is computationally expensive and should be used selectively
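An explainer attached to a deployed sentiment model might be declared as follows; the names and storage bucket are placeholders:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: sentiment-explainer
spec:
  storageUri: "gs://my-bucket/explainers/sentiment-anchor-text"
  explainer:
    type: anchor_text
    modelRef: sentiment
```

The modelRef field points at the base HuggingFace model by name, and explanation requests are sent to the explainer's own endpoint rather than the base model's.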

Execution Diagram

GitHub URL

Workflow Repository