Principle:SeldonIO Seldon core HuggingFace Text Inference
| Field | Value |
|---|---|
| Overview | Sending text-based inference requests to HuggingFace models using the V2 protocol BYTES datatype. |
| Domains | NLP, Inference |
| Related Implementation | SeldonIO_Seldon_core_Seldon_Model_Infer_BYTES |
| Knowledge Sources | Repo (https://github.com/SeldonIO/seldon-core), Doc (https://docs.seldon.io/projects/seldon-core/en/v2/) |
| Last Updated | 2026-02-13 00:00 GMT |
Description
HuggingFace models accept text input via the V2 BYTES datatype, with the text strings passed directly in the data array. REST requests carry plain strings; gRPC requests written as JSON (for example through the CLI) carry base64-encoded strings. Different HuggingFace model types (sentiment analysis, text generation, Whisper speech-to-text) share the same BYTES input format but produce different outputs (labels with scores, generated text, transcriptions).
The inference request structure follows the V2 Inference Protocol:
- Input tensor name -- typically "args" for HuggingFace models
- Shape -- [N], where N is the number of input strings (batch size)
- Datatype -- always "BYTES" for text input
- Data -- array of text strings (REST) or base64-encoded strings (gRPC)
The response format varies by model type:
- Sentiment analysis -- returns labels (e.g., "POSITIVE", "NEGATIVE") and confidence scores
- Text generation -- returns generated text continuations
- Speech-to-text (Whisper) -- returns transcribed text from audio input
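Because every model type returns its result in BYTES output tensors, a client can read responses with one generic routine and interpret the strings per model type afterwards. The sketch below assumes a response dictionary shaped like a V2 response; the sample response, the output name "output", and the JSON-encoded element are illustrative assumptions, not captured from a real deployment.

```python
import json


def read_bytes_outputs(response):
    """Return {output_name: [str, ...]} for every BYTES output tensor.

    Hypothetical helper: it only relies on the generic V2 fields
    "outputs", "name", "datatype" and "data".
    """
    results = {}
    for output in response.get("outputs", []):
        if output.get("datatype") != "BYTES":
            continue  # skip numeric tensors
        results[output["name"]] = [
            item.decode("utf-8") if isinstance(item, bytes) else item
            for item in output.get("data", [])
        ]
    return results


# Hypothetical sentiment-style response, for illustration only.
sample_response = {
    "model_name": "sentiment",
    "outputs": [
        {
            "name": "output",
            "shape": [1],
            "datatype": "BYTES",
            "data": ['{"label": "POSITIVE", "score": 0.99}'],
        }
    ],
}

decoded = read_bytes_outputs(sample_response)
first = json.loads(decoded["output"][0])  # assumes the element is JSON text
```

For text generation or transcription the same helper applies; only the interpretation of the returned strings changes.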
Theoretical Basis
The BYTES datatype in the V2 protocol handles variable-length opaque data including text strings. This bridges the gap between the tensor-oriented V2 protocol and NLP models that expect raw text input. The MLServer HuggingFace runtime internally tokenizes the text using the model's tokenizer.
The data flow for text inference is:
- The client sends raw text in the "data" field with datatype: "BYTES".
- The MLServer HuggingFace runtime receives the V2 request and extracts the text strings.
- The runtime passes the text through the model's tokenizer to produce input tensors (token IDs, attention masks).
- The tokenized input is fed through the model for inference.
- The model output (logits, generated tokens) is decoded back to human-readable format.
- The response is formatted as a V2 response with BYTES output tensors.
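The stages above can be illustrated with a toy end-to-end handler. Everything in this sketch is a stand-in: the whitespace tokenizer, the word-counting "model", and the `handle` function are hypothetical and do not reflect how the MLServer HuggingFace runtime or any real tokenizer is implemented. It only shows where each stage sits between the V2 request and the V2 response.

```python
import json

# Toy stand-ins, NOT the MLServer HuggingFace runtime or a real model.
POSITIVE_WORDS = {"great", "good", "love"}


def extract_texts(request):
    """Pull the text strings out of the first BYTES input tensor."""
    for tensor in request["inputs"]:
        if tensor["datatype"] == "BYTES":
            return list(tensor["data"])
    raise ValueError("request has no BYTES input")


def toy_tokenize(text):
    """Tokenizer stand-in: strip punctuation, lowercase, split on whitespace."""
    return [word.strip(".,!?").lower() for word in text.split()]


def toy_model(tokens):
    """Model stand-in: the number of positive words acts as a raw score."""
    return sum(1 for token in tokens if token in POSITIVE_WORDS)


def toy_decode(score):
    """Decode the raw score into a human-readable label."""
    return {"label": "POSITIVE" if score > 0 else "NEGATIVE"}


def handle(request):
    """Run extract -> tokenize -> infer -> decode and wrap the result as V2."""
    texts = extract_texts(request)
    results = [toy_decode(toy_model(toy_tokenize(text))) for text in texts]
    return {
        "outputs": [
            {
                "name": "output",  # assumed output name
                "shape": [len(results)],
                "datatype": "BYTES",
                "data": [json.dumps(result) for result in results],
            }
        ]
    }


request = {
    "inputs": [
        {
            "name": "args",
            "shape": [2],
            "datatype": "BYTES",
            "data": ["This is great!", "This is awful."],
        }
    ]
}
response = handle(request)
```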
For gRPC transport, BYTES content must be base64-encoded whenever the request is written as JSON, because the Protocol Buffers JSON mapping represents bytes fields as base64 strings; the binary wire format itself carries raw bytes. A client that builds its gRPC request as JSON must therefore base64-encode the text before sending and base64-decode the BYTES output in the response.
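The base64 step for gRPC requests can be sketched with the Python standard library. The helper names are hypothetical; the `contents.bytes_contents` field follows the V2 gRPC tensor message, and the exact JSON accepted by a given client or CLI should be checked against its documentation.

```python
import base64


def encode_texts(texts):
    """Base64-encode each UTF-8 string for use in a JSON-form gRPC request."""
    return [base64.b64encode(text.encode("utf-8")).decode("ascii") for text in texts]


def decode_texts(encoded):
    """Reverse of encode_texts: base64 strings back to UTF-8 text."""
    return [base64.b64decode(item).decode("utf-8") for item in encoded]


def build_grpc_json_request(texts, input_name="args"):
    """Hypothetical helper: JSON form of a V2 gRPC request with BYTES input."""
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": [len(texts)],
                "datatype": "BYTES",
                "contents": {"bytes_contents": encode_texts(texts)},
            }
        ]
    }


grpc_request = build_grpc_json_request(["Hello"])
round_trip = decode_texts(grpc_request["inputs"][0]["contents"]["bytes_contents"])
```

Encoding to UTF-8 before base64 keeps non-ASCII text intact across the round trip.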
Usage
This principle applies when sending text or audio input to deployed HuggingFace models for inference, including:
- Sending single or batched text strings for sentiment analysis
- Providing text prompts for auto-regressive text generation
- Submitting audio data for speech-to-text transcription via Whisper
- Using either REST or gRPC transport protocols
Related Pages
- SeldonIO_Seldon_core_Seldon_Model_Infer_BYTES -- implements this principle with concrete CLI examples for REST and gRPC
- SeldonIO_Seldon_core_HuggingFace_Model_Deployment_And_Verification -- precedes this principle; models must be deployed and verified before inference
- SeldonIO_Seldon_core_Multi_Modal_Pipeline_Composition -- extends this principle to multi-step pipelines chaining multiple models
- SeldonIO_Seldon_core_V2_Inference_Protocol -- generalizes the V2 protocol used for all inference request types
Implementation:SeldonIO_Seldon_core_Seldon_Model_Infer_BYTES