Principle:SeldonIO Seldon core HuggingFace Text Inference

Field	Value
Overview	Sending text-based inference requests to HuggingFace models using the V2 protocol BYTES datatype.
Domains	NLP, Inference
Related Implementation	SeldonIO_Seldon_core_Seldon_Model_Infer_BYTES
Knowledge Sources	Repo (https://github.com/SeldonIO/seldon-core), Doc (https://docs.seldon.io/projects/seldon-core/en/v2/)
Last Updated	2026-02-13 00:00 GMT

Description

HuggingFace models accept text input via the V2 BYTES datatype. Text strings are passed directly in the data array. REST requests carry plain strings; gRPC requests written as JSON (for example, through the Seldon CLI) carry base64-encoded strings. Different HuggingFace model types (sentiment analysis, text generation, Whisper) all use the same BYTES input format but produce different outputs (labels with scores, generated text, transcriptions).

The inference request structure follows the V2 Inference Protocol:

Input tensor name -- typically "args" for HuggingFace models
Shape -- [N] where N is the number of input strings (batch size)
Datatype -- always "BYTES" for text input
Data -- array of text strings (REST) or base64-encoded strings (gRPC requests expressed as JSON)
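
The four fields above map onto a REST request body roughly as follows. This is a minimal sketch rather than the Seldon client itself: the helper name build_text_request is hypothetical, and the default tensor name "args" is taken from the convention described above.

```python
import json


def build_text_request(texts, input_name="args"):
    """Build a V2 inference request body for a batch of text strings.

    build_text_request is a hypothetical helper for illustration only.
    """
    texts = list(texts)
    if not texts:
        raise ValueError("at least one input string is required")
    return {
        "inputs": [
            {
                "name": input_name,      # typically "args" for HuggingFace models
                "shape": [len(texts)],   # [N], the batch size
                "datatype": "BYTES",     # always BYTES for text input
                "data": texts,           # plain strings for REST
            }
        ]
    }


body = build_text_request(["Seldon is great", "This is terrible"])
print(json.dumps(body, indent=2))
```

The shape is derived from the number of strings, so a batch of two inputs produces a shape of [2] without the caller setting it by hand.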

The response format varies by model type:

Sentiment analysis -- returns labels (e.g., "POSITIVE", "NEGATIVE") and confidence scores
Text generation -- returns generated text continuations
Speech-to-text (Whisper) -- returns transcribed text from audio input
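
Because the output differs by model type, client code usually needs a tolerant decoding step. The sketch below assumes that each BYTES output element arrives as a string which is either plain text or a JSON-encoded object; the sample response, including the output name "output" and the score value, is invented for illustration and is not captured from a real deployment.

```python
import json


def decode_bytes_outputs(response):
    """Return {output_name: [decoded elements]} for every BYTES output.

    Elements that parse as a JSON object or array are returned parsed;
    anything else is returned unchanged as text.
    """
    decoded = {}
    for output in response.get("outputs", []):
        if output.get("datatype") != "BYTES":
            continue
        items = []
        for element in output.get("data", []):
            try:
                parsed = json.loads(element)
            except (TypeError, ValueError):
                parsed = element
            # Keep plain strings such as "123" as text, not numbers.
            items.append(parsed if isinstance(parsed, (dict, list)) else element)
        decoded[output["name"]] = items
    return decoded


# Hypothetical sentiment-style response, for illustration only.
sample_response = {
    "model_name": "sentiment",
    "outputs": [
        {
            "name": "output",
            "shape": [1],
            "datatype": "BYTES",
            "data": ['{"label": "POSITIVE", "score": 0.99}'],
        }
    ],
}
result = decode_bytes_outputs(sample_response)
```

The same function handles a text-generation or transcription response whose elements are plain strings, since those fall through the JSON check untouched.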

Theoretical Basis

The BYTES datatype in the V2 protocol handles variable-length opaque data including text strings. This bridges the gap between the tensor-oriented V2 protocol and NLP models that expect raw text input. The MLServer HuggingFace runtime internally tokenizes the text using the model's tokenizer.

The data flow for text inference is:

The client sends raw text in the "data" field with datatype: "BYTES".
The MLServer HuggingFace runtime receives the V2 request and extracts the text strings.
The runtime passes the text through the model's tokenizer to produce input tensors (token IDs, attention masks).
The tokenized input is fed through the model for inference.
The model output (logits or generated tokens) is decoded into a human-readable form such as labels or text.
The response is formatted as a V2 response with BYTES output tensors.
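
The steps above can be traced with a toy stand-in for the runtime. Everything in this sketch is hypothetical: the word lists, the whitespace tokenizer, and the scoring function replace the real tokenizer and model, and none of it reflects MLServer's actual code. It only shows where each stage sits between the V2 request and the V2 response.

```python
# Toy lexicon standing in for a trained model (hypothetical).
POSITIVE_WORDS = {"great", "good", "love"}
NEGATIVE_WORDS = {"bad", "terrible", "hate"}


def extract_texts(request):
    """Step 2: pull the text strings out of the V2 request."""
    tensor = request["inputs"][0]
    if tensor["datatype"] != "BYTES":
        raise ValueError("expected a BYTES input tensor")
    return tensor["data"]


def toy_tokenize(text):
    """Step 3: stand-in tokenizer (lowercase, split on whitespace)."""
    return [token.strip(".,!?") for token in text.lower().split()]


def toy_model(tokens):
    """Step 4: stand-in model returning a single score."""
    return sum((t in POSITIVE_WORDS) - (t in NEGATIVE_WORDS) for t in tokens)


def decode_label(score):
    """Step 5: turn the raw score into a human-readable label."""
    return "POSITIVE" if score >= 0 else "NEGATIVE"


def handle(request):
    """Steps 2 to 6: V2 request in, V2 response with a BYTES output out."""
    texts = extract_texts(request)
    labels = [decode_label(toy_model(toy_tokenize(t))) for t in texts]
    return {
        "outputs": [
            {
                "name": "output",
                "shape": [len(labels)],
                "datatype": "BYTES",
                "data": labels,
            }
        ]
    }


request = {
    "inputs": [
        {
            "name": "args",
            "shape": [2],
            "datatype": "BYTES",
            "data": ["I love this!", "This is terrible."],
        }
    ]
}
response = handle(request)
```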

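For gRPC transport, base64 encoding applies when the request is written as JSON, as in the CLI examples. The Protocol Buffer wire format carries bytes fields as raw bytes, but the protobuf JSON mapping represents a bytes field as a base64 string. A client that writes the gRPC request as JSON must therefore base64-encode the text before sending and decode base64 values found in the JSON response. A client that calls generated gRPC stubs directly passes raw bytes and needs no base64 step.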
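
A short sketch of the base64 round trip for a gRPC request expressed as JSON. The helper names are hypothetical, and the field layout (contents with bytes_contents) follows the field names of the V2 gRPC tensor message; treat it as an assumption to check against the deployed protocol version.

```python
import base64


def encode_text(text):
    """Encode a text string as base64 for a JSON-encoded bytes field."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_text(encoded):
    """Decode a base64 value from a JSON response back to text."""
    return base64.b64decode(encoded).decode("utf-8")


def build_grpc_json_request(texts, input_name="args"):
    """Build the JSON form of a gRPC inference request (hypothetical helper)."""
    texts = list(texts)
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": [len(texts)],
                "datatype": "BYTES",
                # Assumed field names, taken from the V2 gRPC tensor message.
                "contents": {"bytes_contents": [encode_text(t) for t in texts]},
            }
        ]
    }


grpc_body = build_grpc_json_request(["hello"])
round_trip = decode_text(grpc_body["inputs"][0]["contents"]["bytes_contents"][0])
```

Encoding to UTF-8 before base64 keeps non-ASCII text intact across the round trip.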

Usage

This principle applies when sending text or audio input to deployed HuggingFace models for inference, including:

Sending single or batched text strings for sentiment analysis
Providing text prompts for auto-regressive text generation
Submitting audio data for speech-to-text transcription via Whisper
Using either REST or gRPC transport protocols

Related Pages

SeldonIO_Seldon_core_Seldon_Model_Infer_BYTES -- implements this principle with concrete CLI examples for REST and gRPC
SeldonIO_Seldon_core_HuggingFace_Model_Deployment_And_Verification -- precedes this principle; models must be deployed and verified before inference
SeldonIO_Seldon_core_Multi_Modal_Pipeline_Composition -- extends this principle to multi-step pipelines chaining multiple models
SeldonIO_Seldon_core_V2_Inference_Protocol -- generalizes the V2 protocol used for all inference request types

