
Principle:Haifengl Smile Streaming Batch Prediction

From Leeroopedia


Overview

Streaming Batch Prediction is the principle of processing multiple inference requests as a reactive stream, where the client sends a sequence of input records (one per line) and the server returns predictions incrementally as they are computed, rather than buffering the entire batch and returning all results at once.

Theoretical Basis

Batch vs. Single-Request Inference

Model serving typically supports two interaction patterns:

Pattern          Description                                                 Use Case
Single request   One input, one prediction per HTTP request                  Real-time applications (web UI, mobile app)
Batch request    Multiple inputs, multiple predictions in one HTTP request   Offline scoring, data pipelines, ETL jobs

Batch inference is essential for production ML pipelines where thousands or millions of records must be scored. A naive implementation would accept the entire batch as a JSON array, process all records, and return all results in a single response. This approach has significant drawbacks:

  • Memory pressure -- the entire input and output must fit in memory simultaneously.
  • Latency -- the client must wait for the slowest record before seeing any results.
  • Timeout risk -- large batches may exceed HTTP timeout thresholds.

Reactive Streams

The Reactive Streams specification (an initiative inspired by the Reactive Manifesto) defines an asynchronous data processing model with four key properties:

  • Non-blocking -- the server processes items without blocking the main event loop.
  • Backpressure -- the producer (server) respects the consumer's (client's) consumption rate rather than overwhelming it.
  • Lazy evaluation -- items are processed on demand rather than eagerly.
  • Error propagation -- errors in the stream are propagated to the consumer rather than silently dropped.

Applied to batch prediction, reactive streams enable the server to:

  1. Read one input record from the request body stream.
  2. Compute the prediction for that record.
  3. Emit the result to the response stream immediately.
  4. Repeat until all records are processed.

This means the client receives the first prediction as soon as the first record is processed, while subsequent records are still being read and scored.
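The loop above can be sketched in plain Java. This is a minimal illustration, not Smile's actual endpoint code; the predict function is a placeholder standing in for a real model:

```java
import java.io.*;
import java.util.function.Function;

public class StreamingLoop {
    // Placeholder per-record scorer; stands in for a real Smile model.
    static Function<String, String> predict = line -> {
        double x = Double.parseDouble(line.trim());
        return Double.toString(x * 2.0);
    };

    // Read one record, predict, emit immediately; repeat until input ends.
    static void stream(BufferedReader in, Writer out) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {   // step 1: read one record
            String result = predict.apply(line);   // step 2: compute prediction
            out.write(result);                     // step 3: emit the result
            out.write('\n');
            out.flush();                           // push bytes before reading more
        }                                          // step 4: repeat
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new StringReader("1.5\n2.0\n4.25\n"));
        StringWriter out = new StringWriter();
        stream(in, out);
        System.out.print(out); // first line was available before the last was read
    }
}
```

Because each result is flushed before the next record is consumed, a client reading the response sees the first prediction while later records are still in flight.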

Time-to-First-Result

A critical metric for batch inference is time-to-first-result (TTFR). In a traditional batch approach, TTFR equals the time to process all records. With streaming, TTFR equals the time to process just the first record -- a dramatic improvement for large batches.

Approach         TTFR for N records     Memory Usage
Buffered batch   O(N * t_predict)       O(N) -- all inputs and outputs in memory
Streaming        O(t_predict)           O(1) -- one record at a time

where t_predict is the prediction time for a single record.
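A back-of-the-envelope calculation with illustrative numbers (not benchmarks of Smile) makes the gap concrete:

```java
public class TtfrDemo {
    public static void main(String[] args) {
        long n = 1_000_000;        // records in the batch (illustrative)
        double tPredict = 0.001;   // seconds per prediction (illustrative)

        double bufferedTtfr  = n * tPredict; // client waits for the whole batch
        double streamingTtfr = tPredict;     // client waits for one record

        System.out.printf("buffered TTFR:  ~%.0f s (~%.1f min)%n", bufferedTtfr, bufferedTtfr / 60);
        System.out.printf("streaming TTFR: ~%.3f s%n", streamingTtfr);
    }
}
```

With these numbers the buffered approach makes the client wait roughly a quarter of an hour for its first result; the streaming approach delivers it in about a millisecond.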

Line-Oriented Streaming Protocol

The streaming protocol uses a line-oriented format: each input record is a single line (either JSON or CSV), and each output prediction is a single line of text. This design choice has several advantages:

  • Simplicity -- no complex framing protocol is needed; newlines delimit records.
  • Multi-format support -- the same endpoint handles JSON and CSV inputs based on the Content-Type header.
  • Pipe-friendliness -- the text-based protocol works naturally with Unix pipes and command-line tools.
  • Incremental parsing -- each line can be parsed independently without maintaining parser state.
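The stateless per-line parsing described above can be sketched as follows. The method name and the hand-rolled JSON handling are illustrative only, not Smile's actual parser:

```java
import java.util.Arrays;

public class LineParser {
    // Parse one input line into feature values; the format is selected by the
    // request's Content-Type header. Each call is independent: no parser state
    // survives between lines.
    static double[] parseLine(String line, String contentType) {
        String body = line.trim();
        if (contentType.equals("application/json")) {
            // Minimal handling of a JSON array of numbers, e.g. "[1.0, 2.5]"
            // (a real server would use a JSON library).
            body = body.substring(1, body.length() - 1); // strip [ ]
        } else if (!contentType.equals("text/csv")) {
            throw new IllegalArgumentException("Unsupported Content-Type: " + contentType);
        }
        String[] fields = body.split(",");
        double[] x = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            x[i] = Double.parseDouble(fields[i].trim());
        }
        return x;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseLine("1.0,2.5,3.0", "text/csv")));
        System.out.println(Arrays.toString(parseLine("[1.0, 2.5, 3.0]", "application/json")));
    }
}
```

Both calls yield the same feature vector, which is what lets one endpoint serve both formats behind a single prediction loop.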

Worker Pool Offloading

In reactive server architectures (like Quarkus with Vert.x), the event loop thread must not be blocked by CPU-intensive work. ML model prediction, while typically fast per record, involves non-trivial computation. Streaming implementations must offload the prediction loop to a worker pool thread, keeping the event loop free for other connections.
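The offloading pattern can be sketched with a plain java.util.concurrent worker pool (a Quarkus/Vert.x server would use its own mechanism, such as Vert.x's executeBlocking; everything below is an illustrative stand-in, not Smile's code):

```java
import java.io.*;
import java.util.concurrent.*;

public class WorkerOffload {
    // Dedicated worker pool for CPU-bound scoring; daemon threads let the
    // JVM exit without an explicit shutdown (illustrative setup).
    static final ExecutorService workerPool = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    static double predict(double x) { return x * x; } // placeholder model

    // Called from the event loop: submit the blocking prediction loop to a
    // worker thread and return immediately with a Future for the result.
    static Future<String> handleBatch(BufferedReader in) {
        return workerPool.submit(() -> {
            StringBuilder out = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                out.append(predict(Double.parseDouble(line.trim()))).append('\n');
            }
            return out.toString();
        });
    }

    public static void main(String[] args) throws Exception {
        Future<String> result = handleBatch(
                new BufferedReader(new StringReader("2\n3\n")));
        // The caller (event-loop thread) is free here while a worker scores records.
        System.out.print(result.get());
    }
}
```

The key point is that handleBatch returns immediately: the event-loop thread never executes the read-predict-emit loop itself, so it can keep accepting other connections.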

Design Considerations

Backpressure and Flow Control

True reactive backpressure means the server reads the next input line only when the client is ready for more output. In practice, the line-by-line processing in Smile's streaming endpoint provides natural flow control: each line is read, predicted, and emitted before the next line is consumed.

Error Handling in Streams

Errors in a streaming context are more nuanced than in single-request processing. If the 50th record out of 1000 fails validation, the server must decide whether to:

  • Fail the entire stream -- emit an error and close the connection.
  • Skip the bad record -- emit an error marker and continue processing.

Smile's implementation chooses to fail the entire stream on any error, which is the safer default.
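The fail-fast policy can be sketched as follows (placeholder model and error message, not Smile's implementation): the first bad record aborts the loop, so no further records are read or scored, and the client sees the predictions emitted so far followed by an error.

```java
import java.io.*;

public class FailFastStream {
    static double predict(double x) { return x + 1.0; } // placeholder model

    // Fail the whole stream on the first bad record: stop emitting and
    // surface the error to the caller, which closes the response.
    static void stream(BufferedReader in, Writer out) throws IOException {
        String line;
        int lineNo = 0;
        while ((line = in.readLine()) != null) {
            lineNo++;
            try {
                out.write(Double.toString(predict(Double.parseDouble(line.trim()))));
                out.write('\n');
            } catch (NumberFormatException e) {
                // Abort the stream; records after this line are never processed.
                throw new IOException("invalid record at line " + lineNo + ": " + line, e);
            }
        }
    }

    public static void main(String[] args) {
        StringWriter out = new StringWriter();
        try {
            stream(new BufferedReader(new StringReader("1\noops\n3\n")), out);
        } catch (IOException e) {
            System.out.print(out); // predictions emitted before the failure
            System.out.println("stream failed: " + e.getMessage());
        }
    }
}
```

Skipping bad records instead would keep the stream alive but risks silently misaligning inputs and outputs, which is why failing the whole stream is the safer default.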

Content-Type Negotiation

Supporting both JSON and CSV as input formats via the Content-Type header provides flexibility. CSV is more compact and faster to parse for large numeric datasets, while JSON is self-describing and more compatible with web clients.

Knowledge Sources

Smile

Domains

MLOps, Model_Deployment, Reactive_Systems

Related

Implementation:Haifengl_Smile_Streaming_Prediction_API
