Principle: Smile (haifengl) Streaming Batch Prediction
Overview
Streaming Batch Prediction is the principle of processing multiple inference requests as a reactive stream, where the client sends a sequence of input records (one per line) and the server returns predictions incrementally as they are computed, rather than buffering the entire batch and returning all results at once.
Theoretical Basis
Batch vs. Single-Request Inference
Model serving typically supports two interaction patterns:
| Pattern | Description | Use Case |
|---|---|---|
| Single request | One input, one prediction per HTTP request | Real-time applications (web UI, mobile app) |
| Batch request | Multiple inputs, multiple predictions in one HTTP request | Offline scoring, data pipelines, ETL jobs |
Batch inference is essential for production ML pipelines where thousands or millions of records must be scored. A naive implementation would accept the entire batch as a JSON array, process all records, and return all results in a single response. This approach has significant drawbacks:
- Memory pressure -- the entire input and output must fit in memory simultaneously.
- Latency -- the client must wait for the slowest record before seeing any results.
- Timeout risk -- large batches may exceed HTTP timeout thresholds.
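As a hedged sketch of the naive approach (the "model" here is a stand-in string-length scorer, not Smile's API), note that nothing reaches the client until every record has been scored:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class BufferedBatch {
    // Buffered batch: read everything, score everything, return everything.
    // All inputs and outputs are held in memory at once (O(N)).
    static List<String> scoreAll(BufferedReader in, Function<String, String> predict)
            throws IOException {
        List<String> results = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            results.add(predict.apply(line));   // result buffered, not emitted
        }
        return results;                          // client waits until this point
    }
}
```

The `return results` line is where all three drawbacks above originate: memory holds the full list, latency is gated on the last record, and the connection must stay open the whole time.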
Reactive Streams
The Reactive Streams specification (an industry standard for asynchronous stream processing, adopted into Java 9 as java.util.concurrent.Flow) defines an asynchronous data processing model with four key properties:
- Non-blocking -- the server processes items without blocking the main event loop.
- Backpressure -- the consumer (client) signals demand, so the producer (server) never emits results faster than they can be absorbed.
- Lazy evaluation -- items are processed on demand rather than eagerly.
- Error propagation -- errors in the stream are propagated to the consumer rather than silently dropped.
Applied to batch prediction, reactive streams enable the server to:
- Read one input record from the request body stream.
- Compute the prediction for that record.
- Emit the result to the response stream immediately.
- Repeat until all records are processed.
This means the client receives the first prediction as soon as the first record is processed, while subsequent records are still being read and scored.
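The read-predict-emit loop above can be sketched as follows (the predictor is a hypothetical stand-in; the flush() after each line is what makes every prediction visible to the client immediately):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.util.function.Function;

public class StreamingBatch {
    // Streaming: read one line, predict, emit, flush, repeat.
    // Only one record is in flight at a time (O(1) memory).
    static void scoreStream(BufferedReader in, Writer out, Function<String, String> predict)
            throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            out.write(predict.apply(line));
            out.write('\n');
            out.flush();   // the client sees this prediction now, not at the end
        }
    }
}
```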
Time-to-First-Result
A critical metric for batch inference is time-to-first-result (TTFR). In a traditional batch approach, TTFR equals the time to process all records. With streaming, TTFR equals the time to process just the first record -- a dramatic improvement for large batches.
| Approach | TTFR for N records | Memory Usage |
|---|---|---|
| Buffered batch | O(N * t_predict) | O(N) -- all inputs and outputs in memory |
| Streaming | O(t_predict) | O(1) -- one record at a time |
where t_predict is the prediction time for a single record.
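To make the table concrete, a tiny arithmetic check (the 1 ms per-record time below is an assumed figure for illustration only):

```java
public class Ttfr {
    // Buffered: the first result arrives only after all N records are scored.
    static long bufferedTtfrMs(long n, long tPredictMs) {
        return n * tPredictMs;
    }

    // Streaming: the first result arrives after a single prediction.
    static long streamingTtfrMs(long tPredictMs) {
        return tPredictMs;
    }
}
```

With N = 1,000,000 records at 1 ms each, the buffered TTFR is about 1,000 seconds, while the streaming TTFR stays at 1 ms.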
Line-Oriented Streaming Protocol
The streaming protocol uses a line-oriented format: each input record is a single line (either JSON or CSV), and each output prediction is a single line of text. This design choice has several advantages:
- Simplicity -- no complex framing protocol is needed; newlines delimit records.
- Multi-format support -- the same endpoint handles JSON and CSV inputs based on the Content-Type header.
- Pipe-friendliness -- the text-based protocol works naturally with Unix pipes and command-line tools.
- Incremental parsing -- each line can be parsed independently without maintaining parser state.
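Incremental parsing can be illustrated with a per-line CSV parser (a hypothetical helper; Smile's actual CSV handling may differ): each line is a complete record, so no parser state survives between lines.

```java
public class LineParser {
    // Parse one CSV line of numeric features into a feature vector.
    // No state is carried from one line to the next, so lines can be
    // parsed (and scored) as soon as they arrive.
    static double[] parseCsvLine(String line) {
        String[] fields = line.split(",");
        double[] features = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            features[i] = Double.parseDouble(fields[i].trim());
        }
        return features;
    }
}
```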
Worker Pool Offloading
In reactive server architectures (like Quarkus with Vert.x), the event loop thread must not be blocked by CPU-intensive work. ML model prediction, while typically fast per record, involves non-trivial computation. Streaming implementations must offload the prediction loop to a worker pool thread, keeping the event loop free for other connections.
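A hedged sketch using a plain ExecutorService as a stand-in for the Vert.x/Quarkus worker pool (the real offloading API differs; the point is only that prediction runs off the event-loop thread):

```java
import java.util.concurrent.ExecutorService;

public class WorkerOffload {
    // The "event loop" only submits the job; the prediction (a stand-in
    // computation here, not a real model call) runs on a worker thread.
    static String scoreOnWorker(ExecutorService workers, String line) {
        try {
            return workers.submit(() -> "score:" + line.length()).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```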
Design Considerations
Backpressure and Flow Control
True reactive backpressure means the server only reads the next input line when the client is ready for more output. In practice, the line-by-line processing in Smile's streaming endpoint provides natural flow control: each line is read, predicted, and emitted before the next line is consumed.
Error Handling in Streams
Errors in a streaming context are more nuanced than in single-request processing. If the 50th record out of 1000 fails validation, the server must decide whether to:
- Fail the entire stream -- emit an error and close the connection.
- Skip the bad record -- emit an error marker and continue processing.
Smile's implementation chooses to fail the entire stream on any error, which is the safer default.
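A minimal sketch of the fail-fast choice (parsing an integer stands in for prediction): one bad record aborts the whole stream with a positional error, rather than emitting an error marker and continuing.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.function.Function;

public class FailFast {
    // Fail the entire stream on the first bad record; the alternative
    // design would append an error marker and keep going.
    static String scoreAll(BufferedReader in, Function<String, String> predict)
            throws IOException {
        StringBuilder out = new StringBuilder();
        String line;
        int n = 0;
        while ((line = in.readLine()) != null) {
            n++;
            try {
                out.append(predict.apply(line)).append('\n');
            } catch (RuntimeException e) {
                throw new IllegalStateException("record " + n + ": " + e.getMessage());
            }
        }
        return out.toString();
    }
}
```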
Content-Type Negotiation
Supporting both JSON and CSV as input formats via the Content-Type header provides flexibility. CSV is more compact and faster to parse for large numeric datasets, while JSON is self-describing and more compatible with web clients.
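A sketch of the dispatch on the Content-Type header (the media-type strings and the JSON default here are assumptions for illustration; the endpoint's actual negotiation may differ):

```java
public class Negotiation {
    // Map a Content-Type header value to an input format.
    static String formatFor(String contentType) {
        if (contentType == null) return "json";   // assumed default
        String ct = contentType.toLowerCase();
        if (ct.startsWith("text/csv")) return "csv";
        if (ct.startsWith("application/json")) return "json";
        throw new IllegalArgumentException("unsupported Content-Type: " + contentType);
    }
}
```

Matching on the prefix rather than the full header value lets parameters such as `charset=utf-8` pass through unchanged.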
Knowledge Sources
Domains
MLOps, Model_Deployment, Reactive_Systems