Principle: Predibase LoRAX Streaming Response Handling
| Knowledge Sources | |
|---|---|
| Domains | API_Design, Streaming |
| Last Updated | 2026-02-08 02:00 GMT |
Overview
A response delivery pattern that streams generated tokens to clients in real time using Server-Sent Events (SSE), with complete metadata delivered when the stream finishes.
Description
Streaming Response Handling addresses the user experience problem of long-running inference requests. Instead of waiting for the entire generation to complete, tokens are streamed as they are produced. This pattern uses:
- SSE (Server-Sent Events): HTTP-based streaming where each event contains a single token
- Non-streaming fallback: Full response returned as a single JSON object
- Response metadata: Finish reason, token counts, timing headers, and optional per-token logprobs
The response includes important metadata in HTTP headers: x-compute-time, x-total-time, x-prompt-tokens, x-generated-tokens, and x-adapter-id.
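Client-side, consuming this pattern amounts to parsing SSE "data:" frames and reading the accounting headers listed above. The sketch below is a minimal, transport-agnostic illustration of that parsing; the function names are hypothetical helpers, not part of the LoRAX client library.

```python
import json

def parse_sse_data(lines):
    """Extract JSON payloads from SSE 'data:' lines.

    Events in an SSE stream are separated by blank lines; each
    token event here is assumed to carry one JSON object.
    """
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):])

def summarize_headers(headers):
    """Pull the token-accounting headers named above from a response.

    HTTP header names are case-insensitive, so normalize before lookup.
    """
    h = {k.lower(): v for k, v in headers.items()}
    return {
        "prompt_tokens": int(h.get("x-prompt-tokens", 0)),
        "generated_tokens": int(h.get("x-generated-tokens", 0)),
        "adapter_id": h.get("x-adapter-id"),
    }
```

In a real client these helpers would wrap an HTTP library's streaming-lines iterator and response-headers mapping, respectively.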
Usage
Use streaming when building interactive applications (chatbots, code assistants) where time-to-first-token matters. Use non-streaming for batch processing where complete responses are needed.
Theoretical Basis
Pseudo-code:
# SSE streaming pattern
if request.stream:
    for token in generate_tokens(request):
        if not token.is_last:
            # intermediate event: token only
            yield SSEEvent(data=StreamResponse(
                token=token,
                generated_text=None,
            ))
        else:
            # final event: full text plus completion details
            yield SSEEvent(data=StreamResponse(
                token=token,
                generated_text=full_text,
                details=StreamDetails(finish_reason, token_count),
            ))
else:
    tokens = generate_all_tokens(request)
    return Response(generated_text=join(tokens), details=Details(...))