Principle: Predibase LoRAX Streaming Response Handling
| Knowledge Sources | |
|---|---|
| Domains | API_Design, Streaming |
| Last Updated | 2026-02-08 02:00 GMT |
Overview
A response delivery pattern that streams generated tokens to clients in real time using Server-Sent Events (SSE), with complete metadata delivered when the stream finishes.
Description
Streaming Response Handling addresses the user experience problem of long-running inference requests. Instead of waiting for the entire generation to complete, tokens are streamed as they are produced. This pattern uses:
- SSE (Server-Sent Events): HTTP-based streaming where each event contains a single token
- Non-streaming fallback: Full response returned as a single JSON object
- Response metadata: Finish reason, token counts, timing headers, and optional per-token logprobs
The response includes important metadata in HTTP headers: x-compute-time, x-total-time, x-prompt-tokens, x-generated-tokens, and x-adapter-id.
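Client-side, consuming this pattern amounts to parsing SSE "data:" frames and reading the accounting headers listed above. The sketch below is a minimal, transport-agnostic illustration of that parsing; the function names are hypothetical helpers, not part of the LoRAX client library.

```python
import json

def parse_sse_data(lines):
    """Extract JSON payloads from SSE 'data:' lines.

    Events in an SSE stream are separated by blank lines; each
    token event here is assumed to carry one JSON object.
    """
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):])

def summarize_headers(headers):
    """Pull the token-accounting headers named above from a response.

    HTTP header names are case-insensitive, so normalize before lookup.
    """
    h = {k.lower(): v for k, v in headers.items()}
    return {
        "prompt_tokens": int(h.get("x-prompt-tokens", 0)),
        "generated_tokens": int(h.get("x-generated-tokens", 0)),
        "adapter_id": h.get("x-adapter-id"),
    }
```

In a real client these helpers would wrap an HTTP library's streaming-lines iterator and response-headers mapping, respectively.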
Usage
Use streaming when building interactive applications (chatbots, code assistants) where time-to-first-token matters. Use non-streaming for batch processing where complete responses are needed.
Theoretical Basis
Pseudo-code:
# SSE streaming pattern
if request.stream:
    for token in generate_tokens(request):
        if not token.is_last:
            # intermediate event: token only
            yield SSEEvent(data=StreamResponse(
                token=token,
                generated_text=None,
            ))
        else:
            # final event: full text plus completion details
            yield SSEEvent(data=StreamResponse(
                token=token,
                generated_text=full_text,
                details=StreamDetails(finish_reason, token_count),
            ))
else:
    tokens = generate_all_tokens(request)
    return Response(generated_text=join(tokens), details=Details(...))