Principle:Eventual Inc Daft Streaming Row Iteration

From Leeroopedia


Knowledge Sources
Domains: Data_Engineering, Streaming
Last Updated: 2026-02-08 00:00 GMT

Overview

A technique for streaming DataFrame results row by row without materializing the entire dataset in memory.

Description

Streaming iteration yields rows one at a time from a buffered execution pipeline, which makes it possible to process datasets larger than memory. The buffer size is configurable and controls the tradeoff between throughput and memory consumption. Rows can be returned either in Python-native format (with type coercion) or as Arrow scalars for efficient handling of nested data.

Usage

Use streaming row iteration when you need to process results incrementally without loading all data into memory. Common scenarios include writing rows to an external system, streaming results to a client, or processing datasets that exceed available RAM.
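As a minimal sketch of the "writing rows to an external system" scenario, the helper below consumes any iterator of row dictionaries and flushes them in fixed-size batches, so at most one batch is held in memory at a time. The names `write_in_batches`, `fake_rows`, and the list-based sink are hypothetical and exist only for illustration. With Daft, the row source is assumed to be something like `df.iter_rows(results_buffer_size=...)`, which is not exercised here.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def write_in_batches(
    rows: Iterable[Row],
    sink: Callable[[List[Row]], None],
    batch_size: int = 1000,
) -> int:
    """Pull rows lazily and hand them to `sink` in batches; return the row count."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    it = iter(rows)
    total = 0
    while True:
        # islice pulls at most batch_size rows, so the source is never fully materialized
        batch = list(islice(it, batch_size))
        if not batch:
            break
        sink(batch)
        total += len(batch)
    return total


def fake_rows(n: int) -> Iterator[Row]:
    """Hypothetical stand-in for a streaming row source; yields one dict per row."""
    for i in range(n):
        yield {"id": i, "value": i * i}


# Usage: collect batches in a list in place of a real external system
written: List[List[Row]] = []
count = write_in_batches(fake_rows(5), written.append, batch_size=2)
print(count, [len(b) for b in written])
```

Because the helper accepts any iterable, the same code works whether the rows come from a generator, a file reader, or a DataFrame row iterator.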

Theoretical Basis

This is an iterator-based streaming pattern in which a bounded buffer of configurable size provides backpressure. The execution pipeline produces partitions asynchronously while the consumer pulls rows on demand:

buffer = BoundedQueue(size=results_buffer_size)

# Producer (async)
for each partition P in execution_plan:
    buffer.put(P)  # blocks when buffer full (backpressure)

# Consumer (iterator)
for each partition P in buffer:
    for each row R in P:
        yield {col_name: R[col_name] for each col_name in P.columns}

Setting results_buffer_size=None removes the buffer limit, allowing maximum throughput at the cost of higher memory usage.
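The pseudocode above can be sketched in runnable form with the standard library, assuming a background thread as the producer and `queue.Queue` as the bounded buffer. This is an illustration of the pattern, not Daft's actual implementation. The function name `iter_rows_sketch`, the `Partition` type, and the sentinel are hypothetical.

```python
import queue
import threading
from typing import Any, Dict, Iterable, Iterator, List, Optional

Partition = List[Dict[str, Any]]
_DONE = object()  # sentinel marking the end of the partition stream


def iter_rows_sketch(
    partitions: Iterable[Partition],
    results_buffer_size: Optional[int] = 2,
) -> Iterator[Dict[str, Any]]:
    """Yield rows one at a time while a producer thread fills a bounded buffer."""
    # queue.Queue treats maxsize=0 as unbounded, which models results_buffer_size=None
    buffer: "queue.Queue[Any]" = queue.Queue(maxsize=results_buffer_size or 0)

    def produce() -> None:
        try:
            for partition in partitions:
                buffer.put(partition)  # blocks while the buffer is full (backpressure)
        finally:
            buffer.put(_DONE)  # always signal completion so the consumer can stop

    # Daemon thread: it will not keep the process alive if the consumer stops early
    threading.Thread(target=produce, daemon=True).start()

    while True:
        partition = buffer.get()  # blocks until the producer delivers a partition
        if partition is _DONE:
            return
        for row in partition:
            yield dict(row)


# Usage: three rows spread over two partitions, with a buffer of one partition
parts = [[{"x": 1}, {"x": 2}], [{"x": 3}]]
rows = list(iter_rows_sketch(parts, results_buffer_size=1))
print(rows)
```

With a small `results_buffer_size`, the producer stalls as soon as the consumer falls behind, which bounds memory to roughly that many partitions. With `None`, the producer runs ahead freely. This sketch does not propagate producer exceptions to the consumer, which a production version would need to handle.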

Related Pages

Implemented By
