Principle:Eventual Inc Daft Streaming Row Iteration
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Streaming |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Technique for streaming DataFrame results row-by-row without materializing the entire dataset in memory.
Description
Streaming iteration yields rows one at a time from a buffered execution pipeline, which makes it possible to process datasets larger than memory. The buffer size is configurable and controls the tradeoff between throughput and memory consumption. Rows can be returned either in Python-native format (with type coercion) or as Arrow scalars, which handle nested data more efficiently.
Usage
Use streaming row iteration when you need to process results incrementally without loading all data into memory. Common scenarios include writing rows to an external system, streaming results to a client, or processing datasets that exceed available RAM.
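As an illustration of the "write rows to an external system" scenario, the following minimal sketch consumes any iterator of row dictionaries and writes it out as JSON Lines in small batches, so only one batch is held in memory at a time. The helper names (batched_rows, write_jsonl, fake_row_stream) are hypothetical and not part of Daft. The generator stands in for a real streaming source; with Daft, the row source would presumably be the DataFrame's row iterator.

```python
import io
import json
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List, TextIO

Row = Dict[str, Any]


def batched_rows(rows: Iterable[Row], batch_size: int) -> Iterator[List[Row]]:
    """Group a row iterator into lists of at most batch_size rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def write_jsonl(rows: Iterable[Row], sink: TextIO, batch_size: int = 2) -> int:
    """Write rows to sink as JSON Lines, one write per batch; return the row count."""
    written = 0
    for batch in batched_rows(rows, batch_size):
        sink.write("".join(json.dumps(row) + "\n" for row in batch))
        written += len(batch)
    return written


def fake_row_stream(n: int) -> Iterator[Row]:
    # Hypothetical stand-in for a streaming row source (assumption, not a Daft API).
    for i in range(n):
        yield {"id": i, "value": i * i}


sink = io.StringIO()
count = write_jsonl(fake_row_stream(5), sink, batch_size=2)
```

Because write_jsonl only ever pulls batch_size rows ahead of what it has written, peak memory depends on the batch size rather than on the total number of rows.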
Theoretical Basis
This is an iterator-based streaming pattern in which a bounded buffer of configurable size provides backpressure. The execution pipeline produces partitions asynchronously while the consumer pulls rows on demand:
buffer = BoundedQueue(size=results_buffer_size)

# Producer (async)
for each partition P in execution_plan:
    buffer.put(P)  # blocks when buffer full (backpressure)

# Consumer (iterator)
for each partition P in buffer:
    for each row R in P:
        yield dict(col_name -> R[col_name])
Setting results_buffer_size=None removes the buffer limit, allowing maximum throughput at the cost of higher memory usage.
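The pseudocode above can be made concrete with the standard library. The following is a minimal sketch of the pattern, not Daft's actual implementation: it assumes partitions are plain lists of row dictionaries, uses a background thread as the producer, and maps results_buffer_size=None to an unbounded queue. The function name stream_rows and the sentinel _DONE are hypothetical.

```python
import queue
import threading
from typing import Any, Dict, Iterator, List, Optional

Row = Dict[str, Any]
Partition = List[Row]

_DONE = object()  # sentinel that marks the end of the stream


def stream_rows(
    partitions: List[Partition],
    results_buffer_size: Optional[int] = 1,
) -> Iterator[Row]:
    """Yield rows one at a time while a producer thread fills a bounded buffer."""
    # queue.Queue treats maxsize <= 0 as unbounded, so None maps to 0 here.
    buffer: queue.Queue = queue.Queue(maxsize=results_buffer_size or 0)

    def produce() -> None:
        for partition in partitions:
            buffer.put(partition)  # blocks while the buffer is full (backpressure)
        buffer.put(_DONE)

    # Daemon thread so an abandoned iterator does not keep the process alive.
    threading.Thread(target=produce, daemon=True).start()

    while True:
        partition = buffer.get()  # blocks until the producer has a partition ready
        if partition is _DONE:
            return
        for row in partition:
            yield dict(row)


parts = [[{"a": 1}, {"a": 2}], [{"a": 3}]]
bounded = list(stream_rows(parts, results_buffer_size=1))
unbounded = list(stream_rows(parts, results_buffer_size=None))
```

With results_buffer_size=1 the producer can run at most one partition ahead of the consumer. With None it can run arbitrarily far ahead, which is the throughput-for-memory trade described above.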