Principle: Polars Streaming Sink Execution
| Knowledge Sources | |
|---|---|
| Domains | Data Engineering, Streaming |
| Last Updated | 2026-02-09 10:00 GMT |
Overview
Writing streaming query results directly to output files without fully materializing the result in memory, enabling end-to-end out-of-core data transformation.
Description
While `collect(engine="streaming")` processes data in batches, it still materializes the final result as an in-memory DataFrame. For truly end-to-end out-of-core workflows -- where neither the input nor the output fits in memory -- sink operations write streaming results directly to disk or cloud storage as batches are processed.
The streaming sink execution principle works as follows:
- Direct-to-disk writing -- Sink operations (`sink_parquet`, `sink_ipc`, `sink_csv`) write each processed batch directly to the output file as it completes. The final result is never fully materialized in memory, making it possible to transform a 100 GB input into a 100 GB output on a machine with only a few GB of RAM.
- Partitioned output via PartitionBy -- The `path` parameter can accept a `pl.PartitionBy` object that specifies Hive-style partitioning. This splits the output into multiple files based on column values and/or a maximum row count per file (`max_rows_per_file`), which is essential for producing datasets that are themselves scannable by downstream streaming pipelines.
- Multiplexed writes -- A single query can be written to multiple sinks simultaneously. Passing `lazy=True` to a sink method returns a `LazyFrame` instead of executing immediately; multiple such lazy sinks can be collected together via `pl.collect_all()`, ensuring the source data is read only once while being written to multiple output formats or locations.
- Format flexibility -- Sink operations support Parquet, IPC (Arrow/Feather), and CSV output formats, enabling pipelines to convert between formats as part of the streaming workflow.
Sink operations are the exit point for out-of-core pipelines, complementing scan operations (the entry point). Together, they form a complete streaming data transformation architecture.
Usage
Use streaming sink execution whenever you need to:
- Transform large datasets without materializing the result in memory.
- Write query output directly to Parquet, IPC, or CSV files.
- Produce partitioned output datasets for downstream consumption.
- Write a single query result to multiple output formats simultaneously.
- Build end-to-end ETL pipelines on machines with limited memory.
Theoretical Basis
Streaming sink execution draws on concepts from data pipeline sinks, partitioned output, and multiplexed writes in distributed data processing.
Data pipeline sinks are terminal operators that consume processed data and write it to persistent storage. In the streaming model, the sink operates as a consumer in a producer-consumer pattern:
while (batch := pipeline.next()) is not None:
    sink.write(batch)
sink.finalize()  # flush buffers, write metadata
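The loop above can be made concrete with a small, library-agnostic sketch. The `FileSink` class and the batch source are illustrative stand-ins, not Polars APIs:

```python
from typing import Iterable, List


class FileSink:
    """Toy sink: records each batch as it arrives, then marks completion."""

    def __init__(self) -> None:
        self.lines: List[str] = []
        self.finalized = False

    def write(self, batch: List[int]) -> None:
        # A real sink would append the batch to a file on disk here.
        self.lines.append(",".join(map(str, batch)))

    def finalize(self) -> None:
        # A real sink would flush buffers and write trailing metadata
        # (e.g. the Parquet footer) here.
        self.finalized = True


def run_pipeline(batches: Iterable[List[int]], sink: FileSink) -> None:
    for batch in batches:  # producer yields batches one at a time
        sink.write(batch)  # consumer persists each batch immediately
    sink.finalize()


sink = FileSink()
run_pipeline([[1, 2], [3, 4], [5]], sink)
print(sink.lines)  # ['1,2', '3,4', '5']
```

Only one batch is in flight at a time, which is what keeps memory usage flat regardless of total output size.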
Partitioned output organizes data into a directory structure based on column values (Hive-style partitioning) or row count limits:
PartitionBy(base_path, by=["year", "month"], max_rows_per_file=N)
-> base_path/year=2024/month=01/part-0.parquet
-> base_path/year=2024/month=02/part-0.parquet
-> ...
This structure enables efficient partition pruning when the output is later scanned by another streaming pipeline.
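The path construction behind this layout can be sketched with a small stdlib helper (an illustrative function, not a Polars API):

```python
from typing import Dict, List


def partition_path(base: str, row: Dict[str, object], by: List[str], part: int = 0) -> str:
    """Compute the Hive-style output path for a row: one `col=value`
    directory segment per partition column, then a part file."""
    segments = [f"{col}={row[col]}" for col in by]
    return "/".join([base, *segments, f"part-{part}.parquet"])


path = partition_path("base_path", {"year": 2024, "month": "01"}, ["year", "month"])
print(path)  # base_path/year=2024/month=01/part-0.parquet
```

Because the partition values are encoded in the directory names, a later scan can skip entire directories whose values fail a filter predicate, without opening any files.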
Multiplexed writes allow a single data stream to feed multiple sinks without re-reading the source:
q1 = pipeline.sink_parquet("out.parquet", lazy=True)
q2 = pipeline.sink_csv("out.csv", lazy=True)
pl.collect_all([q1, q2]) # single read, two writes
This is equivalent to the tee pattern in Unix pipelines, where a single input stream is duplicated to multiple outputs.
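The tee pattern itself can be sketched in a few lines of library-agnostic Python; the sink classes here are illustrative stand-ins for format-specific writers:

```python
import json
from typing import Iterable, List


def tee(batches: Iterable[List[int]], sinks: list) -> None:
    """Feed each batch from a single stream to every sink:
    one pass over the source, N writes per batch."""
    for batch in batches:
        for sink in sinks:
            sink.write(batch)


class CsvSink:
    def __init__(self) -> None:
        self.out: List[str] = []

    def write(self, batch: List[int]) -> None:
        self.out.append(",".join(map(str, batch)))


class JsonSink:
    def __init__(self) -> None:
        self.out: List[str] = []

    def write(self, batch: List[int]) -> None:
        self.out.append(json.dumps(batch))


csv_sink, json_sink = CsvSink(), JsonSink()
tee(iter([[1, 2], [3]]), [csv_sink, json_sink])  # source iterated exactly once
print(csv_sink.out)   # ['1,2', '3']
print(json_sink.out)  # ['[1, 2]', '[3]']
```

Note that the source is an iterator consumed only once; each batch is duplicated to the sinks rather than the stream being re-read, which is what `pl.collect_all` on lazy sinks achieves for Polars queries.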
| Concept | Relevance |
|---|---|
| Data pipeline sinks | Terminal operators that write processed batches directly to storage |
| Partitioned output | Hive-style directory partitioning for efficient downstream scanning |
| Multiplexed writes | Single input stream feeding multiple output sinks simultaneously |
| Producer-consumer pattern | Pipeline produces batches that the sink consumes and persists |