
Principle: Polars Streaming Sink Execution

From Leeroopedia


Knowledge Sources
Domains: Data Engineering, Streaming
Last Updated: 2026-02-09 10:00 GMT

Overview

Streaming sink execution writes streaming query results directly to output files without fully materializing the result in memory, enabling end-to-end out-of-core data transformation.

Description

While collect(engine="streaming") processes data in batches, it still materializes the final result as an in-memory DataFrame. For truly end-to-end out-of-core workflows -- where neither the input nor the output fits in memory -- sink operations write streaming results directly to disk or cloud storage as batches are processed.

The streaming sink execution principle works as follows:

  1. Direct-to-disk writing -- Sink operations (sink_parquet, sink_ipc, sink_csv) write each processed batch directly to the output file as it completes. The final result is never fully materialized in memory, making it possible to transform a 100 GB input into a 100 GB output on a machine with only a few GB of RAM.
  2. Partitioned output via PartitionBy -- The path parameter can accept a pl.PartitionBy object that specifies Hive-style partitioning. This splits the output into multiple files based on column values and/or a maximum row count per file (max_rows_per_file), which is essential for producing datasets that are themselves scannable by downstream streaming pipelines.
  3. Multiplexed writes -- A single query can be written to multiple sinks simultaneously. When lazy=True is passed to a sink method, the method returns a LazyFrame instead of executing immediately. Multiple such lazy sinks can then be collected together via pl.collect_all(), ensuring the source data is read only once while being written to multiple output formats or locations.
  4. Format flexibility -- Sink operations support Parquet, IPC (Arrow/Feather), and CSV output formats, enabling pipelines to transform between formats as part of the streaming workflow.

Sink operations are the exit point for out-of-core pipelines, complementing scan operations (the entry point). Together, they form a complete streaming data transformation architecture.
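The direct-to-disk behavior of step 1 can be sketched without Polars, using only the standard library. The `batches` producer and `sink_csv_batches` helper below are hypothetical stand-ins, not Polars APIs; Polars performs the equivalent internally in its sink operations.

```python
import csv
import io

def batches(n_batches, batch_size):
    """Illustrative producer: yields one small batch of rows at a time,
    standing in for a streaming engine that never holds the full result."""
    for b in range(n_batches):
        start = b * batch_size
        yield [(start + i, (start + i) ** 2) for i in range(batch_size)]

def sink_csv_batches(batch_iter, fh):
    """Write each batch to the open file as it arrives. Peak memory is
    one batch, regardless of total output size."""
    writer = csv.writer(fh)
    writer.writerow(["x", "x_squared"])  # header once, up front
    rows = 0
    for batch in batch_iter:
        writer.writerows(batch)  # the batch can be dropped after writing
        rows += len(batch)
    return rows

buf = io.StringIO()  # stands in for a file on disk
n = sink_csv_batches(batches(n_batches=3, batch_size=4), buf)
print(n)  # 12 rows written, but at most 4 were ever in memory
```

The same shape scales to arbitrarily large outputs: total output size is bounded by disk, not RAM.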

Usage

Use streaming sink execution whenever you need to:

  • Transform large datasets without materializing the result in memory.
  • Write query output directly to Parquet, IPC, or CSV files.
  • Produce partitioned output datasets for downstream consumption.
  • Write a single query result to multiple output formats simultaneously.
  • Build end-to-end ETL pipelines on machines with limited memory.

Theoretical Basis

Streaming sink execution draws on concepts from data pipeline sinks, partitioned output, and multiplexed writes in distributed data processing.

Data pipeline sinks are terminal operators that consume processed data and write it to persistent storage. In the streaming model, the sink operates as a consumer in a producer-consumer pattern:

while (batch := pipeline.next()) is not None:
    sink.write(batch)
sink.finalize()  # flush buffers, write metadata
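This consumer loop can be made runnable as a minimal sketch; the `CountingSink` class and three-batch `pipeline` below are illustrative stand-ins, not Polars internals.

```python
class CountingSink:
    """Toy sink: consumes batches and tracks what a real sink would
    persist. finalize() stands in for flushing buffers and writing
    footer metadata (e.g. the Parquet file footer)."""
    def __init__(self):
        self.rows_written = 0
        self.finalized = False

    def write(self, batch):
        self.rows_written += len(batch)

    def finalize(self):
        self.finalized = True  # a real sink flushes and writes metadata here

pipeline = iter([[1, 2, 3], [4, 5], [6]])  # producer: three batches

sink = CountingSink()
while (batch := next(pipeline, None)) is not None:
    sink.write(batch)
sink.finalize()

print(sink.rows_written)  # 6
```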

Partitioned output organizes data into a directory structure based on column values (Hive-style partitioning) or row count limits:

PartitionBy(base_path, by=["year", "month"], max_rows_per_file=N)
  -> base_path/year=2024/month=01/part-0.parquet
  -> base_path/year=2024/month=02/part-0.parquet
  -> ...

This structure enables efficient partition pruning when the output is later scanned by another streaming pipeline.
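A minimal sketch of such a partitioned writer, using only the standard library; the `write_partitioned` helper is hypothetical, and Polars handles the equivalent internally when a partitioning scheme is passed to a sink.

```python
import csv
from collections import defaultdict
from pathlib import Path
from tempfile import TemporaryDirectory

def write_partitioned(rows, base_path, by, max_rows_per_file):
    """Illustrative Hive-style partitioned writer: groups rows by the
    partition columns, then writes at most max_rows_per_file rows per
    part file under base_path/col=value/... directories."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple((col, row[col]) for col in by)].append(row)
    written = []
    for key, group in groups.items():
        part_dir = Path(base_path).joinpath(*(f"{c}={v}" for c, v in key))
        part_dir.mkdir(parents=True, exist_ok=True)
        # roll over to a new part file every max_rows_per_file rows
        for part, start in enumerate(range(0, len(group), max_rows_per_file)):
            path = part_dir / f"part-{part}.csv"
            with path.open("w", newline="") as fh:
                writer = csv.DictWriter(fh, fieldnames=group[0].keys())
                writer.writeheader()
                writer.writerows(group[start:start + max_rows_per_file])
            written.append(path)
    return written

rows = [
    {"year": 2024, "month": 1, "value": 10},
    {"year": 2024, "month": 1, "value": 20},
    {"year": 2024, "month": 2, "value": 30},
]
with TemporaryDirectory() as tmp:
    paths = write_partitioned(rows, tmp, by=["year", "month"], max_rows_per_file=1)
    # year=2024/month=1 gets two part files, year=2024/month=2 gets one
    print([p.relative_to(tmp).as_posix() for p in paths])
```

Because each partition directory encodes its column values in the path, a later scan can skip whole directories that fail a filter, without opening any files.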

Multiplexed writes allow a single data stream to feed multiple sinks without re-reading the source:

q1 = pipeline.sink_parquet("out.parquet", lazy=True)
q2 = pipeline.sink_csv("out.csv", lazy=True)
pl.collect_all([q1, q2])  # single read, two writes

This is equivalent to the tee pattern in Unix pipelines, where a single input stream is duplicated to multiple outputs.
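The tee pattern itself fits in a few lines of standard-library Python; the `tee_sinks` helper is hypothetical and mirrors what pl.collect_all() achieves for lazy sinks.

```python
def tee_sinks(source, sinks):
    """Single pass over the source: each batch is handed to every sink,
    so the input is read once regardless of how many outputs exist."""
    for batch in source:
        for sink in sinks:
            sink.append(list(batch))  # copy per sink; each output is independent
    return sinks

batches = iter([[1, 2], [3, 4, 5]])  # an iterator can only be consumed once
parquet_like, csv_like = tee_sinks(batches, ([], []))
print(parquet_like == csv_like)  # True: both sinks saw the same two batches
```

Without the tee, writing two outputs would require two full passes over the source, doubling I/O on the (potentially huge) input.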

Concept -- Relevance
Data pipeline sinks -- Terminal operators that write processed batches directly to storage
Partitioned output -- Hive-style directory partitioning for efficient downstream scanning
Multiplexed writes -- Single input stream feeding multiple output sinks simultaneously
Producer-consumer pattern -- Pipeline produces batches that the sink consumes and persists

Related Pages

Implemented By
