
Principle: Polars Streaming Sink Execution

From Leeroopedia


Knowledge Sources
Domains: Data Engineering, Streaming
Last Updated: 2026-02-09 10:00 GMT

Overview

Streaming sink execution writes streaming query results directly to output files without fully materializing the result in memory, enabling end-to-end out-of-core data transformation.

Description

While collect(engine="streaming") processes data in batches, it still materializes the final result as an in-memory DataFrame. For truly end-to-end out-of-core workflows -- where neither the input nor the output fits in memory -- sink operations write streaming results directly to disk or cloud storage as batches are processed.

The streaming sink execution principle works as follows:

  1. Direct-to-disk writing -- Sink operations (sink_parquet, sink_ipc, sink_csv) write each processed batch directly to the output file as it completes. The final result is never fully materialized in memory, making it possible to transform a 100 GB input into a 100 GB output on a machine with only a few GB of RAM.
  2. Partitioned output via PartitionBy -- The path parameter can accept a pl.PartitionBy object that specifies Hive-style partitioning. This splits the output into multiple files based on column values and/or a maximum row count per file (max_rows_per_file), which is essential for producing datasets that are themselves scannable by downstream streaming pipelines.
  3. Multiplexed writes -- A single query can be written to multiple sinks simultaneously. When lazy=True is passed to a sink method, the method returns a LazyFrame instead of executing immediately. Multiple such lazy sinks can then be collected together via pl.collect_all(), ensuring the source data is read only once while being written to multiple output formats or locations.
  4. Format flexibility -- Sink operations support Parquet, IPC (Arrow/Feather), and CSV output formats, enabling pipelines to transform between formats as part of the streaming workflow.

Sink operations are the exit point for out-of-core pipelines, complementing scan operations (the entry point). Together, they form a complete streaming data transformation architecture.
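The direct-to-disk behavior of step 1 can be sketched without Polars, using only the standard library. The `batches` producer and `sink_csv_batches` helper below are hypothetical stand-ins, not Polars APIs; Polars performs the equivalent internally in its sink operations.

```python
import csv
import io

def batches(n_batches, batch_size):
    """Illustrative producer: yields one small batch of rows at a time,
    standing in for a streaming engine that never holds the full result."""
    for b in range(n_batches):
        start = b * batch_size
        yield [(start + i, (start + i) ** 2) for i in range(batch_size)]

def sink_csv_batches(batch_iter, fh):
    """Write each batch to the open file as it arrives. Peak memory is
    one batch, regardless of total output size."""
    writer = csv.writer(fh)
    writer.writerow(["x", "x_squared"])  # header once, up front
    rows = 0
    for batch in batch_iter:
        writer.writerows(batch)  # the batch can be dropped after writing
        rows += len(batch)
    return rows

buf = io.StringIO()  # stands in for a file on disk
n = sink_csv_batches(batches(n_batches=3, batch_size=4), buf)
print(n)  # 12 rows written, but at most 4 were ever in memory
```

The same shape scales to arbitrarily large outputs: total output size is bounded by disk, not RAM.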

Usage

Use streaming sink execution whenever you need to:

  • Transform large datasets without materializing the result in memory.
  • Write query output directly to Parquet, IPC, or CSV files.
  • Produce partitioned output datasets for downstream consumption.
  • Write a single query result to multiple output formats simultaneously.
  • Build end-to-end ETL pipelines on machines with limited memory.

Theoretical Basis

Streaming sink execution draws on concepts from data pipeline sinks, partitioned output, and multiplexed writes in distributed data processing.

Data pipeline sinks are terminal operators that consume processed data and write it to persistent storage. In the streaming model, the sink operates as a consumer in a producer-consumer pattern:

while (batch := pipeline.next()) is not None:
    sink.write(batch)
sink.finalize()  # flush buffers, write metadata
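This consumer loop can be made runnable as a minimal sketch; the `CountingSink` class and three-batch `pipeline` below are illustrative stand-ins, not Polars internals.

```python
class CountingSink:
    """Toy sink: consumes batches and tracks what a real sink would
    persist. finalize() stands in for flushing buffers and writing
    footer metadata (e.g. the Parquet file footer)."""
    def __init__(self):
        self.rows_written = 0
        self.finalized = False

    def write(self, batch):
        self.rows_written += len(batch)

    def finalize(self):
        self.finalized = True  # a real sink flushes and writes metadata here

pipeline = iter([[1, 2, 3], [4, 5], [6]])  # producer: three batches

sink = CountingSink()
while (batch := next(pipeline, None)) is not None:
    sink.write(batch)
sink.finalize()

print(sink.rows_written)  # 6
```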

Partitioned output organizes data into a directory structure based on column values (Hive-style partitioning) or row count limits:

PartitionBy(base_path, by=["year", "month"], max_rows_per_file=N)
  -> base_path/year=2024/month=01/part-0.parquet
  -> base_path/year=2024/month=02/part-0.parquet
  -> ...

This structure enables efficient partition pruning when the output is later scanned by another streaming pipeline.
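A minimal sketch of such a partitioned writer, using only the standard library; the `write_partitioned` helper is hypothetical, and Polars handles the equivalent internally when a partitioning scheme is passed to a sink.

```python
import csv
from collections import defaultdict
from pathlib import Path
from tempfile import TemporaryDirectory

def write_partitioned(rows, base_path, by, max_rows_per_file):
    """Illustrative Hive-style partitioned writer: groups rows by the
    partition columns, then writes at most max_rows_per_file rows per
    part file under base_path/col=value/... directories."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple((col, row[col]) for col in by)].append(row)
    written = []
    for key, group in groups.items():
        part_dir = Path(base_path).joinpath(*(f"{c}={v}" for c, v in key))
        part_dir.mkdir(parents=True, exist_ok=True)
        # roll over to a new part file every max_rows_per_file rows
        for part, start in enumerate(range(0, len(group), max_rows_per_file)):
            path = part_dir / f"part-{part}.csv"
            with path.open("w", newline="") as fh:
                writer = csv.DictWriter(fh, fieldnames=group[0].keys())
                writer.writeheader()
                writer.writerows(group[start:start + max_rows_per_file])
            written.append(path)
    return written

rows = [
    {"year": 2024, "month": 1, "value": 10},
    {"year": 2024, "month": 1, "value": 20},
    {"year": 2024, "month": 2, "value": 30},
]
with TemporaryDirectory() as tmp:
    paths = write_partitioned(rows, tmp, by=["year", "month"], max_rows_per_file=1)
    # year=2024/month=1 gets two part files, year=2024/month=2 gets one
    print([p.relative_to(tmp).as_posix() for p in paths])
```

Because each partition directory encodes its column values in the path, a later scan can skip whole directories that fail a filter, without opening any files.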

Multiplexed writes allow a single data stream to feed multiple sinks without re-reading the source:

q1 = pipeline.sink_parquet("out.parquet", lazy=True)
q2 = pipeline.sink_csv("out.csv", lazy=True)
pl.collect_all([q1, q2])  # single read, two writes

This is equivalent to the tee pattern in Unix pipelines, where a single input stream is duplicated to multiple outputs.
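The tee pattern itself fits in a few lines of standard-library Python; the `tee_sinks` helper is hypothetical and mirrors what pl.collect_all() achieves for lazy sinks.

```python
def tee_sinks(source, sinks):
    """Single pass over the source: each batch is handed to every sink,
    so the input is read once regardless of how many outputs exist."""
    for batch in source:
        for sink in sinks:
            sink.append(list(batch))  # copy per sink; each output is independent
    return sinks

batches = iter([[1, 2], [3, 4, 5]])  # an iterator can only be consumed once
parquet_like, csv_like = tee_sinks(batches, ([], []))
print(parquet_like == csv_like)  # True: both sinks saw the same two batches
```

Without the tee, writing two outputs would require two full passes over the source, doubling I/O on the (potentially huge) input.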

Concept -- Relevance
Data pipeline sinks -- Terminal operators that write processed batches directly to storage
Partitioned output -- Hive-style directory partitioning for efficient downstream scanning
Multiplexed writes -- Single input stream feeding multiple output sinks simultaneously
Producer-consumer pattern -- Pipeline produces batches that the sink consumes and persists

Related Pages

Implemented By
