Principle:Pola rs Polars Streaming Query Building

From Leeroopedia


Knowledge Sources
Domains: Data Engineering, Streaming
Last Updated: 2026-02-09 10:00 GMT

Overview

Constructing query pipelines on lazily-scanned data using the same expression API as in-memory queries, enabling identical code for both in-memory and streaming execution.

Description

A fundamental design goal of Polars is that the query-building API is execution-engine agnostic. Whether a query will ultimately run in-memory or in streaming mode, the code that constructs the pipeline is identical. This means practitioners can prototype on small data using collect() and later switch to collect(engine="streaming") for production-scale data without changing a single line of query logic.

The streaming query building principle works as follows:

  1. Unified logical plan -- Every lazy operation (filter, select, with_columns, group_by, join, sort) appends a node to the same logical plan DAG regardless of the intended execution engine. The plan is a pure description of what to compute, not how to compute it.
  2. Deferred execution -- No computation occurs when query methods are called. Each method returns a new LazyFrame with an extended plan. This allows the full pipeline to be assembled before any optimization or execution begins.
  3. Engine-independent optimization -- The query optimizer applies the same transformations (predicate pushdown, projection pushdown, common subexpression elimination, slice pushdown) to the logical plan regardless of whether the streaming or in-memory engine will execute it.
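  4. Expression composability -- The pl.col() expression system supports arbitrary compositions of column references, aggregations, window functions, and user-defined functions. These expressions are embedded in plan nodes and, where the streaming engine supports the operation, evaluated batch by batch during streaming execution. Operations the streaming engine does not support fall back to the in-memory engine, so the query still runs unchanged.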

This design eliminates the common pain point in data engineering where "streaming code" and "batch code" are entirely separate codebases. In Polars, the only difference between in-memory and streaming execution is the engine argument passed to the terminal collect() call.

Usage

Use streaming query building whenever you need to:

  • Construct a multi-step data transformation pipeline on lazily-scanned data.
  • Build a query that will run identically in both in-memory and streaming modes.
  • Leverage optimizer passes (predicate pushdown, projection pushdown) to minimize I/O.
  • Compose filters, selections, aggregations, joins, and sorts in a single fluent pipeline.

Theoretical Basis

The streaming query building principle is rooted in query plan compilation from relational database theory.

A query plan is a directed acyclic graph (DAG) where:

Plan = Node(operation, children: list[Plan])
  where operation in {Scan, Filter, Select, GroupBy, Join, Sort, ...}

Each LazyFrame method creates a new plan node:

lf.filter(pred)        -> Node(Filter(pred), [lf.plan])
lf.select(exprs)       -> Node(Select(exprs), [lf.plan])
lf.group_by(by).agg(a) -> Node(GroupBy(by, a), [lf.plan])
lf.join(other, on)     -> Node(Join(on), [lf.plan, other.plan])
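The node-construction rules above can be modeled in a few lines of plain Python. This is a toy model for intuition only, not how Polars represents plans internally; the names Node, ToyLazyFrame, and render are invented for this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One plan node: an operation, its arguments, and its child plans."""
    operation: str
    args: tuple = ()
    children: tuple[Node, ...] = ()


@dataclass(frozen=True)
class ToyLazyFrame:
    """Hypothetical stand-in for a LazyFrame: it only holds a plan."""
    plan: Node

    def filter(self, pred: str) -> ToyLazyFrame:
        return ToyLazyFrame(Node("Filter", (pred,), (self.plan,)))

    def select(self, *cols: str) -> ToyLazyFrame:
        return ToyLazyFrame(Node("Select", cols, (self.plan,)))

    def join(self, other: ToyLazyFrame, on: str) -> ToyLazyFrame:
        # A join is the one node here with two children.
        return ToyLazyFrame(Node("Join", (on,), (self.plan, other.plan)))


def render(node: Node, indent: int = 0) -> str:
    """Pretty-print the plan tree, root first."""
    line = " " * indent + f"{node.operation}{list(node.args)}"
    return "\n".join([line] + [render(c, indent + 2) for c in node.children])


orders = ToyLazyFrame(Node("Scan", ("orders",)))
customers = ToyLazyFrame(Node("Scan", ("customers",)))

# Each call returns a new object wrapping a larger plan; nothing is executed.
q = (
    orders.filter("amount > 20")
    .join(customers, on="customer_id")
    .select("name", "amount")
)
print(render(q.plan))
```

Making both classes frozen mirrors the property that each method returns a new frame with an extended plan and leaves the original untouched.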

The optimizer rewrites the plan before execution:

optimize(plan) -> plan'
  where plan' is semantically equivalent to plan
  but with pushed-down predicates and projections
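To make the rewrite step concrete, here is a toy predicate-pushdown rule over a three-operation plan, with a tiny interpreter used to check that the rewritten plan returns the same rows. It is a simplified illustration under stated assumptions (single-column predicates, pure column projections), not the Polars optimizer; Node, execute, and push_down_filters are names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    operation: str      # "Scan", "Select", or "Filter"
    arg: object = None  # rows for Scan, column names for Select,
                        # (column, predicate) for Filter
    children: tuple = ()


def execute(node):
    """Naive interpreter: evaluate the plan bottom-up on lists of dicts."""
    if node.operation == "Scan":
        return list(node.arg)
    rows = execute(node.children[0])
    if node.operation == "Select":
        return [{c: r[c] for c in node.arg} for r in rows]
    if node.operation == "Filter":
        column, pred = node.arg
        return [r for r in rows if pred(r[column])]
    raise ValueError(f"unknown operation: {node.operation}")


def push_down_filters(node):
    """Rewrite Filter(Select(x)) into Select(Filter(x)) when that is safe."""
    children = tuple(push_down_filters(c) for c in node.children)
    node = Node(node.operation, node.arg, children)
    if node.operation == "Filter" and children and children[0].operation == "Select":
        select = children[0]
        column, _ = node.arg
        # Safe only if the filtered column passes through the projection.
        if column in select.arg:
            pushed = push_down_filters(Node("Filter", node.arg, select.children))
            return Node("Select", select.arg, (pushed,))
    return node


rows = [
    {"a": 1, "b": "x", "c": 0.5},
    {"a": 5, "b": "y", "c": 1.5},
    {"a": 9, "b": "z", "c": 2.5},
]
scan = Node("Scan", rows)
plan = Node("Filter", ("a", lambda v: v > 3), (Node("Select", ("a", "b"), (scan,)),))

optimized = push_down_filters(plan)

# plan' is semantically equivalent to plan, but the filter now sits next to the scan.
assert execute(plan) == execute(optimized)
print(optimized.operation, "->", optimized.children[0].operation)
```

In a real engine the benefit comes from the filter reaching the data source, where it can reduce the rows read; the toy interpreter only demonstrates that the rewrite preserves results.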

Separating the logical plan from the physical execution strategy is a standard design in database systems: the query API and the optimizer are written once against the logical plan, and different execution engines can be attached underneath. The logical plan captures intent; the physical plan captures strategy. In Polars, this separation means the same logical plan can be executed by either the in-memory or streaming physical engine.

Concept                         Relevance
Query plan compilation          Lazy operations build a DAG that is optimized before execution
Unified execution frameworks    Same logical plan supports both in-memory and streaming engines
Predicate pushdown              Filters are moved closer to data sources to reduce I/O
Projection pushdown             Only required columns are read from source files

Related Pages

Implemented By
