Implementation:Pola_rs_Polars_LazyFrame_Collect
Overview
This implementation covers the concrete APIs for executing a lazy query plan and materializing the results into an in-memory DataFrame. The collect() method is the primary execution trigger, with options for streaming execution. The collect_all() function enables parallel execution of multiple independent queries.
APIs
- `LazyFrame.collect() -> DataFrame` — Execute the query plan and return a DataFrame
- `LazyFrame.collect(engine="streaming") -> DataFrame` — Execute the query plan in streaming mode for reduced memory usage
- `pl.collect_all(queries: list[LazyFrame]) -> list[DataFrame]` — Execute multiple independent query plans in parallel
- `LazyFrame.head(n) -> LazyFrame` — Limit the query to the first n rows (enables slice pushdown)
Source Reference
- File: docs/source/src/python/user-guide/lazy/execution.py (Lines 1-40)
- Repository: Pola_rs_Polars
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input (collect) | LazyFrame | A LazyFrame containing a complete query plan ready for execution |
| Input (collect_all) | list[LazyFrame] | A list of independent LazyFrames to execute in parallel |
| Output (collect) | DataFrame | An in-memory DataFrame containing the materialized query results |
| Output (collect_all) | list[DataFrame] | A list of DataFrames corresponding to the input LazyFrames |
Key Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| engine | str | "cpu" | Execution engine: "cpu" (default in-memory) or "streaming" (batch-based streaming) |
| n (for head) | int | 5 | Number of rows to limit the result to; enables slice pushdown optimization |
Example Code
Standard Collection
```python
import polars as pl

q = (
    pl.scan_csv("data.csv")
    .filter(pl.col("a") > 5)
    .select(pl.col("a"), pl.col("b"))
)

# Execute and materialize
df = q.collect()
print(df)
```
Streaming Collection
```python
import polars as pl

q = (
    pl.scan_parquet("large_dataset.parquet")
    .filter(pl.col("status") == "active")
    .group_by("region")
    .agg(pl.col("revenue").sum())
)

# Streaming execution for large datasets
df = q.collect(engine="streaming")
```
Parallel Collection of Multiple Queries
```python
import polars as pl

q1 = pl.scan_csv("sales.csv").filter(pl.col("year") == 2025)
q2 = pl.scan_csv("inventory.csv").group_by("product").agg(pl.col("quantity").sum())
q3 = pl.scan_csv("customers.csv").select(pl.col("id"), pl.col("name"))

# Execute all three queries in parallel
results = pl.collect_all([q1, q2, q3])
sales_df = results[0]
inventory_df = results[1]
customers_df = results[2]
```
Head with Slice Pushdown
```python
import polars as pl

q = (
    pl.scan_csv("data.csv")
    .sort("score", descending=True)
    .head(10)
)

# Only the top 10 rows are materialized
df = q.collect()
```
Import
```python
import polars as pl
```
Behavior Notes
- Optimization runs automatically: when `collect()` is called, the query optimizer runs before execution; there is no need to trigger optimization manually.
- `collect()` is terminal: after collection, the result is an eager DataFrame. To continue with lazy operations, call `.lazy()` on the resulting DataFrame.
- Streaming mode limitations: not all operations are supported in streaming mode. Operations that require full materialization (e.g., certain join types, global sorts) may fall back to in-memory execution or raise an error.
- `collect_all()` enables parallelism: independent queries passed to `collect_all()` may execute concurrently, utilizing multiple CPU cores. This is more efficient than collecting each query sequentially.
- `head()` enables slice pushdown: when `head(n)` is used, the optimizer can push the row limit down through the plan, potentially stopping file reads early.
- Memory considerations: for datasets that do not fit in memory, use `engine="streaming"` to process data in batches. The streaming engine maintains bounded memory usage at the cost of some throughput.
Related Pages
- Principle:Pola_rs_Polars_Lazy_Query_Collection
- Implementation:Pola_rs_Polars_Scan_LazyFrame_Creation
- Implementation:Pola_rs_Polars_LazyFrame_Expression_Chaining
- Implementation:Pola_rs_Polars_DataFrame_Write_and_Convert
- Environment:Pola_rs_Polars_Python_Runtime_Environment
- Environment:Pola_rs_Polars_GPU_Execution_Environment
- Heuristic:Pola_rs_Polars_Lazy_Over_Eager_Preference
- Heuristic:Pola_rs_Polars_Collect_All_For_Diverging_Queries
- Heuristic:Pola_rs_Polars_GPU_Aggregation_Join_Speedup
- Heuristic:Pola_rs_Polars_Use_Spawn_Not_Fork_Multiprocessing
Metadata
| Field | Value |
|---|---|
| Source Repository | Pola_rs_Polars |
| Source File | docs/source/src/python/user-guide/lazy/execution.py:L1-40 |
| Domain | Data Engineering, Query Execution, Materialization |
| Last Updated | 2026-02-09 10:00 GMT |