Implementation:Pola_rs_Polars_LazyFrame_Collect
Overview
This implementation covers the concrete APIs for executing a lazy query plan and materializing the results into an in-memory DataFrame. The collect() method is the primary execution trigger, with options for streaming execution. The collect_all() function enables parallel execution of multiple independent queries.
APIs
- `LazyFrame.collect() -> DataFrame` — Execute the query plan and return a DataFrame
- `LazyFrame.collect(engine="streaming") -> DataFrame` — Execute the query plan in streaming mode for reduced memory usage
- `pl.collect_all(queries: list[LazyFrame]) -> list[DataFrame]` — Execute multiple independent query plans in parallel
- `LazyFrame.head(n) -> LazyFrame` — Limit the query to the first n rows (enables slice pushdown)
Source Reference
- File: docs/source/src/python/user-guide/lazy/execution.py (Lines 1-40)
- Repository: Pola_rs_Polars
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input (collect) | LazyFrame | A LazyFrame containing a complete query plan ready for execution |
| Input (collect_all) | list[LazyFrame] | A list of independent LazyFrames to execute in parallel |
| Output (collect) | DataFrame | An in-memory DataFrame containing the materialized query results |
| Output (collect_all) | list[DataFrame] | A list of DataFrames corresponding to the input LazyFrames |
Key Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| engine | str | "cpu" | Execution engine: "cpu" (default in-memory) or "streaming" (batch-based streaming) |
| n (for head) | int | 5 | Number of rows to limit the result to; enables slice pushdown optimization |
Example Code
Standard Collection
```python
import polars as pl

q = (
    pl.scan_csv("data.csv")
    .filter(pl.col("a") > 5)
    .select(pl.col("a"), pl.col("b"))
)

# Execute and materialize
df = q.collect()
print(df)
```
Streaming Collection
```python
import polars as pl

q = (
    pl.scan_parquet("large_dataset.parquet")
    .filter(pl.col("status") == "active")
    .group_by("region")
    .agg(pl.col("revenue").sum())
)

# Streaming execution for large datasets
df = q.collect(engine="streaming")
```
Parallel Collection of Multiple Queries
```python
import polars as pl

q1 = pl.scan_csv("sales.csv").filter(pl.col("year") == 2025)
q2 = pl.scan_csv("inventory.csv").group_by("product").agg(pl.col("quantity").sum())
q3 = pl.scan_csv("customers.csv").select(pl.col("id"), pl.col("name"))

# Execute all three queries in parallel
results = pl.collect_all([q1, q2, q3])
sales_df = results[0]
inventory_df = results[1]
customers_df = results[2]
```
Head with Slice Pushdown
```python
import polars as pl

q = (
    pl.scan_csv("data.csv")
    .sort("score", descending=True)
    .head(10)
)

# Only the top 10 rows are materialized
df = q.collect()
```
Import
```python
import polars as pl
```
Behavior Notes
- Optimization runs automatically: when `collect()` is called, the query optimizer runs before execution; there is no need to trigger optimization manually.
- `collect()` is terminal: after collection, the result is an eager DataFrame. To continue with lazy operations, call `.lazy()` on the resulting DataFrame.
- Streaming mode limitations: not all operations are supported in streaming mode. Operations that require full materialization (e.g., certain join types, global sorts) may fall back to in-memory execution or raise an error.
- `collect_all()` enables parallelism: independent queries passed to `collect_all()` may execute concurrently, utilizing multiple CPU cores. This is more efficient than collecting each query sequentially.
- `head()` enables slice pushdown: when `head(n)` is used, the optimizer can push the row limit down through the plan, potentially stopping file reads early.
- Memory considerations: for datasets that do not fit in memory, use `engine="streaming"` to process data in batches. The streaming engine maintains bounded memory usage at the cost of some throughput.
Related Pages
- Principle:Pola_rs_Polars_Lazy_Query_Collection
- Implementation:Pola_rs_Polars_Scan_LazyFrame_Creation
- Implementation:Pola_rs_Polars_LazyFrame_Expression_Chaining
- Implementation:Pola_rs_Polars_DataFrame_Write_and_Convert
- Environment:Pola_rs_Polars_Python_Runtime_Environment
- Environment:Pola_rs_Polars_GPU_Execution_Environment
- Heuristic:Pola_rs_Polars_Lazy_Over_Eager_Preference
- Heuristic:Pola_rs_Polars_Collect_All_For_Diverging_Queries
- Heuristic:Pola_rs_Polars_GPU_Aggregation_Join_Speedup
- Heuristic:Pola_rs_Polars_Use_Spawn_Not_Fork_Multiprocessing
Metadata
| Field | Value |
|---|---|
| Source Repository | Pola_rs_Polars |
| Source File | docs/source/src/python/user-guide/lazy/execution.py:L1-40 |
| Domain | Data Engineering, Query Execution, Materialization |
| Last Updated | 2026-02-09 10:00 GMT |