Heuristic:Pola-rs Polars Collect All For Diverging Queries
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Query_Planning |
| Last Updated | 2026-02-09 10:00 GMT |
Overview
Use `pl.collect_all()` instead of separate `.collect()` calls when multiple lazy queries share common subexpressions, ensuring shared computation is executed only once.
Description
When a LazyFrame query diverges into multiple downstream queries, each separate `.collect()` call recomputes the entire shared query plan from scratch. The `pl.collect_all()` function accepts a list of LazyFrames and detects common subplans, executing shared portions only once through the common subplan elimination optimization. This is particularly important for expensive operations like large file scans, complex joins, or aggregations that are reused across multiple output queries.
Usage
Apply this heuristic whenever you define a base LazyFrame and then derive multiple queries from it. This is common in ETL pipelines where the same data is aggregated in different ways, or in analytical workflows where summary statistics and detail records are computed from the same source.
The Insight (Rule of Thumb)
- Action: Instead of calling `.collect()` on each LazyFrame separately, pass all LazyFrames to `pl.collect_all([lf_1, lf_2, ...])`.
- Value: The shared LazyFrame is computed only once. For expensive base queries (large file scans, complex transformations), this can cut total computation time by half or more.
- Trade-off: Slightly more verbose code. All queries must be ready to execute at the same point in the program.
Reasoning
The Polars documentation states: "LazyFrames are query plans i.e. a promise on computation and is not guaranteed to cache common subplans. This means that every time you reuse it in separate downstream queries after it is defined, it is computed all over again." The `collect_all` function solves this by analyzing all query plans together and finding shared subplans.
Additionally, for non-deterministic operations: "If you define an operation on a LazyFrame that doesn't maintain row order (such as a group_by), then the order will also change every time it is run." Using `collect_all` ensures consistent results across derived queries since the shared computation runs once.
Example from the documentation:
```python
# Some expensive LazyFrame
lf: LazyFrame

lf_1 = lf.select(pl.all().sum())
lf_2 = lf.some_other_computation()

pl.collect_all([lf_1, lf_2])  # this will execute lf only once!
```