Heuristic:Pola-rs Polars Lazy Over Eager Preference
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Query_Planning |
| Last Updated | 2026-02-09 10:00 GMT |
Overview
Prefer the Polars Lazy API over the Eager API by default: it enables automatic query optimization, reduces memory usage, and unlocks streaming execution for larger-than-memory datasets.
Description
Polars supports two execution modes: eager (immediate execution) and lazy (deferred execution with optimization). The lazy API builds a query plan that is optimized before execution, enabling predicate pushdown, projection pushdown, slice pushdown, common subplan elimination, join ordering, and expression simplification. The eager API actually calls the lazy API internally for many operations, but starting with the lazy API from the beginning (via `scan_*` functions) allows the full optimization pipeline to run, including pushing filters and projections down to the data source level.
Usage
Use the lazy API (via `pl.scan_csv`, `pl.scan_parquet`, `df.lazy()`, etc.) for all production workflows. Only use the eager API (`pl.read_csv`, direct DataFrame operations) during interactive exploration when you need to inspect intermediate results and don't yet know what your final query will look like.
The Insight (Rule of Thumb)
- Action: Use `pl.scan_*` instead of `pl.read_*` as the entry point for queries. Chain expressions with `.select()`, `.filter()`, `.group_by()`, etc., then call `.collect()` at the end.
- Value: Enables 8 automatic optimizations: predicate pushdown, projection pushdown, slice pushdown, common subplan elimination, expression simplification, join ordering, type coercion, and cardinality estimation.
- Trade-off: Intermediate results cannot be inspected without collecting. Schema errors (for example, a misspelled column name) are reported when the plan is resolved rather than step by step as each operation runs.
Reasoning
The lazy API lets Polars see the entire query before executing any part of it. Consider a query that filters rows and selects columns. With the eager API, Polars first reads all rows and all columns, then filters, then selects. With the lazy API, Polars pushes the filter down to the scan, so non-matching rows are dropped as the data is read, and reads only the columns the query needs. Less data is read and materialized, which reduces both I/O and memory usage. The Polars documentation explicitly states: "the lazy API should be preferred unless you are either interested in the intermediate results or are doing exploratory work."
The lazy API also unlocks streaming execution via `collect(engine="streaming")`, enabling processing of datasets larger than available memory.
Optimizations applied by the lazy engine:
| Optimization | What It Does | How Often It Runs |
|---|---|---|
| Predicate pushdown | Applies filters at scan level | 1 time |
| Projection pushdown | Selects only needed columns at scan level | 1 time |
| Slice pushdown | Only loads required slice from scan | 1 time |
| Common subplan elimination | Caches shared subtrees/file scans | 1 time |
| Expression simplification | Constant folding, faster alternatives | Until fixed point |
| Join ordering | Executes smaller join branches first | 1 time |
| Type coercion | Minimal memory type promotion | Until fixed point |
| Cardinality estimation | Optimal group-by strategy selection | 0/n times |