Principle: Polars Streaming Data Scanning
| Knowledge Sources | |
|---|---|
| Domains | Data Engineering, Streaming |
| Last Updated | 2026-02-09 10:00 GMT |
Overview
Lazily scanning data sources that may exceed available memory, leveraging glob patterns for multi-file datasets and deferring all I/O until streaming execution.
Description
When datasets grow beyond the capacity of available RAM, loading them entirely into memory before processing becomes infeasible. Streaming data scanning solves this by creating query plan nodes that reference data partitions without actually loading them. The scan operation reads only schema metadata (column names, types, row-group statistics) and registers the data source as a leaf node in a logical plan DAG.
The core design works as follows:
- Deferred I/O -- Scan functions such as `scan_csv`, `scan_parquet`, `scan_ndjson`, and `scan_ipc` return a `LazyFrame` immediately without reading any row data. The actual file reads are deferred until a terminal action (`collect` or `sink`) triggers execution.
- Glob-based partition discovery -- The `source` parameter accepts glob patterns (e.g., `"data/*.parquet"` or `"s3://bucket/**/*.csv"`), enabling automatic discovery of partitioned datasets spread across many files. Each matched file becomes a separate partition in the scan node, allowing the streaming engine to process them independently.
- Schema propagation -- The scan node exposes the discovered schema (column names and data types) to downstream plan nodes. This enables the query optimizer to perform projection pushdown (reading only the needed columns) and predicate pushdown (skipping irrelevant row groups) before any data is materialized.
- Cloud and local transparency -- The same scan API works with local file paths and cloud storage URIs (S3, GCS, Azure Blob). The storage backend is resolved from the URI scheme, making pipelines portable across environments.
This principle is the entry point for all out-of-core processing pipelines in Polars. Without a lazy scan, data must be loaded eagerly via `pl.read_*`, which requires the full dataset to fit in memory.
Usage
Use streaming data scanning whenever you need to:
- Process datasets larger than available RAM by deferring I/O to the streaming engine.
- Work with partitioned datasets stored as many files in a directory.
- Build query pipelines against cloud-hosted data without downloading entire datasets first.
- Enable projection and predicate pushdown optimizations that reduce I/O at the source.
Theoretical Basis
Streaming data scanning draws on several well-established concepts from database systems and algorithm design:
Out-of-core algorithms process data that does not fit in main memory by reading it in blocks from secondary storage. The streaming scan implements this by treating each file (or row group within a file) as a block that can be loaded, processed, and discarded independently.
External sorting and I/O scheduling research shows that minimizing the number of I/O operations is critical for performance. By deferring reads and enabling pushdown optimizations, the scan node ensures that only the required columns and rows are read from disk or network.
Formally, let D = {f_1, f_2, ..., f_n} be the set of files matched by a glob pattern, and let S(f_i) be the schema of file f_i. The scan operation produces:
ScanNode(D) = LazyFrame with schema S = union(S(f_1), ..., S(f_n))
where data(f_i) is not loaded until execution time
and projection P subset of S reduces I/O to only columns in P
The key insight is that the scan node carries enough metadata (schema, file paths, row-group statistics) to enable plan optimization without touching the actual data.
| Concept | Relevance |
|---|---|
| Out-of-core algorithms | Data is processed in blocks from storage, never fully materialized |
| I/O scheduling | Reads are deferred and minimized through pushdown optimizations |
| Partition discovery | Glob patterns automate identification of dataset partitions |
| Schema inference | Metadata is read eagerly to enable downstream optimization |