Principle:Polars Streaming Data Scanning

From Leeroopedia


Knowledge Sources
Domains Data Engineering, Streaming
Last Updated 2026-02-09 10:00 GMT

Overview

Lazily scanning data sources that may exceed available memory, leveraging glob patterns for multi-file datasets and deferring all I/O until streaming execution.

Description

When datasets grow beyond the capacity of available RAM, loading them entirely into memory before processing becomes infeasible. Streaming data scanning solves this by creating query plan nodes that reference data partitions without actually loading them. The scan operation reads only schema metadata (column names, types, row-group statistics) and registers the data source as a leaf node in a logical plan DAG.

The core design works as follows:

  1. Deferred I/O -- Scan functions such as scan_csv, scan_parquet, scan_ndjson, and scan_ipc return a LazyFrame immediately without reading any row data. The actual file reads are deferred until a terminal action (collect or sink) triggers execution.
  2. Glob-based partition discovery -- The source parameter accepts glob patterns (e.g., "data/*.parquet" or "s3://bucket/**/*.csv"), enabling automatic discovery of partitioned datasets spread across many files. Each matched file becomes a separate partition in the scan node, allowing the streaming engine to process them independently.
  3. Schema propagation -- The scan node exposes the discovered schema (column names and data types) to downstream plan nodes. This enables the query optimizer to perform projection pushdown (reading only needed columns) and predicate pushdown (skipping irrelevant row groups) before any data is materialized.
  4. Cloud and local transparency -- The same scan API works with local file paths and cloud storage URIs (S3, GCS, Azure Blob). The storage backend is resolved from the URI scheme, making pipelines portable across environments.

This principle is the entry point for all out-of-core processing pipelines in Polars. Without a lazy scan, data must be loaded eagerly via pl.read_*, which requires the full dataset to fit in memory.

Usage

Use streaming data scanning whenever you need to:

  • Process datasets larger than available RAM by deferring I/O to the streaming engine.
  • Work with partitioned datasets stored as many files in a directory.
  • Build query pipelines against cloud-hosted data without downloading entire datasets first.
  • Enable projection and predicate pushdown optimizations that reduce I/O at the source.

Theoretical Basis

Streaming data scanning draws on several well-established concepts from database systems and algorithm design:

Out-of-core algorithms process data that does not fit in main memory by reading it in blocks from secondary storage. The streaming scan implements this by treating each file (or row group within a file) as a block that can be loaded, processed, and discarded independently.

External sorting and I/O scheduling research shows that minimizing the number of I/O operations is critical for performance. By deferring reads and enabling pushdown optimizations, the scan node ensures that only the required columns and rows are read from disk or network.

Formally, let D = {f_1, f_2, ..., f_n} be the set of files matched by a glob pattern, and let S(f_i) be the schema of file f_i. The scan operation produces:

ScanNode(D) = LazyFrame with schema S = union(S(f_1), ..., S(f_n))
  where data(f_i) is not loaded until execution time
  and projection P subset of S reduces I/O to only columns in P

The key insight is that the scan node carries enough metadata (schema, file paths, row-group statistics) to enable plan optimization without touching the actual data.

Concept | Relevance
Out-of-core algorithms | Data is processed in blocks from storage, never fully materialized
I/O scheduling | Reads are deferred and minimized through pushdown optimizations
Partition discovery | Glob patterns automate identification of dataset partitions
Schema inference | Metadata is read eagerly to enable downstream optimization

Related Pages

Implemented By
