Implementation: Polars Scan for Streaming
| Knowledge Sources | |
|---|---|
| Domains | Data Engineering, Streaming |
| Last Updated | 2026-02-09 10:00 GMT |
Overview
Concrete scan functions that create LazyFrame objects from file-based data sources, supporting glob patterns for partitioned datasets and deferring all I/O until streaming execution.
Description
Polars provides a family of scan_* functions, one for each supported file format. Each function accepts a source parameter (a file path, glob pattern, or cloud URI) and returns a LazyFrame without reading any row data. The LazyFrame captures the schema and file references needed for downstream query planning and optimization.
These scan functions are the entry point for all streaming and out-of-core workflows. When the resulting LazyFrame is later collected with engine="streaming" or written via a sink_* method, the streaming engine reads data in batches from the scanned sources.
Usage
Use these scan functions whenever you need to:
- Build a lazy query against CSV, Parquet, NDJSON, or IPC files.
- Process multi-file datasets using glob patterns.
- Enable streaming execution for larger-than-RAM datasets.
- Access cloud-hosted data via S3, GCS, or Azure URIs.
Code Reference
Source Location
- Repository: Polars
- File: docs/source/src/python/user-guide/concepts/streaming.py (line 9)
Signature
import polars as pl
# CSV scanning
pl.scan_csv(source: str | Path) -> LazyFrame
# Parquet scanning
pl.scan_parquet(source: str | Path) -> LazyFrame
# NDJSON scanning
pl.scan_ndjson(source: str | Path) -> LazyFrame
# IPC (Arrow/Feather) scanning
pl.scan_ipc(source: str | Path) -> LazyFrame
Import
import polars as pl
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| source | str \| Path | Yes | File path, glob pattern (e.g., "data/*.csv"), or cloud URI (e.g., "s3://bucket/data/**/*.parquet") |
Outputs
| Name | Type | Description |
|---|---|---|
| result | LazyFrame | A lazy query plan node referencing the scanned data source. No row data is loaded. Schema metadata is available immediately. |
Usage Examples
Scan a Single CSV File
import polars as pl
# Scan a single CSV -- no data is loaded yet
lf = pl.scan_csv("large_file.csv")
# Inspect the schema without reading data
print(lf.collect_schema())
Scan Multiple Parquet Files with Glob
import polars as pl
# Glob pattern discovers all .parquet files in the directory
lf = pl.scan_parquet("my_dataset/*.parquet")
# Each matched file becomes a partition in the scan node
print(lf.collect_schema())
Scan Cloud Data
import polars as pl
# S3 URI with recursive glob
lf = pl.scan_parquet("s3://bucket/data/**/*.parquet")
# Azure Blob Storage
lf = pl.scan_parquet("az://container/path/*.parquet")
Complete Streaming Pipeline Starting from Scan
import polars as pl
# Scan is the entry point for streaming
q = (
pl.scan_csv("docs/assets/data/iris.csv")
.filter(pl.col("sepal_length") > 5)
.group_by("species")
.agg(pl.col("sepal_width").mean())
)
# Execute with streaming engine
df = q.collect(engine="streaming")