

Implementation:Polars Scan for Streaming

From Leeroopedia


Knowledge Sources
Domains Data Engineering, Streaming
Last Updated 2026-02-09 10:00 GMT

Overview

Concrete scan functions that create LazyFrame objects from file-based data sources, supporting glob patterns for partitioned datasets and deferring all I/O until streaming execution.

Description

Polars provides a family of scan_* functions, one for each supported file format. Each function accepts a source parameter (a file path, glob pattern, or cloud URI) and returns a LazyFrame without reading any row data. The LazyFrame captures the schema and file references needed for downstream query planning and optimization.

These scan functions are the entry point for all streaming and out-of-core workflows. When the resulting LazyFrame is later collected with engine="streaming" or written via a sink_* method, the streaming engine reads data in batches from the scanned sources.

Usage

Use these scan functions whenever you need to:

  • Build a lazy query against CSV, Parquet, NDJSON, or IPC files.
  • Process multi-file datasets using glob patterns.
  • Enable streaming execution for larger-than-RAM datasets.
  • Access cloud-hosted data via S3, GCS, or Azure URIs.

Code Reference

Source Location

  • Repository: Polars
  • File: docs/source/src/python/user-guide/concepts/streaming.py (line 9)

Signature

import polars as pl

# CSV scanning
pl.scan_csv(source: str | Path) -> LazyFrame

# Parquet scanning
pl.scan_parquet(source: str | Path) -> LazyFrame

# NDJSON scanning
pl.scan_ndjson(source: str | Path) -> LazyFrame

# IPC (Arrow/Feather) scanning
pl.scan_ipc(source: str | Path) -> LazyFrame

Import

import polars as pl

I/O Contract

Inputs

Name    Type        Required  Description
source  str | Path  Yes       File path, glob pattern (e.g., "data/*.csv"), or cloud URI (e.g., "s3://bucket/data/**/*.parquet")

Outputs

Name    Type       Description
result  LazyFrame  A lazy query plan node referencing the scanned data source. No row data is loaded; schema metadata is available immediately.

Usage Examples

Scan a Single CSV File

import polars as pl

# Scan a single CSV -- no data is loaded yet
lf = pl.scan_csv("large_file.csv")

# Inspect the schema without reading data
print(lf.collect_schema())

Scan Multiple Parquet Files with Glob

import polars as pl

# Glob pattern discovers all .parquet files in the directory
lf = pl.scan_parquet("my_dataset/*.parquet")

# Each matched file becomes a partition in the scan node
print(lf.collect_schema())

Scan Cloud Data

import polars as pl

# S3 URI with recursive glob
lf = pl.scan_parquet("s3://bucket/data/**/*.parquet")

# Azure Blob Storage
lf = pl.scan_parquet("az://container/path/*.parquet")

Complete Streaming Pipeline Starting from Scan

import polars as pl

# Scan is the entry point for streaming
q = (
    pl.scan_csv("docs/assets/data/iris.csv")
    .filter(pl.col("sepal_length") > 5)
    .group_by("species")
    .agg(pl.col("sepal_width").mean())
)

# Execute with streaming engine
df = q.collect(engine="streaming")

Related Pages

Implements Principle
