Implementation: Polars Read and Scan Operations
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ETL, File_Format_Parsing |
| Last Updated | 2026-02-09 10:00 GMT |
Overview
Concrete read and scan functions for ingesting data from CSV, Parquet, JSON, NDJSON, IPC, Excel, and database sources into Polars DataFrames and LazyFrames.
Description
The Read and Scan Operations provide the primary data ingestion interface in Polars. The read_* family of functions performs eager loading, immediately parsing and materializing data into a DataFrame. The scan_* family creates a LazyFrame with a deferred query plan, enabling predicate and projection pushdown optimizations. Both families support local files, glob patterns, cloud URIs, and URLs.
Usage
Import polars and call the appropriate read or scan function for the target format. Use read_* for immediate access to small/medium datasets. Use scan_* followed by .collect() for large datasets or optimized pipelines. Pass storage_options or credential_provider when accessing cloud storage.
Code Reference
Source Location
- Repository: polars
- Files:
- docs/source/src/python/user-guide/io/csv.py (Lines: 1-19)
- docs/source/src/python/user-guide/io/parquet.py (Lines: 1-19)
- docs/source/src/python/user-guide/io/json.py (Lines: 1-27)
Signature
# Eager read functions
pl.read_csv(source: str, try_parse_dates: bool = False, ...) -> DataFrame
pl.read_parquet(source: str, ...) -> DataFrame
pl.read_json(source: str, ...) -> DataFrame
pl.read_ndjson(source: str, ...) -> DataFrame
pl.read_excel(source: str, sheet_name: str | None = None, ...) -> DataFrame
pl.read_database_uri(query: str, uri: str, ...) -> DataFrame
# Lazy scan functions
pl.scan_csv(source: str, ...) -> LazyFrame
pl.scan_parquet(source: str, hive_partitioning: bool = False, ...) -> LazyFrame
pl.scan_ndjson(source: str, ...) -> LazyFrame
pl.scan_ipc(source: str, ...) -> LazyFrame
Import
import polars as pl
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| source | str | Yes | File path, URL, glob pattern, or cloud storage URI (s3://, az://, gs://, hf://) |
| try_parse_dates | bool | No | Attempt to parse date columns automatically (CSV reader) |
| hive_partitioning | bool | No | Enable Hive-style partition discovery for partitioned datasets (Parquet scanner) |
| sheet_name | str | No | Name of the worksheet to read (Excel reader) |
| storage_options | dict | No | Cloud storage authentication credentials |
| credential_provider | CredentialProvider | No | Managed credential provider for cloud access |
| query | str | Yes (database) | SQL query string for database reads |
| uri | str | Yes (database) | Database connection URI for database reads |
Outputs
| Name | Type | Description |
|---|---|---|
| DataFrame | polars.DataFrame | Eagerly loaded tabular data (from read_* functions) |
| LazyFrame | polars.LazyFrame | Deferred query plan for lazy evaluation (from scan_* functions); call .collect() to materialize |
Usage Examples
import polars as pl
# --- Eager reads ---
# Read a CSV file
df = pl.read_csv("data.csv")
# Read a Parquet file
df = pl.read_parquet("data.parquet")
# Read JSON
df = pl.read_json("data.json")
# Read NDJSON (newline-delimited JSON)
df = pl.read_ndjson("data.ndjson")
# --- Lazy scans ---
# Scan a CSV (creates LazyFrame, no data read yet)
lf = pl.scan_csv("data.csv")
# Scan Parquet files from S3 with glob
lf = pl.scan_parquet("s3://bucket/*.parquet")
# Scan Hive-partitioned Parquet dataset
lf = pl.scan_parquet("dataset/", hive_partitioning=True)
# Read from Hugging Face Hub
df = pl.read_parquet("hf://datasets/org/repo/data.parquet")
# Collect a lazy scan with filters (enables predicate pushdown)
df = (
pl.scan_parquet("s3://bucket/large_dataset/*.parquet")
    .filter(pl.col("date") > pl.date(2025, 1, 1))
.select("id", "date", "value")
.collect()
)