Implementation: Eventual Inc Daft Read Parquet
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Analytics |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool from the Daft library for reading Parquet files into a DataFrame.
Description
The read_parquet function creates a lazy DataFrame from one or more Apache Parquet files. It supports local paths, S3, GCS, and Azure Blob Storage with glob pattern matching. The function constructs a scan plan that defers actual data reading until an action is triggered, enabling predicate and projection pushdown optimizations. It also supports hive-style partitioning, custom schemas, row group selection, and Int96 timestamp coercion.
Usage
Import and use this function when you need to read Parquet files from local or remote storage into a Daft DataFrame.
Code Reference
Source Location
- Repository: Daft
- File: daft/io/_parquet.py
- Lines: 18-96
Signature
def read_parquet(
    path: str | list[str],
    row_groups: list[list[int]] | None = None,
    infer_schema: bool = True,
    schema: dict[str, DataType] | None = None,
    io_config: IOConfig | None = None,
    file_path_column: str | None = None,
    hive_partitioning: bool = False,
    coerce_int96_timestamp_unit: str | TimeUnit | None = None,
) -> DataFrame
Import
from daft import read_parquet
# or
import daft
daft.read_parquet(...)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str \| list[str] | Yes | Path to Parquet file(s). Supports glob wildcards and remote URLs (e.g., s3://, gs://). |
| row_groups | list[list[int]] \| None | No | Row groups to read, one list per file. Defaults to None (read all row groups). |
| infer_schema | bool | No | Whether to infer the schema from the Parquet metadata. Defaults to True. |
| schema | dict[str, DataType] \| None | No | Definitive schema (if infer_schema=False), or a hint applied after inference. Defaults to None. |
| io_config | IOConfig \| None | No | Configuration for the native downloader (S3, GCS, Azure credentials, etc.). Defaults to None. |
| file_path_column | str \| None | No | If set, includes the source file path as a column with this name. Defaults to None. |
| hive_partitioning | bool | No | Whether to infer hive-style partition columns from file paths. Defaults to False. |
| coerce_int96_timestamp_unit | str \| TimeUnit \| None | No | TimeUnit to coerce Int96 timestamps to (e.g., "ns", "us", "ms"). Defaults to None. |
Outputs
| Name | Type | Description |
|---|---|---|
| return | DataFrame | A lazy DataFrame with a scan plan over the Parquet data. No data is read until an action is triggered. |
Usage Examples
Basic Usage
import daft
# Read a single Parquet file
df = daft.read_parquet("/path/to/file.parquet")
# Read all Parquet files in a directory
df = daft.read_parquet("/path/to/directory")
# Read with glob pattern
df = daft.read_parquet("/path/to/files-*.parquet")
Reading from S3
import daft
from daft.io import S3Config, IOConfig
io_config = IOConfig(s3=S3Config(region="us-west-2", anonymous=True))
df = daft.read_parquet("s3://bucket/path/*.parquet", io_config=io_config)
df.show()