Implementation:Eventual Inc Daft Read Parquet

Knowledge Sources

  • Domains: Data_Engineering, Analytics
  • Last Updated: 2026-02-08 00:00 GMT

Overview

A concrete tool, provided by the Daft library, for reading Parquet files into a DataFrame.

Description

The read_parquet function creates a lazy DataFrame from one or more Apache Parquet files. It supports local paths, S3, GCS, and Azure Blob Storage, with glob pattern matching. The function constructs a scan plan that defers actual data reading until an action is triggered, enabling predicate and projection pushdown optimizations. It also supports Hive-style partitioning, custom schemas, row group selection, and Int96 timestamp coercion.
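
For instance, a minimal sketch of the lazy behavior (the file path and column names here are hypothetical):

import daft

# Building the query only constructs a scan plan; no Parquet data is read yet.
df = daft.read_parquet("events.parquet")
df = df.where(df["user_id"] == 42).select("ts")

# collect() triggers execution; the filter and projection above are
# candidates for predicate and projection pushdown into the scan.
result = df.collect()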

Usage

Import and use this function when you need to read Parquet files from local or remote storage into a Daft DataFrame.

Code Reference

Source Location

  • Repository: Daft
  • File: daft/io/_parquet.py
  • Lines: L18-96

Signature

def read_parquet(
    path: str | list[str],
    row_groups: list[list[int]] | None = None,
    infer_schema: bool = True,
    schema: dict[str, DataType] | None = None,
    io_config: IOConfig | None = None,
    file_path_column: str | None = None,
    hive_partitioning: bool = False,
    coerce_int96_timestamp_unit: str | TimeUnit | None = None,
) -> DataFrame
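
As an example of the schema parameters, a sketch of overriding inference with an explicit schema (the file path and column names are hypothetical):

import daft
from daft import DataType

# With infer_schema=False, the provided schema is used as the definitive schema.
df = daft.read_parquet(
    "data.parquet",
    infer_schema=False,
    schema={"id": DataType.int64(), "name": DataType.string()},
)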

Import

from daft import read_parquet

# or
import daft
daft.read_parquet(...)

I/O Contract

Inputs

  • path (str | list[str], required): Path to Parquet file(s). Supports wildcards and remote URLs (e.g., s3://, gs://).
  • row_groups (list[list[int]] | None, optional): List of row groups to read, one list per file. Defaults to None.
  • infer_schema (bool, optional): Whether to infer the schema from the Parquet metadata. Defaults to True.
  • schema (dict[str, DataType] | None, optional): Schema used as definitive (if infer_schema=False) or as a hint applied after inference.
  • io_config (IOConfig | None, optional): Configuration for the native downloader (S3, GCS, Azure credentials, etc.).
  • file_path_column (str | None, optional): If set, includes the source file path as a column with this name.
  • hive_partitioning (bool, optional): Whether to infer Hive-style partitions from file paths. Defaults to False.
  • coerce_int96_timestamp_unit (str | TimeUnit | None, optional): TimeUnit to coerce Int96 timestamps to (e.g., ns, us, ms). Defaults to None.
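
As a sketch of combining several of these inputs, assuming a Hive-partitioned layout such as /data/year=2024/month=01/part-0.parquet (paths and column names hypothetical):

import daft

df = daft.read_parquet(
    "/data/**/*.parquet",
    hive_partitioning=True,          # infers "year" and "month" partition columns
    file_path_column="source_file",  # records which file each row came from
)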

Outputs

  • return (DataFrame): A lazy DataFrame with a scan plan over the Parquet data. No data is read until an action is triggered.
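
Because the result is lazy, the scan plan can be inspected before any I/O happens, e.g. with DataFrame.explain() (file path hypothetical):

import daft

df = daft.read_parquet("/path/to/file.parquet")
df.explain()  # prints the logical plan; still no data read
df.show()     # triggers execution of the scan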

Usage Examples

Basic Usage

import daft

# Read a single Parquet file
df = daft.read_parquet("/path/to/file.parquet")

# Read all Parquet files in a directory
df = daft.read_parquet("/path/to/directory")

# Read with glob pattern
df = daft.read_parquet("/path/to/files-*.parquet")

Reading from S3

import daft
from daft.io import S3Config, IOConfig

io_config = IOConfig(s3=S3Config(region_name="us-west-2", anonymous=True))
df = daft.read_parquet("s3://bucket/path/*.parquet", io_config=io_config)
df.show()
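
A similar sketch works for Google Cloud Storage, assuming a public bucket readable anonymously (bucket and path hypothetical):

import daft
from daft.io import GCSConfig, IOConfig

# anonymous=True works only for publicly readable buckets.
io_config = IOConfig(gcs=GCSConfig(anonymous=True))
df = daft.read_parquet("gs://bucket/path/*.parquet", io_config=io_config)
df.show()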

Related Pages

  • Implements Principle
  • Requires Environment
