Implementation: Eventual Inc Daft Read Huggingface
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Machine_Learning |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool, provided by the Daft library, for loading HuggingFace datasets into a DataFrame.
Description
The read_huggingface function creates a DataFrame from a HuggingFace Hub dataset repository. It first attempts to read the dataset as Parquet files using the hf://datasets/ protocol (the fast path), and falls back to the HuggingFace datasets library if Parquet files are not available. This dual-path strategy supports all public datasets and all private Parquet datasets on HuggingFace Hub.
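The fallback logic described above can be sketched as follows. The helper names and the exception type are hypothetical stand-ins for illustration, not Daft's actual internals:

```python
# Hedged sketch of the dual-path strategy: try the fast Parquet path first,
# and fall back to a slower generic loader when Parquet files are unavailable.
# All names below are hypothetical, not part of Daft.

class ParquetNotAvailable(Exception):
    """Raised when a repo exposes no Parquet files (hypothetical)."""

def _read_parquet_fast_path(repo: str) -> dict:
    # Stand-in for reading via the hf://datasets/ protocol.
    if repo.endswith("-no-parquet"):  # simulated "no Parquet files" condition
        raise ParquetNotAvailable(repo)
    return {"source": "parquet", "repo": repo}

def _read_via_datasets_library(repo: str) -> dict:
    # Stand-in for the HuggingFace `datasets` fallback, which materializes rows.
    return {"source": "datasets", "repo": repo}

def read_with_fallback(repo: str) -> dict:
    # Fast path first; fall back only when the fast path signals unavailability.
    try:
        return _read_parquet_fast_path(repo)
    except ParquetNotAvailable:
        return _read_via_datasets_library(repo)
```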
Usage
Import and use this function when you need to load a HuggingFace dataset into a Daft DataFrame for distributed processing.
Code Reference
Source Location
- Repository: Daft
- File: daft/io/huggingface/__init__.py
- Lines: 37-61
Signature
```python
def read_huggingface(
    repo: str,
    io_config: IOConfig | None = None,
) -> DataFrame
```
Import
```python
from daft import read_huggingface
# or
import daft
daft.read_huggingface(...)
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| repo | str | Yes | HuggingFace repository in the form username/dataset_name |
| io_config | IOConfig \| None | No | IO configuration for reading data (e.g., authentication tokens for private datasets) |
Outputs
| Name | Type | Description |
|---|---|---|
| return | DataFrame | A DataFrame containing the dataset rows. Lazy when using the Parquet path; materialized when using the datasets library fallback. |
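Because the repo argument must follow the username/dataset_name shape, a small client-side check can catch malformed values before any network call. This validator is a hypothetical illustration, not part of Daft's API:

```python
import re

# Hypothetical helper (not part of Daft) that checks a repo string matches
# the username/dataset_name shape the I/O contract above expects.
_REPO_RE = re.compile(r"^[\w.-]+/[\w.-]+$")

def looks_like_hf_repo(repo: str) -> bool:
    # Exactly one slash separating two non-empty path components.
    return bool(_REPO_RE.fullmatch(repo))
```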
Usage Examples
Basic Usage
```python
import daft

# Load a public HuggingFace dataset
df = daft.read_huggingface("username/dataset_name")
df.show()
```
With IO Configuration
```python
import daft
from daft.io import IOConfig, HTTPConfig

# Load with custom IO configuration (e.g., a HuggingFace access token)
io_config = IOConfig(http=HTTPConfig(bearer_token="hf_your_token_here"))
df = daft.read_huggingface("username/private_dataset", io_config=io_config)
df.show()
```