Implementation:Eventual Inc Daft Read Huggingface


Knowledge Sources
Domains Data_Engineering, Machine_Learning
Last Updated 2026-02-08 00:00 GMT

Overview

A concrete tool, provided by the Daft library, for loading HuggingFace datasets into a Daft DataFrame.

Description

The read_huggingface function creates a DataFrame from a HuggingFace Hub dataset repository. It first tries to read the dataset as Parquet files through the hf://datasets/ protocol (the fast path). If no Parquet files are available, it falls back to the HuggingFace datasets library. Together, the two paths cover all public datasets on HuggingFace Hub, plus private datasets that are available as Parquet files.

Usage

Import and use this function when you need to load a HuggingFace dataset into a Daft DataFrame for distributed processing.

Code Reference

Source Location

  • Repository: Daft
  • File: daft/io/huggingface/__init__.py
  • Lines: L37-61

Signature

def read_huggingface(
    repo: str,
    io_config: IOConfig | None = None,
) -> DataFrame

Import

from daft import read_huggingface

# or
import daft
daft.read_huggingface(...)

I/O Contract

Inputs

Name Type Required Description
repo str Yes HuggingFace repository in the form username/dataset_name
io_config IOConfig | None No IO configuration for reading data (e.g., authentication tokens for private datasets); defaults to None

Outputs

Name Type Description
return DataFrame A DataFrame containing the dataset rows. Lazy when using the Parquet path; materialized when using the datasets library fallback.

Usage Examples

Basic Usage

import daft

# Load a public HuggingFace dataset
df = daft.read_huggingface("username/dataset_name")
df.show()

With IO Configuration

import daft
from daft.io import IOConfig, HTTPConfig

# Load a private dataset by passing a HuggingFace access token as an HTTP bearer token
io_config = IOConfig(http=HTTPConfig(bearer_token="hf_your_token_here"))
df = daft.read_huggingface("username/private_dataset", io_config=io_config)
df.show()

Related Pages

Implements Principle

Requires Environment
