
Principle:Eventual Inc Daft Data Ingestion HuggingFace

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Machine_Learning
Last Updated 2026-02-08 00:00 GMT

Overview

Technique for loading datasets directly from HuggingFace Hub repositories into a distributed DataFrame.

Description

HuggingFace Hub is one of the largest repositories of open datasets for machine learning. Daft provides a first-class integration for loading HuggingFace datasets into a distributed DataFrame, using a Parquet-first approach with a fallback strategy.

When read_huggingface is called, Daft first attempts to read the dataset as Parquet files directly from the HuggingFace Hub using the hf://datasets/ protocol prefix. This is the fast path because Parquet files can be lazily scanned with predicate and projection pushdown, avoiding the need to download entire datasets.
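As a sketch, the fast path amounts to building an hf:// URI and handing it to Daft's Parquet reader. The repo id below is a placeholder, not a real dataset, and the actual read call is left commented out because it requires network access:

```python
# import daft  # pip install daft -- uncomment to perform an actual read

def hf_parquet_path(repo: str) -> str:
    """Build the Hub URI that Daft's Parquet reader understands."""
    return f"hf://datasets/{repo}"

# Placeholder repo id; substitute a real one such as "user/my-dataset".
path = hf_parquet_path("user/my-dataset")

# Lazy scan: only Parquet metadata is touched at plan time.
# df = daft.read_parquet(path)
```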

If Parquet files are not available for the dataset (signaled by a FileNotFoundError or an HTTP 400 error), Daft falls back to using the HuggingFace datasets library. This fallback loads all splits of the dataset, concatenates them, converts the data to Arrow format, and wraps the result in a Daft DataFrame.

The lazy loading approach means that no data is actually read until an action (such as .collect() or .show()) is triggered, enabling efficient query planning and optimization.
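The deferred-execution idea can be illustrated with a small pure-Python sketch. This is an analogy, not Daft's actual planner: operations are recorded into a plan, and I/O happens only when the action runs:

```python
class LazyFrame:
    """Toy stand-in for a lazy DataFrame: records operations, defers I/O."""

    def __init__(self, source, ops=()):
        self.source = source   # zero-argument callable that yields rows
        self.ops = tuple(ops)  # recorded plan, not yet executed

    def where(self, predicate):
        # Build a new plan node; no data is read here.
        return LazyFrame(self.source, self.ops + (predicate,))

    def collect(self):
        # The action: reading and filtering happen only now.
        rows = list(self.source())
        for predicate in self.ops:
            rows = [row for row in rows if predicate(row)]
        return rows

# Usage: nothing is read until .collect() is called.
frame = LazyFrame(lambda: iter([1, 5, 10])).where(lambda x: x > 3)
```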

Usage

Use this technique when you need to load public or private HuggingFace datasets for processing in a distributed data pipeline. It is particularly useful for:

  • Loading ML training and evaluation datasets for feature engineering
  • Processing large-scale NLP or vision datasets with Daft's parallel execution
  • Building data pipelines that combine HuggingFace datasets with other data sources

Theoretical Basis

This technique follows the data ingestion pattern of lazy scanning remote columnar files with schema inference. The key principles are:

  1. Lazy evaluation: The scan plan is constructed without reading data, deferring I/O to execution time.
  2. Columnar-first strategy: Parquet (a columnar format) is preferred because it supports predicate pushdown and projection pushdown, reading only the columns and rows needed.
  3. Graceful degradation: When the preferred format is unavailable, the system falls back to an alternative reader (the datasets library) to maximize dataset compatibility.
  4. Schema inference: The schema is automatically inferred from the underlying data format, eliminating the need for manual schema specification.

Pseudocode:
1. Construct hf:// path from repo identifier
2. Attempt lazy Parquet scan via read_parquet("hf://datasets/{repo}")
3. If FileNotFoundError or HTTP 400:
   a. Load dataset via HuggingFace datasets library
   b. Concatenate all splits
   c. Convert to Arrow table
   d. Wrap in Daft DataFrame
4. Return lazy DataFrame
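The steps above can be sketched as a small function with injected readers; parquet_reader and datasets_reader are hypothetical stand-ins (for daft.read_parquet and the datasets-library fallback, respectively) so the control flow runs without network access:

```python
def read_with_fallback(repo, parquet_reader, datasets_reader):
    """Parquet-first read with graceful degradation (control-flow sketch)."""
    try:
        # Steps 1-2: attempt the lazy Parquet fast path over hf://.
        return parquet_reader(f"hf://datasets/{repo}")
    except FileNotFoundError:
        # Step 3: fall back to the datasets library (in Daft this branch
        # also covers the HTTP 400 case); the fallback loads all splits,
        # concatenates them, and converts the result to Arrow.
        return datasets_reader(repo)

def failing_parquet(path):
    # Simulates a repo that exposes no Parquet files.
    raise FileNotFoundError(path)

# Usage: the failed fast path falls through to the fallback reader.
result = read_with_fallback(
    "user/no-parquet", failing_parquet, lambda repo: f"fallback:{repo}"
)
```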

Related Pages

Implemented By
