
Implementation:Huggingface Datasets Lance Builder

From Leeroopedia
Knowledge Sources
Domains: Data_Loading, Vector_Database
Last Updated: 2026-02-14 18:00 GMT

Overview

Packaged dataset builder that loads datasets stored in the Lance columnar format into the Arrow-backed datasets of the Hugging Face Datasets library.

Description

The Lance class is a packaged dataset builder that extends both ArrowBasedBuilder and _CountableBuilderMixin and reads datasets stored in the Lance columnar format. The Lance format itself is a modern columnar data format optimized for machine learning workloads, especially those involving vector data. The builder is configured through LanceConfig, a dataclass extending BuilderConfig, with four fields: features, columns (column subset selection), batch_size (default 256), and token (an optional Hugging Face authentication token).

The builder supports two data organization modes: (1) Lance datasets with transaction/index/version metadata directories, which are opened via lance.dataset() and iterated by fragment; and (2) standalone Lance files, which are read individually via lance.file.LanceFileReader. When features are not explicitly provided, the builder infers them from the Arrow schema and inspects the first bytes of binary columns to detect media types (images, audio, video) using magic byte signatures, automatically mapping them to the appropriate Image, Audio, or Video feature types.
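
The choice between the two modes can be illustrated with a minimal sketch. It assumes that the presence of any metadata file (matched by the METADATA_EXTENSIONS listed in the Signature section) marks a full Lance dataset, and that otherwise any .lance files are treated as standalone. The helper name detect_mode is hypothetical and is not part of the builder; the real builder opens the data with lance.dataset() or lance.file.LanceFileReader, which this sketch does not call.

```python
from typing import Iterable

# Copied from the METADATA_EXTENSIONS attribute shown in the Signature section.
METADATA_EXTENSIONS = (".idx", ".txn", ".manifest")


def detect_mode(paths: Iterable[str]) -> str:
    """Hypothetical helper: classify resolved data files as 'dataset' or 'files'."""
    paths = list(paths)
    # Assumption: any metadata file signals a full Lance dataset directory.
    if any(p.endswith(METADATA_EXTENSIONS) for p in paths):
        return "dataset"
    # Assumption: with no metadata present, .lance files are standalone files.
    if any(p.endswith(".lance") for p in paths):
        return "files"
    raise ValueError("no Lance metadata or .lance files found")
```

Under these assumptions, a file list containing ds.lance/_versions/1.manifest is classified as a dataset, while a list of bare part-0.lance files is classified as standalone files.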

The module includes helper functions resolve_dataset_uris (finds dataset root directories from metadata file paths), _fix_hf_uri (strips revision tags from HF URIs), and _fix_local_version_file (resolves symlinks in version files).
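
What the first two helpers are described as doing can be sketched roughly as follows. This is not the module's code: the function names dataset_roots and strip_revision are hypothetical, the metadata directory names (_versions, _transactions, _indices) are an assumption about the Lance on-disk layout, and the URI handling assumes the hf://datasets/<user>/<repo>@<revision>/<path> form.

```python
import re
from typing import Iterable, List

# Assumed names of the Lance metadata directories; not taken from this page.
METADATA_DIRS = ("_versions", "_transactions", "_indices")


def dataset_roots(metadata_paths: Iterable[str]) -> List[str]:
    """Hypothetical: map metadata file paths to their dataset root directories."""
    roots: List[str] = []
    for path in metadata_paths:
        parts = path.split("/")
        for i, part in enumerate(parts):
            if part in METADATA_DIRS:
                # Everything before the metadata directory is the dataset root.
                root = "/".join(parts[:i])
                if root not in roots:
                    roots.append(root)
                break
    return roots


def strip_revision(uri: str) -> str:
    """Hypothetical: drop the '@<revision>' tag from an hf:// URI, if present."""
    return re.sub(r"^(hf://[^@]+)@[^/]+", r"\1", uri, count=1)
```

For example, dataset_roots applied to data/ds.lance/_versions/1.manifest and data/ds.lance/_transactions/0-abc.txn yields the single root data/ds.lance, and strip_revision leaves local paths unchanged.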

Usage

Use Lance via load_dataset("lance", data_files=...) to load Lance format datasets. The builder currently requires the main branch when loading from the Hugging Face Hub and raises NotImplementedError for other revisions.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/packaged_modules/lance/lance.py
  • Lines: 1-235

Signature

@dataclass
class LanceConfig(datasets.BuilderConfig):
    features: Optional[datasets.Features] = None
    columns: Optional[List[str]] = None
    batch_size: Optional[int] = 256
    token: Optional[str] = None


class Lance(datasets.ArrowBasedBuilder, datasets.builder._CountableBuilderMixin):
    BUILDER_CONFIG_CLASS = LanceConfig
    METADATA_EXTENSIONS = [".idx", ".txn", ".manifest"]

    # Method bodies are elided; only the signatures are shown.
    def _info(self): ...
    def _split_generators(self, dl_manager): ...
    def _cast_table(self, pa_table: pa.Table) -> pa.Table: ...
    def _generate_shards(self, fragments, lance_files_paths, lance_files): ...
    def _generate_num_examples(self, fragments, lance_files_paths, lance_files): ...
    def _generate_tables(self, fragments, lance_files_paths, lance_files): ...

Import

from datasets.packaged_modules.lance.lance import Lance, LanceConfig

I/O Contract

Inputs (LanceConfig)

| Name | Type | Required | Description |
|---|---|---|---|
| features | Optional[Features] | No | Schema describing the dataset features. If None, features are inferred from the Lance schema with automatic media type detection. |
| columns | Optional[List[str]] | No | List of column names to load. Other columns are ignored. Loads all columns by default. |
| batch_size | Optional[int] | No | Number of rows per RecordBatch during iteration. Defaults to 256. |
| token | Optional[str] | No | HF authentication token for downloading datasets from the Hugging Face Hub. |
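
To make the effect of columns and batch_size concrete, here is a small pure-Python illustration. It only mimics the semantics described in the table (column subset selection and fixed-size batches) on a list of dictionaries; the function iter_batches is hypothetical, and the builder itself works with Arrow RecordBatches rather than Python lists.

```python
from typing import Dict, Iterator, List, Optional, Sequence


def iter_batches(
    rows: Sequence[Dict[str, object]],
    batch_size: int = 256,
    columns: Optional[List[str]] = None,
) -> Iterator[List[Dict[str, object]]]:
    """Hypothetical: yield rows in batches, keeping only the requested columns."""
    for start in range(0, len(rows), batch_size):
        batch = list(rows[start:start + batch_size])
        if columns is not None:
            # Drop every column that was not requested.
            batch = [{name: row[name] for name in columns} for row in batch]
        yield batch


rows = [{"id": i, "embedding": [0.0, 1.0], "text": "x"} for i in range(600)]
sizes = [len(b) for b in iter_batches(rows, batch_size=256, columns=["id", "embedding"])]
print(sizes)  # [256, 256, 88]
```

With the default batch_size of 256, 600 rows split into two full batches and a final batch of 88 rows.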

Outputs

| Name | Type | Description |
|---|---|---|
| dataset | Dataset | An Arrow-backed Dataset constructed from the Lance format data, with automatic feature type inference for binary columns containing images, audio, or video. |

Magic Byte Detection

The builder detects media types in binary columns by inspecting the first 16 bytes:

| Magic Bytes | Extension | Feature Type |
|---|---|---|
| 89 50 4E 47 | .png | Image |
| FF D8 | .jpg | Image |
| 49 49 | .tif | Image |
| 47 49 46 38 | .gif | Image |
| 1A 45 DF A3 | .mkv | Video |
| 66 74 79 70 69 73 6F 6D | .mp4 | Video |
| 52 49 46 46 | .avi / .wav | Video / Audio |
| 49 44 33 | .mp3 | Audio |
| 66 4C 61 43 | .flac | Audio |
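
The table can be turned into a lookup along the following lines. This is a sketch rather than the builder's detection code, and it makes two assumptions that the table does not state: the MP4 signature is matched at byte offset 4 (where the ftyp box type sits in ISO base media files) while all others are matched at offset 0, and the shared RIFF signature is disambiguated by the form type at bytes 8 to 11 (WAVE or AVI). The name guess_media_type is hypothetical.

```python
from typing import Optional, Tuple

# (offset, magic bytes, extension, feature type), copied from the table above.
# The offsets are an assumption of this sketch, not stated in the table.
SIGNATURES = [
    (0, bytes.fromhex("89504E47"), ".png", "Image"),
    (0, bytes.fromhex("FFD8"), ".jpg", "Image"),
    (0, bytes.fromhex("4949"), ".tif", "Image"),
    (0, bytes.fromhex("47494638"), ".gif", "Image"),
    (0, bytes.fromhex("1A45DFA3"), ".mkv", "Video"),
    (4, bytes.fromhex("6674797069736F6D"), ".mp4", "Video"),
    (0, bytes.fromhex("494433"), ".mp3", "Audio"),
    (0, bytes.fromhex("664C6143"), ".flac", "Audio"),
]
RIFF = bytes.fromhex("52494646")


def guess_media_type(data: bytes) -> Optional[Tuple[str, str]]:
    """Hypothetical: return (extension, feature type) or None if unrecognized."""
    header = data[:16]  # only the first 16 bytes are inspected
    if header.startswith(RIFF):
        # Assumption: the RIFF form type tells WAV and AVI apart.
        form = header[8:12]
        if form == b"WAVE":
            return (".wav", "Audio")
        if form == b"AVI ":
            return (".avi", "Video")
        return None
    for offset, magic, ext, feature in SIGNATURES:
        if header[offset:offset + len(magic)] == magic:
            return (ext, feature)
    return None
```

A value that matches no signature returns None, which in this sketch would correspond to leaving the column as plain binary.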

Usage Examples

Basic Usage

from datasets import load_dataset

# Load a Lance dataset
ds = load_dataset("lance", data_files="data/dataset.lance/**", split="train")
print(ds[0])

Loading Specific Columns

from datasets import load_dataset

# Load only specific columns from a Lance dataset
ds = load_dataset(
    "lance",
    data_files="data/embeddings.lance/**",
    columns=["id", "embedding"],
    split="train",
)
print(ds.column_names)  # ['id', 'embedding']
