Implementation:Huggingface Datasets Lance Builder
| Knowledge Sources | |
|---|---|
| Domains | Data_Loading, Vector_Database |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Packaged dataset builder for loading Lance columnar format datasets into Arrow-backed datasets provided by the HuggingFace Datasets library.
Description
The Lance class is a packaged dataset builder that extends both ArrowBasedBuilder and _CountableBuilderMixin and reads datasets stored in the Lance columnar format, a modern columnar data format optimized for machine learning workloads, especially those involving vector data. The builder is configured via LanceConfig, a dataclass extending BuilderConfig, with fields for features, columns (column subset selection), batch_size (default 256), and token (optional HF authentication token).
The builder supports two data organization modes: (1) Lance datasets with transaction/index/version metadata directories, which are opened via lance.dataset() and iterated by fragment; and (2) standalone Lance files, which are read individually via lance.file.LanceFileReader. When features are not explicitly provided, the builder infers them from the Arrow schema and inspects the first bytes of binary columns to detect media types (images, audio, video) using magic byte signatures, automatically mapping them to the appropriate Image, Audio, or Video feature types.
The module includes helper functions resolve_dataset_uris (finds dataset root directories from metadata file paths), _fix_hf_uri (strips revision tags from HF URIs), and _fix_local_version_file (resolves symlinks in version files).
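To make the root-finding idea concrete, here is a minimal sketch of how dataset roots could be derived from metadata file paths. It is not the module's resolve_dataset_uris: the function name find_dataset_roots, the metadata directory names (_versions, _transactions, _indices), and the restriction to plain POSIX-style paths are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Extensions listed in METADATA_EXTENSIONS in the Signature section.
METADATA_EXTENSIONS = (".idx", ".txn", ".manifest")
# Assumed names of the metadata directories inside a Lance dataset root.
METADATA_DIRS = ("_versions", "_transactions", "_indices")


def find_dataset_roots(paths):
    """Hypothetical helper: return dataset roots implied by metadata file paths."""
    roots = []
    for raw in paths:
        path = PurePosixPath(raw)
        if path.suffix not in METADATA_EXTENSIONS:
            continue  # not a metadata file, so it says nothing about a root
        parts = path.parts
        for i, part in enumerate(parts):
            if part in METADATA_DIRS:
                # Everything before the metadata directory is the dataset root.
                root = str(PurePosixPath(*parts[:i]))
                if root not in roots:
                    roots.append(root)
                break
    return roots


example_paths = [
    "data/ds.lance/_versions/1.manifest",
    "data/ds.lance/_transactions/0-abc.txn",
    "data/ds.lance/_indices/some-uuid/index.idx",
    "data/ds.lance/data/fragment.lance",
    "other/standalone.lance",
]
print(find_dataset_roots(example_paths))
```

Paths that carry no metadata extension, such as other/standalone.lance here, are left out of the result; under the two modes described above they would be candidates for the standalone-file route.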
Usage
Use Lance via load_dataset("lance", data_files=...) to load Lance format datasets. The builder currently requires the main branch when loading from the Hugging Face Hub and raises NotImplementedError for other revisions.
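The main-branch restriction can be pictured with a small sketch. This is one possible reading of the behaviour described above, not the module's _fix_hf_uri: the function name strip_main_revision and the regular expression are hypothetical, and the sketch only assumes the usual hf://datasets/<namespace>/<name>@<revision>/<path> URI shape.

```python
import re

# Assumed URI shape: hf://datasets/<namespace>/<name>[@<revision>][/<path>]
_HF_URI = re.compile(r"^(hf://datasets/[^/@]+/[^/@]+)@([^/]+)(/.*)?$")


def strip_main_revision(uri: str) -> str:
    """Hypothetical helper: drop an '@main' tag, reject any other revision."""
    match = _HF_URI.match(uri)
    if match is None:
        return uri  # no revision tag present, nothing to strip
    repo, revision, rest = match.groups()
    if revision != "main":
        raise NotImplementedError(
            f"Only the main branch is supported in this sketch, got {revision!r}"
        )
    return repo + (rest or "")


print(strip_main_revision("hf://datasets/user/repo@main/data/train.lance"))
```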
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/packaged_modules/lance/lance.py
- Lines: 1-235
Signature
@dataclass
class LanceConfig(datasets.BuilderConfig):
    features: Optional[datasets.Features] = None
    columns: Optional[List[str]] = None
    batch_size: Optional[int] = 256
    token: Optional[str] = None

class Lance(datasets.ArrowBasedBuilder, datasets.builder._CountableBuilderMixin):
    BUILDER_CONFIG_CLASS = LanceConfig
    METADATA_EXTENSIONS = [".idx", ".txn", ".manifest"]

    def _info(self): ...
    def _split_generators(self, dl_manager): ...
    def _cast_table(self, pa_table: pa.Table) -> pa.Table: ...
    def _generate_shards(self, fragments, lance_files_paths, lance_files): ...
    def _generate_num_examples(self, fragments, lance_files_paths, lance_files): ...
    def _generate_tables(self, fragments, lance_files_paths, lance_files): ...
Import
from datasets.packaged_modules.lance.lance import Lance, LanceConfig
I/O Contract
Inputs (LanceConfig)
| Name | Type | Required | Description |
|---|---|---|---|
| features | Optional[Features] | No | Schema describing the dataset features. If None, features are inferred from the Lance schema with automatic media type detection. |
| columns | Optional[List[str]] | No | List of column names to load. Other columns are ignored. Loads all columns by default. |
| batch_size | Optional[int] | No | Number of rows per RecordBatch during iteration. Defaults to 256. |
| token | Optional[str] | No | HF authentication token for downloading datasets from the Hugging Face Hub. |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | An Arrow-backed Dataset constructed from the Lance format data, with automatic feature type inference for binary columns containing images, audio, or video. |
Magic Byte Detection
The builder detects media types in binary columns by inspecting the first 16 bytes:
| Magic Bytes | Extension | Feature Type |
|---|---|---|
| 89 50 4E 47 | .png | Image |
| FF D8 | .jpg | Image |
| 49 49 | .tif | Image |
| 47 49 46 38 | .gif | Image |
| 1A 45 DF A3 | .mkv | Video |
| 66 74 79 70 69 73 6F 6D | .mp4 | Video |
| 52 49 46 46 | .avi / .wav | Video / Audio |
| 49 44 33 | .mp3 | Audio |
| 66 4C 61 43 | .flac | Audio |
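The table above can be turned into a small lookup routine. The following is a sketch only, not the builder's code: the name guess_feature_type is hypothetical, feature types are returned as plain strings so that the example needs nothing beyond the standard library, and two matching rules are assumptions. First, the MP4 signature is checked at byte offset 4, since MP4 files begin with a 4-byte box size before "ftyp". Second, the shared RIFF signature is disambiguated by the form type at bytes 8 to 12 ("AVI " or "WAVE").

```python
from typing import Optional

# (magic prefix, feature type) pairs taken from the table above.
PREFIX_SIGNATURES = [
    (bytes.fromhex("89504E47"), "Image"),  # .png
    (bytes.fromhex("FFD8"), "Image"),      # .jpg
    (bytes.fromhex("4949"), "Image"),      # .tif
    (bytes.fromhex("47494638"), "Image"),  # .gif
    (bytes.fromhex("1A45DFA3"), "Video"),  # .mkv
    (bytes.fromhex("494433"), "Audio"),    # .mp3
    (bytes.fromhex("664C6143"), "Audio"),  # .flac
]
MP4_SIGNATURE = bytes.fromhex("6674797069736F6D")  # b"ftypisom"
RIFF_SIGNATURE = bytes.fromhex("52494646")         # b"RIFF"


def guess_feature_type(data: bytes, header_size: int = 16) -> Optional[str]:
    """Hypothetical helper: map the first bytes of a value to a feature type name."""
    header = data[:header_size]
    if header.startswith(RIFF_SIGNATURE):
        # Assumption: use the RIFF form type to tell AVI video from WAV audio.
        form_type = header[8:12]
        if form_type == b"AVI ":
            return "Video"
        if form_type == b"WAVE":
            return "Audio"
        return None
    if header[4:12] == MP4_SIGNATURE:
        # Assumption: the MP4 signature sits after the 4-byte box size.
        return "Video"
    for magic, feature_type in PREFIX_SIGNATURES:
        if header.startswith(magic):
            return feature_type
    return None


png_header = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
print(guess_feature_type(png_header))
```

A column would then be mapped to Image, Audio, or Video only when the sampled value yields a match; values that match nothing stay plain binary in this sketch.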
Usage Examples
Basic Usage
from datasets import load_dataset
# Load a Lance dataset
ds = load_dataset("lance", data_files="data/dataset.lance/**", split="train")
print(ds[0])
Loading Specific Columns
from datasets import load_dataset
# Load only specific columns from a Lance dataset
ds = load_dataset(
    "lance",
    data_files="data/embeddings.lance/**",
    columns=["id", "embedding"],
    split="train",
)
print(ds.column_names)  # ["id", "embedding"]