Principle:Huggingface Datasets Dataset Item Access
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Dataset Item Access is the mechanism for retrieving individual examples, slices, or columns from a loaded dataset using index-based or name-based lookups.
Description
Once a dataset is loaded as a Dataset object backed by an Arrow table, users need efficient ways to access its contents. Dataset Item Access provides a unified interface that supports multiple access patterns through Python's subscript operator ([]):
- Single row access: Using an integer index returns a dictionary mapping column names to values for that row.
- Slice access: Using a Python slice (e.g., [10:20]) returns a dictionary of lists, one list per column, covering the specified row range.
- Batch access: Using a list or array of integer indices returns a dictionary of lists for the selected rows.
- Column access: Using a string column name returns a Column object (or a list of values) for that column across all rows.
- Boolean indexing: Using a list or array of booleans selects rows where the value is True.
The access mechanism is format-aware. The Dataset object can be configured (via set_format) to return data as Python objects, NumPy arrays, PyTorch tensors, TensorFlow tensors, JAX arrays, Arrow tables, Pandas DataFrames, or Polars DataFrames. The item access layer applies the appropriate formatting transformation after querying the underlying Arrow table.
The architecture also supports indexed datasets where an indices mapping redirects logical indices to physical row positions, enabling efficient subset views without copying data.
Usage
Apply Dataset Item Access when:
- Retrieving a single example for inspection or debugging.
- Selecting a batch of examples for model input during training or evaluation.
- Extracting a specific column (e.g., labels) for analysis.
- Slicing a dataset to take a quick look at a range of examples.
- Working with a formatted dataset where output must be in a specific tensor framework format.
Theoretical Basis
The access mechanism operates in two layers:
    __getitem__(key):
        If key is a string:
            Return Column(dataset, key) for lazy column access
        Else:
            Delegate to _getitem(key)

    _getitem(key):
        1. DETERMINE format settings (type, columns, kwargs)
        2. CREATE formatter for the configured output format
        3. QUERY Arrow table:
            - If indices mapping exists: translate logical indices to physical
            - Apply key (int, slice, list, or bool mask) to Arrow table
            - Return a sub-table (pa.Table)
        4. FORMAT the sub-table:
            - Select only the configured format_columns (if set)
            - Convert Arrow data to the target format (dict, numpy, torch, etc.)
        5. Return formatted output
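The two layers can be mimicked with a small pure-Python toy. `ToyDataset` and all of its attributes are hypothetical names invented for this sketch; it stores columns as plain lists instead of an Arrow table, resolves column access eagerly, and is not the library's implementation.

```python
class ToyDataset:
    """Illustrative stand-in for the two-layer access flow (not the real class)."""

    def __init__(self, columns, indices=None, formatter=dict):
        self.columns = columns      # column name -> list of values
        self.indices = indices      # optional logical -> physical row mapping
        self.formatter = formatter  # converts the queried sub-table to the output type

    def __len__(self):
        if self.indices is not None:
            return len(self.indices)
        return len(next(iter(self.columns.values())))

    def __getitem__(self, key):
        if isinstance(key, str):
            # Layer 1: column access (eager here, for brevity).
            rows = range(len(self)) if self.indices is None else self.indices
            return [self.columns[key][i] for i in rows]
        return self._getitem(key)

    def _getitem(self, key):
        # Layer 2, query step: normalise the key to logical row positions.
        positions = range(len(self))
        if isinstance(key, int):
            logical = [positions[key]]          # handles negatives, raises IndexError
        elif isinstance(key, slice):
            logical = list(positions[key])
        else:
            logical = [positions[i] for i in key]
        # Translate logical positions to physical rows if a mapping exists.
        physical = logical if self.indices is None else [self.indices[i] for i in logical]
        sub = {name: [vals[i] for i in physical] for name, vals in self.columns.items()}
        if isinstance(key, int):
            sub = {name: vals[0] for name, vals in sub.items()}
        # Layer 2, format step: convert only the rows that were queried.
        return self.formatter(sub)


toy = ToyDataset({"id": [10, 11, 12, 13]}, indices=[3, 0, 2])
print(toy[0], toy[-1], toy[0:2], toy[[2, 0]], toy["id"])
```

The formatter is injected as a callable so that the query logic stays independent of the output type, which mirrors the separation between steps 3 and 4.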
The querying step leverages Arrow's zero-copy slicing for contiguous ranges, so the query for a slice does not copy row data. For non-contiguous index lists, Arrow performs a take operation, which gathers the selected rows into a new table. Formatting is applied only to the queried sub-table, after the query, so rows that were not requested are never converted.