Implementation:Huggingface Datasets Dataset getitem
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool for accessing individual examples, slices, or columns from a loaded dataset provided by the HuggingFace Datasets library.
Description
Dataset.__getitem__ implements Python's subscript operator to provide flexible access to dataset contents. It dispatches on the key type. String keys return a Column object for lazy column access, unless the format is set to arrow, pandas, or polars, in which case the formatted column is returned directly. Integer, slice, iterable, and boolean keys are delegated to the internal _getitem method. _getitem queries the underlying Arrow table with query_table (which resolves the indices mapping used by subset views), creates a format-specific formatter, and applies format_table to convert the Arrow data into the configured output format (Python dicts, NumPy arrays, PyTorch tensors, etc.).
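The dispatch described above can be pictured with a small pure-Python model. This is a simplified sketch, not the library's code: ToyDataset, _resolve, and _query are hypothetical names, plain lists stand in for the Arrow table, the formatter is an ordinary callable, and boolean keys are left out.

```python
from collections.abc import Iterable


class ToyDataset:
    """Hypothetical stand-in for Dataset: columnar data plus an optional indices mapping."""

    def __init__(self, columns, indices=None, formatter=None):
        self.columns = columns      # dict: column name -> list of values (plays the Arrow table)
        self.indices = indices      # optional row mapping, as used by subset views
        self.formatter = formatter or (lambda value: value)  # identity = plain Python output

    def __len__(self):
        if self.indices is not None:
            return len(self.indices)
        return len(next(iter(self.columns.values())))

    def _resolve(self, row):
        # Translate a view-relative row number into a row of the underlying data.
        return self.indices[row] if self.indices is not None else row

    def _query(self, key):
        # Loosely plays the role of query_table: select rows, still in columnar form.
        positions = range(len(self))
        if isinstance(key, int):
            rows = [positions[key]]  # handles negative keys, raises IndexError if out of range
        elif isinstance(key, slice):
            rows = list(positions[key])
        elif isinstance(key, Iterable):
            rows = [positions[i] for i in key]
        else:
            raise TypeError(f"Unsupported key type: {type(key).__name__}")
        return {
            name: [values[self._resolve(r)] for r in rows]
            for name, values in self.columns.items()
        }

    def __getitem__(self, key):
        if isinstance(key, str):
            # Column access; the real library returns a lazy Column object here.
            return [self.columns[key][self._resolve(r)] for r in range(len(self))]
        table = self._query(key)
        if isinstance(key, int):
            # Single example: unwrap the one-element lists into scalars.
            return {name: self.formatter(values[0]) for name, values in table.items()}
        return {name: [self.formatter(v) for v in values] for name, values in table.items()}


ds = ToyDataset({"text": ["a", "b", "c", "d"], "label": [1, 0, 1, 0]})
view = ToyDataset(ds.columns, indices=[3, 0])  # two-row subset view over the same data

print(ds[0])          # {'text': 'a', 'label': 1}
print(ds[1:3])        # {'text': ['b', 'c'], 'label': [0, 1]}
print(ds[[0, 2]])     # {'text': ['a', 'c'], 'label': [1, 1]}
print(view[0])        # {'text': 'd', 'label': 0}
print(view["label"])  # [0, 1]
```

The view shares its columns with ds and differs only in the mapping, which is the idea behind handling subset views at query time rather than copying data.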
Usage
Use dataset[key] whenever you need to retrieve data from a loaded Dataset. This is the primary data access interface and is used extensively in training loops, data inspection, and preprocessing pipelines.
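As an illustration of the training-loop use case, the sketch below walks a dataset in fixed-size batches using only len() and slice keys. iter_batches and ColumnarStub are hypothetical names introduced here; the stub stands in for a loaded Dataset so the example runs without downloading anything, and it assumes slice access returns a dictionary of lists as described in this page.

```python
def iter_batches(dataset, batch_size):
    """Yield consecutive dict-of-lists batches using slice access."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]


class ColumnarStub:
    """Hypothetical stand-in that supports len() and slice keys like a loaded Dataset."""

    def __init__(self, columns):
        self.columns = columns

    def __len__(self):
        return len(next(iter(self.columns.values())))

    def __getitem__(self, key):
        # Slice every column the same way, returning a dictionary of lists.
        return {name: values[key] for name, values in self.columns.items()}


stub = ColumnarStub({"text": ["a", "b", "c", "d", "e"], "label": [1, 0, 1, 0, 1]})
batches = list(iter_batches(stub, batch_size=2))

for batch in batches:
    print(batch)
# {'text': ['a', 'b'], 'label': [1, 0]}
# {'text': ['c', 'd'], 'label': [1, 0]}
# {'text': ['e'], 'label': [1]}
```

The last batch is shorter than batch_size because slicing past the end simply truncates, so no special handling is needed for the remainder.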
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/arrow_dataset.py
- Lines: L2844-L2877 (_getitem and __getitem__)
Signature
def _getitem(self, key: Union[int, slice, str, ListLike[int]], **kwargs) -> Union[dict, list]:
@overload
def __getitem__(self, key: Union[int, slice, Iterable[int]]) -> dict: ...
@overload
def __getitem__(self, key: str) -> list: ...
def __getitem__(self, key):
Import
from datasets import load_dataset
ds = load_dataset("dataset_name", split="train")
# Access via subscript operator:
example = ds[0]
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| key | int | Yes (one of) | Returns a single example as a dictionary mapping column names to values. |
| key | slice | Yes (one of) | Returns a dictionary of lists for the specified row range. |
| key | Iterable[int] | Yes (one of) | Returns a dictionary of lists for the specified row indices. |
| key | str | Yes (one of) | Returns a Column object (or formatted column) for the named column. |
| key | Iterable[bool] | Yes (one of) | Returns a dictionary of lists for rows where the mask is True. |
Outputs
| Name | Type | Description |
|---|---|---|
| (return for int key) | dict | A dictionary mapping column names to scalar values for the single example. |
| (return for slice/list key) | dict | A dictionary mapping column names to lists of values for the selected rows. |
| (return for str key) | Column or formatted column | A Column object for lazy access, or the formatted column when the format is set to arrow, pandas, or polars. |
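Because slice and list keys return a dictionary of lists rather than a list of examples, it is sometimes convenient to turn a batch back into per-example dictionaries of the kind an integer key returns. The helper below is a minimal sketch of that conversion; batch_to_rows is a hypothetical name and is not part of the Datasets library.

```python
def batch_to_rows(batch):
    """Convert a dict of equal-length lists into a list of per-example dicts."""
    names = list(batch)
    lengths = {len(batch[name]) for name in names}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    # zip(*columns) walks the columns in lockstep, yielding one tuple per row.
    return [dict(zip(names, values)) for values in zip(*(batch[name] for name in names))]


batch = {"text": ["x", "y"], "label": [1, 0]}  # shape returned by a slice or list key
rows = batch_to_rows(batch)
print(rows)  # [{'text': 'x', 'label': 1}, {'text': 'y', 'label': 0}]
```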
Usage Examples
Basic Usage
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
# Access a single example by integer index
example = ds[0]
# {'text': 'the rock is destined to be ...', 'label': 1}
# Access a slice of examples
batch = ds[10:15]
# {'text': ['...', '...', '...', '...', '...'], 'label': [1, 0, 1, 0, 1]}
# Access a column by name
labels = ds["label"]
# Column([1, 0, 1, ...])
# Access specific rows by list of indices
selected = ds[[0, 5, 10]]
# {'text': ['...', '...', '...'], 'label': [1, 0, 1]}
Formatted Access
import torch
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
ds.set_format("torch", columns=["label"])
# Now __getitem__ returns PyTorch tensors
example = ds[0]
# {'label': tensor(1)}