Implementation:Huggingface Datasets Dataset getitem

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the Hugging Face Datasets library, for accessing individual examples, slices, or columns of a loaded dataset.

Description

Dataset.__getitem__ implements Python's subscript operator and dispatches on the key type. A string key returns a Column object for lazy column access, unless the format is set to arrow, pandas, or polars, in which case the formatted column is returned directly. Integer, slice, and iterable (of integers or booleans) keys are delegated to the internal _getitem method. _getitem queries the underlying Arrow table with query_table (which resolves any indices mapping used for subset views), creates a formatter for the configured format, and applies format_table to convert the Arrow data into the output type (Python dicts, NumPy arrays, PyTorch tensors, etc.).
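The dispatch described above can be illustrated with a small pure-Python model. This is only a sketch under stated assumptions, not the library's code: ToyDataset, _resolve, and the plain-list column return value are hypothetical stand-ins, and the Arrow table, formatter, and Column object are deliberately left out.

```python
from collections.abc import Iterable


class ToyDataset:
    """Toy column-oriented dataset; a stand-in, NOT the datasets implementation."""

    def __init__(self, columns, indices=None):
        self._columns = columns  # dict: column name -> list of values
        self._indices = indices  # optional row mapping that models a subset view

    def __len__(self):
        if self._indices is not None:
            return len(self._indices)
        return len(next(iter(self._columns.values()), []))

    def _resolve(self, i):
        # Map a logical row number to a physical row (hypothetical helper).
        if i < 0:
            i += len(self)
        if not 0 <= i < len(self):
            raise IndexError(f"row index {i} out of range")
        return self._indices[i] if self._indices is not None else i

    def _getitem(self, key):
        if isinstance(key, int):
            row = self._resolve(key)
            # Single row: dict of scalars.
            return {name: col[row] for name, col in self._columns.items()}
        if isinstance(key, slice):
            key = range(*key.indices(len(self)))
        if isinstance(key, Iterable):
            rows = [self._resolve(i) for i in key]
            # Several rows: dict of lists.
            return {name: [col[r] for r in rows] for name, col in self._columns.items()}
        raise TypeError(f"unsupported key type: {type(key).__name__}")

    def __getitem__(self, key):
        if isinstance(key, str):
            # Column access; a plain list stands in for the lazy Column object.
            col = self._columns[key]
            return [col[self._resolve(i)] for i in range(len(self))]
        return self._getitem(key)


# A three-row view (physical rows 3, 1, 0) over four stored rows.
toy = ToyDataset(
    {"text": ["a", "b", "c", "d"], "label": [1, 0, 1, 0]},
    indices=[3, 1, 0],
)
single = toy[0]
window = toy[0:2]
column = toy["label"]
picked = toy[[2, 0]]
```

The point of the sketch is the shape of the result: an integer key yields a dict of scalars, a slice or iterable yields a dict of lists, and a string key yields a single column, all seen through the optional indices mapping.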

Usage

Use dataset[key] whenever you need to retrieve data from a loaded Dataset. This is the primary data access interface and is used extensively in training loops, data inspection, and preprocessing pipelines.
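As an illustration of the training-loop use case, the sketch below walks a dataset in fixed-size batches using slice keys. It assumes only that the object supports len() and slice subscripting; iter_batches and ColumnarStandIn are hypothetical names introduced here so the example runs without downloading data.

```python
def iter_batches(ds, batch_size):
    """Yield consecutive slices ds[start:start + batch_size] (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(ds), batch_size):
        yield ds[start:start + batch_size]


class ColumnarStandIn:
    """Minimal stand-in that returns a dict of lists for slice keys."""

    def __init__(self, columns):
        self.columns = columns

    def __len__(self):
        return len(next(iter(self.columns.values())))

    def __getitem__(self, key):
        return {name: values[key] for name, values in self.columns.items()}


demo = ColumnarStandIn({"text": ["a", "b", "c", "d", "e"], "label": [1, 0, 1, 0, 1]})
batches = list(iter_batches(demo, batch_size=2))
# The last batch is shorter because 5 rows do not divide evenly by 2.
```

With a real Dataset, the same loop would receive batches in whatever output format is currently configured on the dataset.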

Code Reference

Source Location

Repository: datasets
File: src/datasets/arrow_dataset.py
Lines: L2844-L2877 (_getitem and __getitem__)

Signature

def _getitem(self, key: Union[int, slice, str, ListLike[int]], **kwargs) -> Union[dict, list]:

@overload
def __getitem__(self, key: Union[int, slice, Iterable[int]]) -> dict: ...

@overload
def __getitem__(self, key: str) -> list: ...

def __getitem__(self, key):

Import

from datasets import load_dataset
ds = load_dataset("dataset_name", split="train")
# Access via subscript operator:
example = ds[0]

I/O Contract

Inputs

Name	Type	Required	Description
key	`int`	Yes (one of)	Returns a single example as a dictionary mapping column names to values.
key	`slice`	Yes (one of)	Returns a dictionary of lists for the specified row range.
key	`Iterable[int]`	Yes (one of)	Returns a dictionary of lists for the specified row indices.
key	`str`	Yes (one of)	Returns a `Column` object (or formatted column) for the named column.
key	`Iterable[bool]`	Yes (one of)	Returns a dictionary of lists for rows where the mask is `True`.

Outputs

Name	Type	Description
(return for int key)	`dict`	A dictionary mapping column names to scalar values for the single example.
(return for slice/iterable key)	`dict`	A dictionary mapping column names to lists of values for the selected rows.
(return for str key)	`Column` or `list`	A `Column` object for lazy access (default format), or a list/formatted object when a specific format is set.
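Because slice and iterable keys return a column-oriented dictionary of lists rather than a list of row dictionaries, code that wants per-row records has to transpose the result. A minimal sketch, assuming the default Python format; batch_to_rows is a hypothetical helper, not part of the datasets API.

```python
def batch_to_rows(batch):
    """Turn {'col': [v0, v1, ...], ...} into [{'col': v0, ...}, {'col': v1, ...}, ...]."""
    names = list(batch)
    if not names:
        return []
    lengths = {len(batch[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("all columns must have the same length")
    columns = [batch[name] for name in names]
    # zip(*columns) walks the columns in lockstep, one tuple per row.
    return [dict(zip(names, values)) for values in zip(*columns)]


rows = batch_to_rows({"text": ["a", "b"], "label": [1, 0]})
```

Each element of rows then has the same shape as the dictionary returned for a single integer key.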

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")

# Access a single example by integer index
example = ds[0]
# {'text': 'the rock is destined to be ...', 'label': 1}

# Access a slice of examples (rows 10 through 14)
batch = ds[10:15]
# {'text': [...], 'label': [...]}  (five values per column)

# Access a column by name
labels = ds["label"]
# A Column object in the default format (lazy view over the 'label' column)

# Access specific rows by list of indices
selected = ds[[0, 5, 10]]
# {'text': [...], 'label': [...]}  (three values per column)

Formatted Access

import torch
from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
ds.set_format("torch", columns=["label"])

# Now __getitem__ returns PyTorch tensors
example = ds[0]
# {'label': tensor(1)}

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Dataset_Item_Access
