Implementation:Huggingface Datasets Dataset getitem

From Leeroopedia
Knowledge Sources
Domains: Data_Engineering, NLP
Last Updated: 2026-02-14 18:00 GMT

Overview

A concrete tool, provided by the Hugging Face Datasets library, for accessing individual examples, slices, or columns of a loaded dataset.

Description

Dataset.__getitem__ implements Python's subscript operator to provide flexible access to dataset contents. It dispatches on the key type. A string key returns a Column object for lazy column access, unless the format is set to arrow, pandas, or polars, in which case the call is passed on to _getitem and the formatted column is returned directly. Integer, slice, and iterable keys (of indices or booleans) are always delegated to the internal _getitem method. _getitem queries the underlying Arrow table with query_table (which resolves the indices mapping used by subset views), creates a formatter for the configured format, and applies format_table to convert the Arrow data into the output format (Python dicts, NumPy arrays, PyTorch tensors, etc.).
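The dispatch described above can be modeled with a minimal pure-Python sketch. ToyDataset and ToyColumn are hypothetical names invented for illustration, and plain lists stand in for the Arrow table, query_table, and the formatter. This is a simplified model under those assumptions, not the library's implementation.

```python
class ToyColumn:
    """Hypothetical stand-in for a lazy column handle: it stores no data itself."""

    def __init__(self, dataset, name):
        self.dataset = dataset
        self.name = name

    def __iter__(self):
        # Values are read from the parent dataset only when the column is iterated.
        return iter(self.dataset._getitem(self.name))


class ToyDataset:
    """Hypothetical, list-backed model of the key-type dispatch."""

    DIRECT_FORMATS = ("arrow", "pandas", "polars")

    def __init__(self, columns, indices=None, format_type=None):
        self._columns = columns          # dict: column name -> list of values
        self._indices = indices          # optional row mapping for subset views
        self._format_type = format_type  # None means the default Python format

    def __len__(self):
        if self._indices is not None:
            return len(self._indices)
        return len(next(iter(self._columns.values())))

    def _rows(self, key):
        # Resolve the key to positions in this view, then map them through
        # the indices mapping (if any) to positions in the stored data.
        n = len(self)
        if isinstance(key, int):
            positions = [key % n]
        elif isinstance(key, slice):
            positions = list(range(*key.indices(n)))
        else:
            positions = [int(i) % n for i in key]
        if self._indices is not None:
            positions = [self._indices[p] for p in positions]
        return positions

    def _getitem(self, key):
        if isinstance(key, str):
            return [self._columns[key][r] for r in self._rows(slice(None))]
        rows = self._rows(key)
        batch = {name: [vals[r] for r in rows] for name, vals in self._columns.items()}
        if isinstance(key, int):
            # A single row is returned as a dict of scalars.
            return {name: vals[0] for name, vals in batch.items()}
        return batch

    def __getitem__(self, key):
        # String keys return a lazy column unless a "direct" format is set.
        if isinstance(key, str) and self._format_type not in self.DIRECT_FORMATS:
            return ToyColumn(self, key)
        return self._getitem(key)


toy = ToyDataset({"text": ["a", "b", "c"], "label": [0, 1, 0]})
view = ToyDataset(toy._columns, indices=[2, 0])            # subset view over the same data
direct = ToyDataset(toy._columns, format_type="pandas")    # a list stands in for a Series here
```

In this model, toy["label"] returns a ToyColumn that reads values only when iterated, while direct["label"] returns the values immediately, mirroring the two string-key paths described above.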

Usage

Use dataset[key] whenever you need to retrieve data from a loaded Dataset. This is the primary data access interface and is used extensively in training loops, data inspection, and preprocessing pipelines.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/arrow_dataset.py
  • Lines: L2844-L2877 (_getitem and __getitem__)

Signature

def _getitem(self, key: Union[int, slice, str, ListLike[int]], **kwargs) -> Union[dict, list]:

@overload
def __getitem__(self, key: Union[int, slice, Iterable[int]]) -> dict: ...

@overload
def __getitem__(self, key: str) -> list: ...

def __getitem__(self, key):

Import

from datasets import load_dataset
ds = load_dataset("dataset_name", split="train")
# Access via subscript operator:
example = ds[0]

I/O Contract

Inputs

Name | Type | Required | Description
key | int | Yes (one of) | Returns a single example as a dictionary mapping column names to values.
key | slice | Yes (one of) | Returns a dictionary of lists for the specified row range.
key | Iterable[int] | Yes (one of) | Returns a dictionary of lists for the specified row indices.
key | str | Yes (one of) | Returns a Column object (or formatted column) for the named column.
key | Iterable[bool] | Yes (one of) | Returns a dictionary of lists for rows where the mask is True.

Outputs

Name | Type | Description
(return for int key) | dict | A dictionary mapping column names to scalar values for the single example.
(return for slice/list key) | dict | A dictionary mapping column names to lists of values for the selected rows.
(return for str key) | Column or list | A Column object for lazy access (default format), or a list/formatted object when a specific format is set.

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")

# Access a single example by integer index
example = ds[0]
# {'text': 'the rock is destined to be ...', 'label': 1}

# Access a slice of examples (returns a dict of lists, five values per column)
batch = ds[10:15]
# {'text': ['...', '...', '...', '...', '...'], 'label': [...]}

# Access a column by name (returns a Column object in the default format)
labels = ds["label"]
# Wrap in list(...) to materialize the values

# Access specific rows by list of indices (returns a dict of lists, in the order given)
selected = ds[[0, 5, 10]]
# {'text': ['...', '...', '...'], 'label': [...]}

Formatted Access

import torch  # not referenced directly, but must be installed for the "torch" format
from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")
ds.set_format("torch", columns=["label"])

# Now __getitem__ returns PyTorch tensors
example = ds[0]
# {'label': tensor(1)}

Related Pages

Implements Principle
