Implementation:Huggingface Datasets Dataset Sort

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, ML_Preprocessing
Last Updated	2026-02-14 18:00 GMT

Overview

Concrete tool, provided by the HuggingFace Datasets library, for sorting dataset rows by one or more columns.

Description

The sort method creates a new dataset with rows sorted according to the values in one or more columns. It supports ascending and descending sort directions per column, configurable null value placement (at the beginning or end), and caching of the computed sort indices. Internally, it uses PyArrow's sort_indices function to compute the sort order and then applies select to reorder the rows via an indices mapping. Multi-column sorting is supported by passing a list of column names with a corresponding list of boolean reverse flags.

Usage

Use Dataset.sort when you need to order data by column values, such as sorting by sequence length for efficient batching, organizing data chronologically, or creating deterministic orderings for reproducible experiments.

Code Reference

Source Location

Repository: datasets
File: src/datasets/arrow_dataset.py
Lines: L4375-L4501

Signature

@transmit_format
@fingerprint_transform(inplace=False, ignore_kwargs=["load_from_cache_file", "indices_cache_file_name"])
def sort(
    self,
    column_names: Union[str, Sequence[str]],
    reverse: Union[bool, Sequence[bool]] = False,
    null_placement: str = "at_end",
    keep_in_memory: bool = False,
    load_from_cache_file: Optional[bool] = None,
    indices_cache_file_name: Optional[str] = None,
    writer_batch_size: Optional[int] = 1000,
    new_fingerprint: Optional[str] = None,
) -> "Dataset":

Import

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
sorted_ds = ds.sort("label")

I/O Contract

Inputs

Name	Type	Required	Description
column_names	`Union[str, Sequence[str]]`	Yes	Column name(s) to sort by.
reverse	`Union[bool, Sequence[bool]]`	No	Sort in descending order. A single bool applies to all columns; a list applies per column. Defaults to `False`.
null_placement	`str`	No	Place null values `"at_start"`/`"first"` or `"at_end"`/`"last"`. Defaults to `"at_end"`.
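keep_in_memory	`bool`	No	Keep the sorted indices in memory instead of writing them to a cache file. Defaults to `False`.
load_from_cache_file	`Optional[bool]`	No	If a cache file storing the sorted indices can be identified, use it instead of recomputing. Defaults to `None`.
indices_cache_file_name	`Optional[str]`	No	Path of the cache file in which to store the sorted indices, instead of an automatically generated name.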
writer_batch_size	`Optional[int]`	No	Rows per write operation. Defaults to 1000.
new_fingerprint	`Optional[str]`	No	The new fingerprint after transform.

Outputs

Name	Type	Description
return	`Dataset`	A new dataset sorted according to the specified column(s).

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
print(ds["label"][:10])
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Sort by label (ascending)
sorted_ds = ds.sort("label")
print(sorted_ds["label"][:10])
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Multi-column sort with mixed directions
another_sorted = ds.sort(["label", "text"], reverse=[True, False])
print(another_sorted["label"][:10])
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
