Implementation:Huggingface Datasets Dataset Sort

From Leeroopedia
Revision as of 12:58, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Huggingface_Datasets_Dataset_Sort.md)
Knowledge Sources
Domains Data_Engineering, ML_Preprocessing
Last Updated 2026-02-14 18:00 GMT

Overview

A concrete tool, provided by the Hugging Face Datasets library, for sorting dataset rows by one or more columns.

Description

The sort method returns a new dataset whose rows are ordered by the values in one or more columns. It supports an ascending or descending direction per column, configurable placement of null values (at the start or at the end), and caching of the computed sort indices. Internally, it calls PyArrow's sort_indices function to compute the sort order, then applies select to reorder the rows through an indices mapping rather than rewriting the underlying data. For multi-column sorts, pass a list of column names and, optionally, a list of boolean reverse flags of the same length.
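The two-step mechanism described above (compute sort indices, then reorder rows through them) can be sketched in plain Python. This is an illustrative analogue only, not the library's code: the helper names sort_indices and take below are hypothetical stand-ins written for this sketch, and the real method delegates the first step to PyArrow.

```python
def sort_indices(columns, sort_keys, null_placement="at_end"):
    """Return row positions in sorted order (plain-Python analogue, not PyArrow)."""
    n_rows = len(next(iter(columns.values())))
    order = list(range(n_rows))
    # Apply stable sorts from the least significant key to the most significant one.
    for name, descending in reversed(sort_keys):
        values = columns[name]
        present = [i for i in order if values[i] is not None]
        missing = [i for i in order if values[i] is None]
        present.sort(key=lambda i: values[i], reverse=descending)
        if null_placement == "at_start":
            order = missing + present
        else:
            order = present + missing
    return order


def take(columns, indices):
    """Reorder every column through the indices mapping."""
    return {name: [values[i] for i in indices] for name, values in columns.items()}


columns = {"label": [1, 0, None, 1, 0], "text": ["b", "d", "e", "a", "c"]}

# label descending, then text ascending; nulls go last by default
indices = sort_indices(columns, [("label", True), ("text", False)])
sorted_columns = take(columns, indices)

print(indices)                  # [3, 0, 4, 1, 2]
print(sorted_columns["label"])  # [1, 1, 0, 0, None]
print(sorted_columns["text"])   # ['a', 'b', 'c', 'd', 'e']
```

Because only the list of positions changes, the original columns are left untouched, which mirrors why the sorted indices can be cached separately from the data.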

Usage

Use Dataset.sort when you need to order data by column values, such as sorting by sequence length for efficient batching, organizing data chronologically, or creating deterministic orderings for reproducible experiments.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/arrow_dataset.py
  • Lines: L4375-L4501

Signature

@transmit_format
@fingerprint_transform(inplace=False, ignore_kwargs=["load_from_cache_file", "indices_cache_file_name"])
def sort(
    self,
    column_names: Union[str, Sequence[str]],
    reverse: Union[bool, Sequence[bool]] = False,
    null_placement: str = "at_end",
    keep_in_memory: bool = False,
    load_from_cache_file: Optional[bool] = None,
    indices_cache_file_name: Optional[str] = None,
    writer_batch_size: Optional[int] = 1000,
    new_fingerprint: Optional[str] = None,
) -> "Dataset":

Import

from datasets import load_dataset

# sort is a method on the Dataset returned by load_dataset, so it needs no separate import.
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
sorted_ds = ds.sort("label")

I/O Contract

Inputs

Name | Type | Required | Description
column_names | Union[str, Sequence[str]] | Yes | Column name(s) to sort by.
reverse | Union[bool, Sequence[bool]] | No | If True, sort in descending order. A single bool applies to all columns; a list applies per column and must match the number of column names. Defaults to False.
null_placement | str | No | Where to place null values: "at_start" (or "first") or "at_end" (or "last"). Defaults to "at_end".
keep_in_memory | bool | No | Keep the sorted indices in memory instead of writing them to a cache file. Defaults to False.
load_from_cache_file | Optional[bool] | No | Use cached sorted indices if available instead of recomputing them. Defaults to True if caching is enabled.
indices_cache_file_name | Optional[str] | No | Path of the cache file for the sorted indices.
writer_batch_size | Optional[int] | No | Rows per write operation for the cache file writer. Defaults to 1000.
new_fingerprint | Optional[str] | No | The new fingerprint of the dataset after the transform.

Outputs

Name | Type | Description
return | Dataset | A new dataset sorted according to the specified column(s).

Usage Examples

Basic Usage

from datasets import load_dataset

ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
print(ds["label"][:10])
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Sort by label (ascending)
sorted_ds = ds.sort("label")
print(sorted_ds["label"][:10])
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Multi-column sort with mixed directions
another_sorted = ds.sort(["label", "text"], reverse=[True, False])
print(another_sorted["label"][:10])
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
