Implementation:Huggingface Datasets Dataset Sort
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ML_Preprocessing |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Concrete tool, provided by the HuggingFace Datasets library, for sorting dataset rows by one or more columns.
Description
The sort method creates a new dataset whose rows are ordered by the values in one or more columns. It supports an ascending or descending direction per column, configurable placement of null values (at the beginning or at the end), and caching of the computed sort indices. Internally, it calls PyArrow's sort_indices function to compute the sort order, then applies select to reorder the rows through an indices mapping. To sort by several columns, pass a list of column names; reverse can then be a single bool that applies to every column or a list with one bool per column.
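The two-step approach described above (compute sort indices, then reorder rows through them) can be sketched in plain Python. This is an illustration only, not the library's code: `sort_indices` and `take` are hypothetical helpers standing in for PyArrow's sort_indices and the dataset's select, and the table is assumed to be a dict of equal-length lists.

```python
from typing import Any, Dict, List, Sequence, Tuple


def sort_indices(
    columns: Dict[str, List[Any]],
    keys: Sequence[Tuple[str, bool]],
    null_placement: str = "at_end",
) -> List[int]:
    """Return row indices that order `columns` by `keys`.

    `keys` is a list of (column_name, reverse) pairs, most significant first.
    Hypothetical helper for illustration; not part of any library.
    """
    n_rows = len(next(iter(columns.values())))
    order = list(range(n_rows))
    # Apply stable sorts from the least significant key to the most significant.
    for name, reverse in reversed(keys):
        values = columns[name]
        non_null = [i for i in order if values[i] is not None]
        nulls = [i for i in order if values[i] is None]
        # list.sort is stable, including when reverse=True.
        non_null.sort(key=lambda i: values[i], reverse=reverse)
        # Null placement is independent of the sort direction.
        order = nulls + non_null if null_placement == "at_start" else non_null + nulls
    return order


def take(columns: Dict[str, List[Any]], indices: Sequence[int]) -> Dict[str, List[Any]]:
    """Reorder every column through the index mapping."""
    return {name: [vals[i] for i in indices] for name, vals in columns.items()}


table = {"label": [1, 0, None, 1, 0], "text": ["b", "d", "x", "a", "c"]}
indices = sort_indices(table, [("label", True), ("text", False)])
sorted_table = take(table, indices)
print(indices)
print(sorted_table["label"])
```

Keeping the order as a separate index list means the underlying columns are never rewritten; only the mapping changes, which is the same idea as reordering rows through an indices mapping.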
Usage
Use Dataset.sort when you need to order data by column values, such as sorting by sequence length for efficient batching, organizing data chronologically, or creating deterministic orderings for reproducible experiments.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/arrow_dataset.py
- Lines: L4375-L4501
Signature
@transmit_format
@fingerprint_transform(inplace=False, ignore_kwargs=["load_from_cache_file", "indices_cache_file_name"])
def sort(
    self,
    column_names: Union[str, Sequence[str]],
    reverse: Union[bool, Sequence[bool]] = False,
    null_placement: str = "at_end",
    keep_in_memory: bool = False,
    load_from_cache_file: Optional[bool] = None,
    indices_cache_file_name: Optional[str] = None,
    writer_batch_size: Optional[int] = 1000,
    new_fingerprint: Optional[str] = None,
) -> "Dataset":
Import
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
sorted_ds = ds.sort("label")
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| column_names | Union[str, Sequence[str]] | Yes | Column name(s) to sort by. |
| reverse | Union[bool, Sequence[bool]] | No | Sort in descending order. A single bool applies to all columns; a list applies per column. Defaults to False. |
| null_placement | str | No | Place null values "at_start"/"first" or "at_end"/"last". Defaults to "at_end". |
| keep_in_memory | bool | No | Keep sorted indices in memory. Defaults to False. |
| load_from_cache_file | Optional[bool] | No | Use cached sorted indices if available. |
| indices_cache_file_name | Optional[str] | No | Cache file path for sorted indices. |
| writer_batch_size | Optional[int] | No | Rows per write operation. Defaults to 1000. |
| new_fingerprint | Optional[str] | No | The new fingerprint after transform. |
Outputs
| Name | Type | Description |
|---|---|---|
| return | Dataset | A new dataset sorted according to the specified column(s). |
Usage Examples
Basic Usage
from datasets import load_dataset
ds = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="validation")
print(ds["label"][:10])
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
# Sort by label (ascending)
sorted_ds = ds.sort("label")
print(sorted_ds["label"][:10])
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# Multi-column sort with mixed directions
another_sorted = ds.sort(["label", "text"], reverse=[True, False])
print(another_sorted["label"][:10])
# [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]