Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Duckdb Duckdb Adaptive Sorting

From Leeroopedia


Knowledge Sources
Domains Sorting, Algorithm
Last Updated 2026-02-07 12:00 GMT

Overview

A collection of high-performance sorting algorithms that adapt their behavior to the structure of input data, combining quicksort, radix sort, and merge sort strategies to achieve optimal performance across diverse data distributions.

Description

Adaptive sorting algorithms dynamically adjust their strategy based on properties of the input data, achieving better practical performance than algorithms that use a fixed strategy. DuckDB employs three complementary sorting approaches, each optimized for different scenarios.

Pattern-Defeating Quicksort (PDQSort) is a hybrid sorting algorithm that combines quicksort with insertion sort and heapsort. It detects and exploits patterns in the input: if the data is already nearly sorted, it performs fewer comparisons; if it detects an adversarial pattern (one that would cause O(n^2) quicksort behavior), it switches to heapsort to guarantee O(n log n) worst-case performance. PDQSort uses a novel pivot selection scheme and branch-avoidance techniques to outperform std::sort on modern hardware.

SKA Sort is a radix sort implementation optimized for modern CPUs. Unlike comparison-based sorts which have an O(n log n) lower bound, radix sort operates in O(n * k) time where k is the key size. SKA Sort uses American Flag Sort (in-place MSD radix sort) with optimizations for branch prediction and cache locality, making it the fastest option for sorting integer keys and fixed-width string prefixes.

VergeSort is an adaptive merge sort that detects existing runs (ascending or descending subsequences) in the input. By identifying and preserving these natural runs, it avoids redundant work on partially sorted data, approaching O(n) performance on nearly-sorted inputs while maintaining O(n log n) worst-case behavior.

Usage

DuckDB uses these sorting algorithms throughout its query execution engine. ORDER BY clauses, window function PARTITION BY and ORDER BY, merge joins, and top-N queries all rely on efficient sorting. PDQSort serves as the general-purpose comparison sort. SKA Sort handles integer and fixed-width key sorting in the radix-partitioned sort used for large ORDER BY operations. VergeSort is used when data may already be partially ordered, such as when sorting pre-grouped results.

Theoretical Basis

PDQSort: Pattern-defeating quicksort with fallback strategies:

function pdqsort(array, begin, end, bad_allowed):
    while end - begin > insertion_sort_threshold:
        // Choose pivot using median-of-3 or Tukey's ninther
        pivot = choose_pivot(array, begin, end)

        // Partition
        (pivot_pos, already_partitioned) = partition(array, begin, end, pivot)

        // Detect bad partitions (highly unbalanced)
        if not balanced(pivot_pos, begin, end):
            bad_allowed -= 1
            if bad_allowed == 0:
                heapsort(array, begin, end)   // guaranteed O(n log n)
                return

        // Detect patterns
        if already_partitioned:
            // Input may be sorted; try insertion sort
            if partial_insertion_sort(array, begin, pivot_pos):
                partial_insertion_sort(array, pivot_pos + 1, end)
                return

        // Recurse on smaller partition, loop on larger
        pdqsort(array, begin, pivot_pos, bad_allowed)
        begin = pivot_pos + 1
    insertion_sort(array, begin, end)

SKA Sort (American Flag Sort): In-place MSD radix sort:

function ska_sort(array, begin, end, byte_index):
    if end - begin < threshold:
        pdqsort(array, begin, end)
        return

    // Count occurrences of each byte value
    counts[256] = count_byte_values(array, begin, end, byte_index)

    // Compute bucket boundaries
    offsets = prefix_sum(counts)

    // Permute elements into buckets (in-place)
    for each bucket b:
        while offsets[b] < offsets[b + 1]:
            target = get_byte(array[offsets[b]], byte_index)
            if target == b: offsets[b]++
            else: swap(array[offsets[b]], array[offsets[target]])

    // Recurse on each non-empty bucket
    for each bucket b with count > 1:
        ska_sort(bucket_b, byte_index + 1)

VergeSort: Adaptive merge sort exploiting existing runs:

function vergesort(array):
    // Phase 1: Detect natural runs
    runs = []
    pos = 0
    while pos < len(array):
        run_end = find_run(array, pos)   // ascending or descending
        if descending: reverse(array[pos..run_end])
        runs.append((pos, run_end))
        pos = run_end

    // Phase 2: Merge runs (bottom-up merge sort)
    while len(runs) > 1:
        new_runs = []
        for i in 0..len(runs)-1 step 2:
            if i + 1 < len(runs):
                merge(runs[i], runs[i+1])
                new_runs.append(merged)
            else:
                new_runs.append(runs[i])
        runs = new_runs

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment