
Principle:Apache Hudi Split Discovery And Pruning

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Stream_Processing
Last Updated 2026-02-08 00:00 GMT

Overview

Identifying which files on storage are relevant to a query and eliminating irrelevant files before any data is read.

Description

After the query type has been determined and the source operator topology has been selected, the next step is split discovery: enumerating the files that compose the table and determining which ones contain data relevant to the query. This is followed by pruning: eliminating files that provably cannot contain matching records, thereby reducing I/O.

Split discovery and pruning operate at multiple granularity levels:

  1. Partition pruning: If the table is partitioned and the query contains predicates on partition columns, entire partitions can be excluded without inspecting individual files. This leverages the directory structure of the table (e.g., dt=2024-01-01/).
  2. Bucket pruning: For bucket-indexed tables, the file ID encodes the bucket assignment. If the query targets a specific key, only files belonging to the matching bucket need to be read.
  3. Record-level index pruning: If a record-level index is available, it can narrow down the candidate file slices to those that contain the specific records of interest.
  4. Column statistics pruning (data skipping): File-level column statistics (min/max values, null counts) stored in the metadata table allow the system to skip files whose value ranges do not overlap with the query predicates. This is the finest-grained pruning available without reading the actual data.
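The finest-grained stage, column statistics pruning, can be sketched as a simple range-overlap test. This is a minimal illustration of the idea, not Hudi's actual metadata-table API; the `FileStats` class and `can_skip` function are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    """Per-file min/max stats for one column, as a column stats index would store them."""
    file_name: str
    min_val: int
    max_val: int

def can_skip(stats: FileStats, lo: int, hi: int) -> bool:
    """A file can be skipped iff its [min, max] range cannot overlap the predicate [lo, hi]."""
    return stats.max_val < lo or stats.min_val > hi

files = [
    FileStats("f1.parquet", 0, 99),
    FileStats("f2.parquet", 100, 199),
    FileStats("f3.parquet", 200, 299),
]

# Predicate: value BETWEEN 120 AND 180 -- only f2's range overlaps.
candidates = [f.file_name for f in files if not can_skip(f, 120, 180)]
# candidates == ["f2.parquet"]
```

Note that the test consults only the stats, never the Parquet data itself, which is what makes data skipping cheap relative to the I/O it avoids.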

The result of this step is a pruned set of file slices -- each representing a base Parquet file and its associated log files (for merge-on-read tables) within a specific partition.
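The file-slice result can be pictured as a small record type. This is an illustrative sketch only; Hudi's actual `FileSlice` is a Java class with different fields and naming conventions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FileSlice:
    """One pruned unit of work: a base file plus its delta logs in one partition."""
    partition_path: str                # e.g. "dt=2024-01-01"
    file_id: str                       # stable identifier of the file group
    base_file: str                     # base Parquet file
    log_files: List[str] = field(default_factory=list)  # log files (merge-on-read only)

slice_ = FileSlice(
    partition_path="dt=2024-01-01",
    file_id="fg-001",
    base_file="fg-001_20240101.parquet",
    log_files=[".fg-001_20240101.log.1"],
)
```

For copy-on-write tables the `log_files` list is simply empty, so the same shape covers both table types.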

Usage

Use split discovery and pruning whenever a query touches a subset of the data in a large table. It is most effective when:

  • The table is partitioned and queries commonly filter on partition columns
  • The metadata table is enabled, providing column statistics and record-level indexes
  • The table uses bucket indexing and queries filter on the key column
  • Data skipping is enabled (read.data-skipping.enabled = true) and the metadata table contains column stats

Even for full-table scans, the discovery step is necessary to enumerate all files; pruning simply becomes a no-op in that case.
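The first Usage condition, partition pruning against directory-style partition paths, can be sketched as a filter over path strings. The `partition_value` helper is a hypothetical stand-in for a query planner's pushed-down partition predicate.

```python
partitions = ["dt=2024-01-01", "dt=2024-01-02", "dt=2024-01-03"]

def partition_value(path: str) -> str:
    """Extract the value from a Hive-style 'col=value' partition path."""
    return path.split("=", 1)[1]

# Query filter: dt >= '2024-01-02'. Excluded partitions are never listed,
# so none of their files are ever enumerated, let alone read.
pruned = [p for p in partitions if partition_value(p) >= "2024-01-02"]
# pruned == ["dt=2024-01-02", "dt=2024-01-03"]
```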

Theoretical Basis

Split discovery and pruning implements a multi-stage filter cascade. Each stage progressively narrows the candidate set, and the order is chosen to maximize the elimination rate at minimal cost:

function discoverAndPrune(tablePath, partitionColumns, filters):
    // Stage 1: Enumerate all partition paths (metadata table or filesystem listing)
    allPartitions = listPartitions(tablePath)

    // Stage 2: Partition pruning
    if partitionPruner is available:
        candidatePartitions = partitionPruner.filter(allPartitions)
    else:
        candidatePartitions = allPartitions

    // Stage 3: List files in candidate partitions
    fileSlices = listFileSlices(candidatePartitions)

    // Stage 4: Bucket pruning
    if bucketIdFunction is available:
        fileSlices = filterByBucketId(fileSlices, bucketIdFunction)

    // Stage 5: Record-level index pruning
    if recordLevelIndex is available:
        fileSlices = recordLevelIndex.computeCandidates(fileSlices)

    // Stage 6: Column statistics pruning (data skipping)
    if columnStatsIndex is available AND dataSkippingEnabled:
        allFileNames = extractFileNames(fileSlices)
        candidateFileNames = columnStatsIndex.computeCandidates(colStatsProbe, allFileNames)
        fileSlices = filterByFileNames(fileSlices, candidateFileNames)

    return fileSlices
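Stage 4 of the cascade, bucket pruning, can be sketched in a few lines. As the Description notes, the file ID encodes the bucket assignment; the hash function and file-naming scheme below are illustrative, not Hudi's exact ones.

```python
NUM_BUCKETS = 4

def bucket_of(key: str) -> int:
    """Deterministic stand-in hash; Hudi uses its own key-hashing scheme."""
    return sum(ord(c) for c in key) % NUM_BUCKETS

# File names whose leading digits encode the bucket ID.
files = [f"{b:08d}-uuid.parquet" for b in range(NUM_BUCKETS)]

def prune_by_bucket(files, key):
    """For an equality predicate on the key, keep only the matching bucket's files."""
    target = bucket_of(key)
    return [f for f in files if int(f.split("-")[0]) == target]
```

Because the bucket is computable from the query's key predicate alone, this stage needs no index lookup at all, which is why it sits early in the cascade.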

The theoretical foundation is predicate pushdown through index structures. Each pruning stage applies the query predicates against progressively finer-grained index structures (partition paths, bucket IDs, record indexes, column statistics). The key invariant is soundness: pruning may leave false positives (files that turn out not to contain matching records) but must never produce false negatives (files that do contain matching records must never be pruned).

The effectiveness of the cascade depends on selectivity: highly selective predicates on indexed columns yield the greatest benefit. The cost of each stage is proportional to the size of the index structure, not the size of the data, making it a sublinear operation.
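The soundness invariant can be checked by brute force on small inputs: min/max pruning may keep files with no matches, but must never skip a file that contains one. This is an illustrative property check, not part of any real test suite.

```python
def can_skip(file_values, lo, hi):
    """Min/max range test: skip iff the file's value range cannot overlap [lo, hi]."""
    return max(file_values) < lo or min(file_values) > hi

files = [[1, 5, 9], [10, 14], [20, 25, 30]]

for lo, hi in [(0, 4), (6, 12), (26, 40), (15, 19)]:
    for vals in files:
        has_match = any(lo <= v <= hi for v in vals)
        # Soundness: a file containing a matching record is never skipped.
        assert not (has_match and can_skip(vals, lo, hi))
```

The converse does not hold: `can_skip([1, 5, 9], 2, 4)` is `False` even though the file holds no value in [2, 4], which is exactly the permitted false-positive case.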

Related Pages

Implemented By
