Implementation: Apache Hudi FileIndex getFilesInPartitions
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Stream_Processing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool for enumerating and pruning files in a Hudi table's partitions using metadata-driven indexing, provided by Apache Hudi.
Description
The FileIndex class is a serializable file index that supports efficient file listing through the Hudi metadata table. It caches partition paths to avoid redundant lookups and provides multi-stage pruning through partition pruning, bucket pruning, record-level index pruning, and column statistics-based data skipping.
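The partition-path caching mentioned above is a straightforward lazy-initialization pattern. The sketch below illustrates the idea with a hypothetical PartitionLister class and a placeholder loadPartitionPathsFromMetadata() method; it is not Hudi's actual implementation.

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch of caching partition paths to avoid redundant lookups.
// PartitionLister and loadPartitionPathsFromMetadata() are hypothetical
// stand-ins, not Hudi API.
class PartitionLister {
  private List<String> cachedPartitionPaths; // null until first lookup

  // Lazily build and cache the partition path list so repeated calls
  // do not hit the (potentially remote) metadata table again.
  public List<String> getOrBuildPartitionPaths() {
    if (cachedPartitionPaths == null) {
      cachedPartitionPaths = loadPartitionPathsFromMetadata();
    }
    return cachedPartitionPaths;
  }

  private List<String> loadPartitionPathsFromMetadata() {
    // Placeholder for a metadata-table listing call.
    return Arrays.asList("dt=2024-01-01", "dt=2024-01-02");
  }
}
```

Because the index is serializable and may be consulted several times during planning, caching the listing once per instance keeps repeated pruning passes cheap.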
The getFilesInPartitions() method returns a flat list of StoragePathInfo objects representing all files in the pruned set of partitions. It first obtains or builds the list of partition paths (applying partition pruning if a PartitionPruner is configured), then delegates to FSUtils.getFilesInPartitions() to list the actual files in those partitions via the Hudi engine context and metadata configuration.
The companion method filterFileSlices() accepts a list of FileSlice objects and applies three additional pruning stages: bucket pruning (filtering by bucket ID encoded in file IDs), record-level index pruning (narrowing candidates via the record index), and column statistics pruning (skipping files whose column stats do not overlap with query predicates).
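The three pruning stages compose naturally as successive filters over the candidate slices. The following is an illustrative sketch only, using file-ID strings and generic predicates in place of Hudi's FileSlice and index lookups; the names are assumptions, not the real API.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Illustrative sketch of three-stage slice pruning. The stage predicates
// stand in for Hudi's bucket, record-index, and column-stats checks.
class SlicePruner {
  static List<String> filterFileSlices(
      List<String> fileIds,
      Predicate<String> bucketPrune,       // keep slices whose file ID encodes the target bucket
      Predicate<String> recordIndexPrune,  // keep candidates returned by the record-level index
      Predicate<String> colStatsPrune) {   // keep slices whose column stats overlap the predicates
    return fileIds.stream()
        .filter(bucketPrune)
        .filter(recordIndexPrune)
        .filter(colStatsPrune)
        .collect(Collectors.toList());
  }
}
```

Ordering matters in practice: bucket pruning is a cheap string check on the file ID, while record-index and column-stats pruning require metadata lookups, so the inexpensive stage runs first.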
The ExpressionPredicates class converts Flink ResolvedExpression objects into internal Predicate representations that can be evaluated against column statistics. It supports comparison operators (=, !=, <, >, <=, >=), logical operators (AND, OR, NOT), IN lists, and null checks.
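The core test behind column-stats data skipping is a min/max overlap check: a file can be skipped for a predicate like col > v only when its recorded maximum is at most v. A minimal sketch, assuming a hypothetical ColumnRange record rather than Hudi's internal stats representation:

```java
// Sketch of the overlap test behind column-stats data skipping.
// ColumnRange is an illustrative type, not Hudi's internal representation.
class ColStatsSkip {
  record ColumnRange(long min, long max) {}

  // true  -> the file may contain matching rows and must be read
  // false -> the file provably contains no row with col > value
  static boolean mayMatchGreaterThan(ColumnRange stats, long value) {
    return stats.max() > value;
  }
}
```

Each supported operator (=, !=, <, >, <=, >=, IN, null checks) reduces to a similar comparison against the per-file min/max, with logical operators combining the per-predicate results.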
Usage
Use the FileIndex when constructing the Hudi Flink source to discover which files need to be read. It is typically built via its Builder and configured with:
- The table base path and Hadoop configuration
- A PartitionPruner derived from pushed-down partition filters
- A ColumnStatsProbe containing evaluators for data skipping
- A bucket ID function for bucket-indexed tables
- The table's RowType for column stats interpretation
Code Reference
Source Location
- Repository: Apache Hudi
- File: hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/FileIndex.java (Lines 60-290)
- Also: hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java (Lines 50-400)
Signature
/**
 * Return all files in the filtered partitions.
 */
public List<StoragePathInfo> getFilesInPartitions() {
  if (!tableExists) {
    return Collections.emptyList();
  }
  String[] partitions = getOrBuildPartitionPaths().stream()
      .map(p -> fullPartitionPath(path, p))
      .toArray(String[]::new);
  if (partitions.length < 1) {
    return Collections.emptyList();
  }
  Map<String, List<StoragePathInfo>> filesInPartitions = FSUtils.getFilesInPartitions(
      new HoodieFlinkEngineContext(hadoopConf), metaClient, metadataConfig, partitions);
  return filesInPartitions.values().stream()
      .flatMap(Collection::stream)
      .collect(Collectors.toList());
}
/**
* Filter file slices by pruning based on bucket id and column stats.
*/
public List<FileSlice> filterFileSlices(List<FileSlice> fileSlices);
Import
import org.apache.hudi.source.FileIndex;
import org.apache.hudi.source.ExpressionPredicates;
import org.apache.hudi.common.model.FileSlice;
import org.apache.hudi.storage.StoragePathInfo;
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | StoragePath | Yes (constructor) | Base path of the Hudi table on storage. |
| conf | Configuration | Yes (constructor via Builder) | Flink configuration with Hudi connector options, including READ_DATA_SKIPPING_ENABLED and metadata table settings. |
| rowType | RowType | Yes (constructor via Builder) | Flink logical row type for the table, used for column stats interpretation. |
| metaClient | HoodieTableMetaClient | Yes (constructor via Builder) | Hudi table metadata client for timeline and metadata table access. |
| colStatsProbe | ColumnStatsProbe | No (constructor via Builder) | Evaluators for column statistics-based data skipping. Only effective when data skipping is enabled and the metadata table is active. |
| partitionPruner | PartitionPruners.PartitionPruner | No (constructor via Builder) | Partition pruner derived from pushed-down filter predicates on partition columns. |
| partitionBucketIdFunc | Function<String, Integer> | No (constructor via Builder) | Function that maps a partition path to a target bucket ID for bucket pruning. |
Outputs
| Name | Type | Description |
|---|---|---|
| return value (getFilesInPartitions) | List<StoragePathInfo> | Flat list of storage path information objects for all files in the pruned partitions. Empty list if the table does not exist or has no matching partitions. |
| return value (filterFileSlices) | List<FileSlice> | Pruned list of file slices after applying bucket pruning, record-level index pruning, and column statistics pruning. |
Usage Examples
import org.apache.hudi.source.FileIndex;
import org.apache.hudi.common.model.FileSlice;
import org.apache.hudi.storage.StoragePathInfo;
// Build the FileIndex with partition pruning and data skipping
FileIndex fileIndex = FileIndex.builder()
.path(tablePath)
.conf(conf)
.rowType(rowType)
.metaClient(metaClient)
.partitionPruner(partitionPruner)
.colStatsProbe(columnStatsProbe)
.build();
// Step 1: Discover files in pruned partitions
List<StoragePathInfo> files = fileIndex.getFilesInPartitions();
// files contains only files in partitions that match the partition predicate
// Step 2: Given file slices from the file system view, apply further pruning
List<FileSlice> allSlices = getFileSlicesFromView();
List<FileSlice> prunedSlices = fileIndex.filterFileSlices(allSlices);
// prunedSlices has bucket-pruned, record-index-pruned, and column-stats-pruned slices