Implementation: Apache Hudi FileIndex getFilesInPartitions
| Knowledge Sources | |
|---|---|
| Domains | Data_Lake, Stream_Processing |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Concrete tool for enumerating and pruning files in a Hudi table's partitions using metadata-driven indexing, provided by Apache Hudi.
Description
The FileIndex class is a serializable file index that supports efficient file listing through the Hudi metadata table. It caches partition paths to avoid redundant lookups and provides multi-stage pruning through partition pruning, bucket pruning, record-level index pruning, and column statistics-based data skipping.
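The partition-path caching mentioned above is a straightforward lazy-initialization pattern. The sketch below illustrates the idea with a hypothetical PartitionLister class and a placeholder loadPartitionPathsFromMetadata() method; it is not Hudi's actual implementation.

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch of caching partition paths to avoid redundant lookups.
// PartitionLister and loadPartitionPathsFromMetadata() are hypothetical
// stand-ins, not Hudi API.
class PartitionLister {
  private List<String> cachedPartitionPaths; // null until first lookup

  // Lazily build and cache the partition path list so repeated calls
  // do not hit the (potentially remote) metadata table again.
  public List<String> getOrBuildPartitionPaths() {
    if (cachedPartitionPaths == null) {
      cachedPartitionPaths = loadPartitionPathsFromMetadata();
    }
    return cachedPartitionPaths;
  }

  private List<String> loadPartitionPathsFromMetadata() {
    // Placeholder for a metadata-table listing call.
    return Arrays.asList("dt=2024-01-01", "dt=2024-01-02");
  }
}
```

Because the index is serializable and may be consulted several times during planning, caching the listing once per instance keeps repeated pruning passes cheap.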
The getFilesInPartitions() method returns a flat list of StoragePathInfo objects representing all files in the pruned set of partitions. It first obtains or builds the list of partition paths (applying partition pruning if a PartitionPruner is configured), then delegates to FSUtils.getFilesInPartitions() to list the actual files in those partitions via the Hudi engine context and metadata configuration.
The companion method filterFileSlices() accepts a list of FileSlice objects and applies three additional pruning stages: bucket pruning (filtering by bucket ID encoded in file IDs), record-level index pruning (narrowing candidates via the record index), and column statistics pruning (skipping files whose column stats do not overlap with query predicates).
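The three pruning stages compose naturally as successive filters over the candidate slices. The following is an illustrative sketch only, using file-ID strings and generic predicates in place of Hudi's FileSlice and index lookups; the names are assumptions, not the real API.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Illustrative sketch of three-stage slice pruning. The stage predicates
// stand in for Hudi's bucket, record-index, and column-stats checks.
class SlicePruner {
  static List<String> filterFileSlices(
      List<String> fileIds,
      Predicate<String> bucketPrune,       // keep slices whose file ID encodes the target bucket
      Predicate<String> recordIndexPrune,  // keep candidates returned by the record-level index
      Predicate<String> colStatsPrune) {   // keep slices whose column stats overlap the predicates
    return fileIds.stream()
        .filter(bucketPrune)
        .filter(recordIndexPrune)
        .filter(colStatsPrune)
        .collect(Collectors.toList());
  }
}
```

Ordering matters in practice: bucket pruning is a cheap string check on the file ID, while record-index and column-stats pruning require metadata lookups, so the inexpensive stage runs first.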
The ExpressionPredicates class converts Flink ResolvedExpression objects into internal Predicate representations that can be evaluated against column statistics. It supports comparison operators (=, !=, <, >, <=, >=), logical operators (AND, OR, NOT), IN lists, and null checks.
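The core test behind column-stats data skipping is a min/max overlap check: a file can be skipped for a predicate like col > v only when its recorded maximum is at most v. A minimal sketch, assuming a hypothetical ColumnRange record rather than Hudi's internal stats representation:

```java
// Sketch of the overlap test behind column-stats data skipping.
// ColumnRange is an illustrative type, not Hudi's internal representation.
class ColStatsSkip {
  record ColumnRange(long min, long max) {}

  // true  -> the file may contain matching rows and must be read
  // false -> the file provably contains no row with col > value
  static boolean mayMatchGreaterThan(ColumnRange stats, long value) {
    return stats.max() > value;
  }
}
```

Each supported operator (=, !=, <, >, <=, >=, IN, null checks) reduces to a similar comparison against the per-file min/max, with logical operators combining the per-predicate results.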
Usage
Use the FileIndex when constructing the Hudi Flink source to discover which files need to be read. It is typically built via its Builder and configured with:
- The table base path and Hadoop configuration
- A PartitionPruner derived from pushed-down partition filters
- A ColumnStatsProbe containing evaluators for data skipping
- A bucket ID function for bucket-indexed tables
- The table's RowType for column stats interpretation
Code Reference
Source Location
- Repository: Apache Hudi
- File: hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/FileIndex.java (Lines 60-290)
- Also: hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java (Lines 50-400)
Signature
/**
 * Return all files in the filtered partitions.
 */
public List<StoragePathInfo> getFilesInPartitions() {
  if (!tableExists) {
    return Collections.emptyList();
  }
  String[] partitions = getOrBuildPartitionPaths().stream()
      .map(p -> fullPartitionPath(path, p))
      .toArray(String[]::new);
  if (partitions.length < 1) {
    return Collections.emptyList();
  }
  Map<String, List<StoragePathInfo>> filesInPartitions = FSUtils.getFilesInPartitions(
      new HoodieFlinkEngineContext(hadoopConf), metaClient, metadataConfig, partitions);
  return filesInPartitions.values().stream()
      .flatMap(Collection::stream)
      .collect(Collectors.toList());
}
/**
* Filter file slices by pruning based on bucket id and column stats.
*/
public List<FileSlice> filterFileSlices(List<FileSlice> fileSlices);
Import
import org.apache.hudi.source.FileIndex;
import org.apache.hudi.source.ExpressionPredicates;
import org.apache.hudi.common.model.FileSlice;
import org.apache.hudi.storage.StoragePathInfo;
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | StoragePath | Yes (constructor) | Base path of the Hudi table on storage. |
| conf | Configuration | Yes (constructor via Builder) | Flink configuration with Hudi connector options, including READ_DATA_SKIPPING_ENABLED and metadata table settings. |
| rowType | RowType | Yes (constructor via Builder) | Flink logical row type for the table, used for column stats interpretation. |
| metaClient | HoodieTableMetaClient | Yes (constructor via Builder) | Hudi table metadata client for timeline and metadata table access. |
| colStatsProbe | ColumnStatsProbe | No (constructor via Builder) | Evaluators for column statistics-based data skipping. Only effective when data skipping is enabled and the metadata table is active. |
| partitionPruner | PartitionPruners.PartitionPruner | No (constructor via Builder) | Partition pruner derived from pushed-down filter predicates on partition columns. |
| partitionBucketIdFunc | Function<String, Integer> | No (constructor via Builder) | Function that maps a partition path to a target bucket ID for bucket pruning. |
Outputs
| Name | Type | Description |
|---|---|---|
| return value (getFilesInPartitions) | List<StoragePathInfo> | Flat list of storage path information objects for all files in the pruned partitions. Empty list if the table does not exist or has no matching partitions. |
| return value (filterFileSlices) | List<FileSlice> | Pruned list of file slices after applying bucket pruning, record-level index pruning, and column statistics pruning. |
Usage Examples
import org.apache.hudi.source.FileIndex;
import org.apache.hudi.common.model.FileSlice;
import org.apache.hudi.storage.StoragePathInfo;
// Build the FileIndex with partition pruning and data skipping
FileIndex fileIndex = FileIndex.builder()
.path(tablePath)
.conf(conf)
.rowType(rowType)
.metaClient(metaClient)
.partitionPruner(partitionPruner)
.colStatsProbe(columnStatsProbe)
.build();
// Step 1: Discover files in pruned partitions
List<StoragePathInfo> files = fileIndex.getFilesInPartitions();
// files contains only files in partitions that match the partition predicate
// Step 2: Given file slices from the file system view, apply further pruning
List<FileSlice> allSlices = getFileSlicesFromView();
List<FileSlice> prunedSlices = fileIndex.filterFileSlices(allSlices);
// prunedSlices has bucket-pruned, record-index-pruned, and column-stats-pruned slices