
Implementation:Apache Hudi FileIndex GetFilesInPartitions

From Leeroopedia


Knowledge Sources
Domains Data_Lake, Stream_Processing
Last Updated 2026-02-08 00:00 GMT

Overview

Concrete tool, provided by Apache Hudi, for enumerating and pruning files in a Hudi table's partitions using metadata-driven indexing.

Description

The FileIndex class is a serializable file index that supports efficient file listing through the Hudi metadata table. It caches partition paths to avoid redundant lookups and provides multi-stage pruning through partition pruning, bucket pruning, record-level index pruning, and column statistics-based data skipping.

The getFilesInPartitions() method returns a flat list of StoragePathInfo objects representing all files in the pruned set of partitions. It first obtains or builds the list of partition paths (applying partition pruning if a PartitionPruner is configured), then delegates to FSUtils.getFilesInPartitions() to list the actual files in those partitions via the Hudi engine context and metadata configuration.
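The first step described above, expanding the (possibly pruned) relative partition paths into full storage paths under the table base path, can be sketched as follows. This is a minimal illustration, not Hudi's implementation: `joinPath()` is a hypothetical helper standing in for the real `fullPartitionPath()`/`StoragePath` handling.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch of the partition-path expansion step of
// getFilesInPartitions(): relative partition paths are joined onto the
// table base path before file listing. joinPath() is a hypothetical
// stand-in for Hudi's fullPartitionPath() utility.
public class PartitionPathExpansion {

    static String joinPath(String basePath, String partition) {
        if (partition.isEmpty()) {
            return basePath; // non-partitioned table: list under the base path itself
        }
        return basePath.endsWith("/") ? basePath + partition : basePath + "/" + partition;
    }

    public static void main(String[] args) {
        String base = "s3://bucket/warehouse/hudi_table";
        // Suppose the PartitionPruner kept only two of the table's partitions:
        List<String> pruned = List.of("dt=2024-01-01", "dt=2024-01-02");
        List<String> full = pruned.stream()
            .map(p -> joinPath(base, p))
            .collect(Collectors.toList());
        full.forEach(System.out::println);
    }
}
```

In the real method, the resulting array of full paths is handed to `FSUtils.getFilesInPartitions()` along with the engine context and metadata configuration, which performs the actual listing (via the metadata table when enabled).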

The companion method filterFileSlices() accepts a list of FileSlice objects and applies three additional pruning stages: bucket pruning (filtering by bucket ID encoded in file IDs), record-level index pruning (narrowing candidates via the record index), and column statistics pruning (skipping files whose column stats do not overlap with query predicates).
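The three sequential pruning stages can be sketched as a filter chain over a simplified slice model. This is a hedged illustration under stated assumptions: `Slice` is a hypothetical stand-in for Hudi's `FileSlice`, and the bucket predicate, candidate-file set, and key range play the roles of the bucket ID function, record-level index lookup, and column statistics respectively.

```java
import java.util.List;
import java.util.Set;
import java.util.function.IntPredicate;
import java.util.stream.Collectors;

// Minimal sketch of the three pruning stages applied by filterFileSlices().
// Slice is a simplified stand-in for Hudi's FileSlice.
public class PruningSketch {

    record Slice(String fileId, int bucketId, long minKey, long maxKey) {}

    // Stage 1: bucket pruning - keep only slices in the target bucket.
    // Stage 2: record-level index pruning - keep slices whose file id is
    //          among the candidates reported by the record index.
    // Stage 3: column-stats pruning - keep slices whose [min, max] range
    //          can contain the queried key.
    static List<Slice> prune(List<Slice> slices,
                             IntPredicate bucketMatches,
                             Set<String> candidateFileIds,
                             long queriedKey) {
        return slices.stream()
            .filter(s -> bucketMatches.test(s.bucketId()))
            .filter(s -> candidateFileIds.contains(s.fileId()))
            .filter(s -> s.minKey() <= queriedKey && queriedKey <= s.maxKey())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Slice> all = List.of(
            new Slice("f1", 0, 0, 100),
            new Slice("f2", 1, 0, 100),   // wrong bucket
            new Slice("f3", 0, 200, 300), // stats cannot contain key 42
            new Slice("f4", 0, 0, 100));  // not in record-index candidates
        List<Slice> kept = prune(all, b -> b == 0, Set.of("f1", "f3"), 42L);
        System.out.println(kept.size()); // only f1 survives all three stages
    }
}
```

Each stage only ever removes slices, so the stages compose safely in any order; ordering them cheapest-first (bucket check before index and stats lookups) keeps the filter inexpensive.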

The ExpressionPredicates class converts Flink ResolvedExpression objects into internal Predicate representations that can be evaluated against column statistics. It supports comparison operators (=, !=, <, >, <=, >=), logical operators (AND, OR, NOT), IN lists, and null checks.
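The core idea behind evaluating such predicates against column statistics can be sketched as an interval test: a file may be skipped only when its per-column [min, max] range provably excludes every matching row. The `Operator` enum and `canMatch()` method below are illustrative, not Hudi's actual `ExpressionPredicates` API.

```java
// Hedged sketch of comparison-predicate evaluation against per-file
// column statistics. canMatch() returning false means the file can be
// safely skipped for the predicate "column <op> literal".
public class ColumnStatsSkipping {

    enum Operator { EQ, NE, LT, GT, LE, GE }

    // Returns true when a file with the given min/max MIGHT contain a
    // matching row (i.e. it must be read); false means safe to skip.
    static boolean canMatch(Operator op, long min, long max, long literal) {
        return switch (op) {
            case EQ -> min <= literal && literal <= max;
            case NE -> !(min == literal && max == literal); // only an all-equal file is skippable
            case LT -> min < literal;  // some value below the literal may exist
            case GT -> max > literal;
            case LE -> min <= literal;
            case GE -> max >= literal;
        };
    }

    public static void main(String[] args) {
        // File whose column range is [10, 50]:
        System.out.println(canMatch(Operator.EQ, 10, 50, 60)); // false -> skip file
        System.out.println(canMatch(Operator.GT, 10, 50, 20)); // true  -> must read
    }
}
```

Logical operators compose naturally on top of this: AND may skip when either side can skip, OR only when both sides can, and an IN list behaves like a disjunction of equality tests.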

Usage

Use the FileIndex when constructing the Hudi Flink source to discover which files need to be read. It is typically built via its Builder and configured with:

  • The table base path and Hadoop configuration
  • A PartitionPruner derived from pushed-down partition filters
  • A ColumnStatsProbe containing evaluators for data skipping
  • A bucket ID function for bucket-indexed tables
  • The table's RowType for column stats interpretation

Code Reference

Source Location

  • Repository: Apache Hudi
  • File: hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/FileIndex.java
  • Lines: 60-290
  • Also: hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/ExpressionPredicates.java (Lines 50-400)

Signature

/**
 * Return all files in the filtered partitions.
 */
public List<StoragePathInfo> getFilesInPartitions() {
    if (!tableExists) {
        return Collections.emptyList();
    }
    String[] partitions =
        getOrBuildPartitionPaths().stream()
            .map(p -> fullPartitionPath(path, p))
            .toArray(String[]::new);
    if (partitions.length < 1) {
        return Collections.emptyList();
    }
    Map<String, List<StoragePathInfo>> filesInPartitions = FSUtils.getFilesInPartitions(
        new HoodieFlinkEngineContext(hadoopConf), metaClient, metadataConfig, partitions);
    return filesInPartitions.values().stream()
        .flatMap(Collection::stream)
        .collect(Collectors.toList());
}
/**
 * Filter file slices by pruning based on bucket id and column stats.
 */
public List<FileSlice> filterFileSlices(List<FileSlice> fileSlices);

Import

import org.apache.hudi.source.FileIndex;
import org.apache.hudi.source.ExpressionPredicates;
import org.apache.hudi.common.model.FileSlice;
import org.apache.hudi.storage.StoragePathInfo;

I/O Contract

Inputs

  • path (StoragePath, required, constructor): Base path of the Hudi table on storage.
  • conf (Configuration, required, via Builder): Flink configuration with Hudi connector options, including READ_DATA_SKIPPING_ENABLED and metadata table settings.
  • rowType (RowType, required, via Builder): Flink logical row type for the table, used for column stats interpretation.
  • metaClient (HoodieTableMetaClient, required, via Builder): Hudi table metadata client for timeline and metadata table access.
  • colStatsProbe (ColumnStatsProbe, optional, via Builder): Evaluators for column statistics-based data skipping. Only effective when data skipping is enabled and the metadata table is active.
  • partitionPruner (PartitionPruners.PartitionPruner, optional, via Builder): Partition pruner derived from pushed-down filter predicates on partition columns.
  • partitionBucketIdFunc (Function<String, Integer>, optional, via Builder): Function that maps a partition path to a target bucket ID for bucket pruning.

Outputs

  • getFilesInPartitions() returns List<StoragePathInfo>: Flat list of storage path information objects for all files in the pruned partitions. Empty list if the table does not exist or has no matching partitions.
  • filterFileSlices() returns List<FileSlice>: Pruned list of file slices after applying bucket pruning, record-level index pruning, and column statistics pruning.

Usage Examples

import org.apache.hudi.source.FileIndex;
import org.apache.hudi.common.model.FileSlice;
import org.apache.hudi.storage.StoragePathInfo;

// Build the FileIndex with partition pruning and data skipping
FileIndex fileIndex = FileIndex.builder()
    .path(tablePath)
    .conf(conf)
    .rowType(rowType)
    .metaClient(metaClient)
    .partitionPruner(partitionPruner)
    .colStatsProbe(columnStatsProbe)
    .build();

// Step 1: Discover files in pruned partitions
List<StoragePathInfo> files = fileIndex.getFilesInPartitions();
// files contains only files in partitions that match the partition predicate

// Step 2: Given file slices from the file system view, apply further pruning
List<FileSlice> allSlices = getFileSlicesFromView();
List<FileSlice> prunedSlices = fileIndex.filterFileSlices(allSlices);
// prunedSlices has bucket-pruned, record-index-pruned, and column-stats-pruned slices
