Heuristic:Apache Hudi Data Skipping Limitations
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Query_Performance |
| Last Updated | 2026-02-08 20:00 GMT |
Overview
Data skipping via Column Stats Index only applies to top-level columns, is incomplete for asynchronously clustered files, and treats missing columns as all-null for statistics.
Description
Data skipping allows Hudi to eliminate entire file groups from scan based on column-level min/max/null statistics stored in the Column Stats Index (CSI). However, this optimization has several non-obvious limitations. The CSI only stores statistics for top-level columns, so expressions on nested struct fields (e.g., struct.field > 0) cannot be pruned. The CSI is populated during clustering, which runs asynchronously and may not cover all base files. Missing column statistics are treated as all-null, which prevents pruning for those files. Additionally, schema evolution and config changes can create gaps in indexed columns across different files.
Usage
Apply this heuristic when relying on data skipping for query performance in Hudi tables. Understanding these limitations helps set realistic performance expectations and guides schema design decisions (e.g., avoiding deeply nested filter columns).
The Insight (Rule of Thumb)
- Top-level only: Only filter expressions on top-level columns benefit from data skipping. Nested field filters like
struct.field > 0are ineffective. - Metadata required: Data skipping requires
metadata.enabled=true. Without the metadata table, skipping is disabled entirely. - Incomplete coverage: CSI coverage depends on clustering. Newly written files that have not been clustered will not have column stats and cannot be skipped.
- Missing = null: Files without statistics for a column are assumed to have all-null values, which prevents any pruning for that column.
- Column alignment: After schema evolution, different files may have different sets of indexed columns, requiring careful alignment during query planning.
Reasoning
The Column Stats Index is bound to the clustering process because column statistics are generated when files are rewritten during clustering. Files that have only been through the streaming write path or compaction will not have CSI entries. This design is intentional: generating column stats on every write would add significant overhead to the write path.
The top-level column restriction exists because Parquet column statistics are tracked at the leaf column level in the file format, but Hudi's CSI aggregates these at the top-level column level. Sub-column statistics would require a different index structure.
The null-assumption for missing columns is a safe default: assuming null values means no data is incorrectly skipped (false negatives), at the cost of reading some files unnecessarily (false positives).
Code Evidence
Data skipping limitation note from FileIndex.java:257-262:
// NOTE: Data Skipping is only effective when it references columns that are indexed w/in
// the Column Stats Index (CSI). Following cases could not be effectively handled
// by Data Skipping:
// - Expressions on top-level column's fields (ie, for ex filters like
// "struct.field > 0", since CSI only contains stats for top-level columns)
// - Any expression not directly referencing top-level column
Incomplete CSI coverage note from FileStatsIndex.java:258-262:
// NOTE: We have to collect list of indexed columns to make sure we properly align the rows
// w/in the transposed dataset: since some files might not have all the columns indexed
// either due to the Column Stats Index config changes, schema evolution, etc. we have
// to make sure that all the rows w/in transposed data-frame are properly padded (with
// null values) for such file-column combinations
Null assumption from FileStatsIndex.java:389-391:
// NOTE: Since we're assuming missing column to essentially contain exclusively
// null values, we set null-count to be equal to value-count