Heuristic:Apache Hudi Record Level Index Optimization

From Leeroopedia




Knowledge Sources
Domains: Optimization, Indexing
Last Updated: 2026-02-08 20:00 GMT

Overview

Record Level Index (RLI) pruning is skipped, with nothing more than a warning log, when a query predicate yields more than 8 record keys. In addition, the lookup minibatch size has a hard floor of 1000 records, and the per-task cache needs tuning for large datasets.

Description

The Record Level Index is an optimization that lets point-lookup queries prune file groups by directly locating the files that contain specific record keys. However, the RLI has several non-obvious behavioral thresholds. First, RLI pruning is skipped when the number of keys extracted from a query predicate exceeds read.data.skipping.rli.keys.max.num (default 8); the query still runs, and the only signal is a warning in the logs. Second, the lookup minibatch size has a hard minimum of 1000 records that cannot be reduced. Third, the per-task cache defaults to 256 MB, which may be insufficient for large datasets and can lead to frequent evictions and degraded performance.

Usage

Apply this heuristic when using GLOBAL_RECORD_LEVEL_INDEX for point lookups or upsert operations. Understanding the silent skip thresholds prevents unexpected full-table scans on queries with many filter keys.

The Insight (Rule of Thumb)

  • RLI key threshold: Queries whose predicate yields more than 8 record keys skip RLI pruning and fall back to a scan without it; the skip is reported only as a warning log. A predicate with exactly 8 keys is still pruned. Increase read.data.skipping.rli.keys.max.num if your queries filter on more than 8 distinct record keys.
  • Minibatch floor: index.rli.lookup.minibatch.size cannot be set below 1000; a configured value below 1000 is silently replaced with 1000.
  • Cache sizing: index.rli.cache.size defaults to 256 MB per bucket-assign task. For datasets with millions of records, increase it to between 512 and 1024 MB.
  • Write buffer: index.rli.write.buffer.size defaults to 100 MB, and the buffer is flushed when it reaches that size. Increase it for high-throughput ingestion.
  • Prerequisites: GLOBAL_RECORD_LEVEL_INDEX requires metadata.enabled=true and index.global.enabled=true; table creation fails validation otherwise.
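
The settings in the list above can be collected into one place, as in the minimal sketch below. The option keys come from this page (index.type appears in the validation message quoted under Code Evidence). Everything else is an assumption made for illustration: the class RliOptionsSketch and its methods are hypothetical and not part of Hudi, the values 64, 512 and 200 are examples only, the size options are assumed to be expressed in MB, and an unset prerequisite is treated as disabled even though the real defaults may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; it is not part of Hudi. */
final class RliOptionsSketch {

  /** Builds table options that follow the rules of thumb listed above. */
  static Map<String, String> tunedOptions() {
    Map<String, String> opts = new LinkedHashMap<>();
    opts.put("index.type", "GLOBAL_RECORD_LEVEL_INDEX");
    opts.put("metadata.enabled", "true");      // prerequisite
    opts.put("index.global.enabled", "true");  // prerequisite
    opts.put("read.data.skipping.rli.keys.max.num", "64"); // example value, default is 8
    opts.put("index.rli.lookup.minibatch.size", "1000");   // values below 1000 are not honored
    opts.put("index.rli.cache.size", "512");         // assumed to be in MB
    opts.put("index.rli.write.buffer.size", "200");  // assumed to be in MB
    return opts;
  }

  /** Returns the prerequisite keys that are not set to "true" for the global RLI. */
  static List<String> missingPrerequisites(Map<String, String> opts) {
    List<String> missing = new ArrayList<>();
    if (!"GLOBAL_RECORD_LEVEL_INDEX".equals(opts.get("index.type"))) {
      return missing; // the checks apply only to this index type
    }
    for (String key : new String[] {"metadata.enabled", "index.global.enabled"}) {
      // Assumption: an unset option counts as disabled.
      if (!"true".equalsIgnoreCase(opts.get(key))) {
        missing.add(key);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    Map<String, String> opts = tunedOptions();
    opts.forEach((k, v) -> System.out.println("'" + k + "' = '" + v + "'"));
    System.out.println("missing prerequisites: " + missingPrerequisites(opts));
  }
}
```

Running a check like missingPrerequisites before submitting a job surfaces a missing prerequisite earlier than the table factory validation would.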

Reasoning

The 8-key threshold exists because RLI lookups go through the metadata table, so their cost grows with the number of keys queried; past some point a lookup can cost more than the pruning saves. The minibatch floor of 1000 ensures that individual record lookups are always batched for efficiency. Cache sizing directly affects lookup speed: if the cache is too small, the system repeatedly reads from the metadata table, negating the benefit of the index.
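
The effect of an undersized cache can be illustrated with a small simulation. This is a sketch under stated assumptions, not Hudi's cache: it uses a plain LRU map built on java.util.LinkedHashMap, measures capacity in entries rather than MB, and counts each miss as a stand-in for a metadata table read. The names RliCacheSketch, LruCache and countMisses are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical simulation; it does not reproduce Hudi's actual RLI cache. */
final class RliCacheSketch {

  /** Access-ordered LRU map that holds at most `capacity` entries. */
  static final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
      super(16, 0.75f, true); // true = order entries by access, not insertion
      this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
      return size() > capacity; // evict the least recently used entry
    }
  }

  /** Looks up keys 0..distinctKeys-1 in order, `rounds` times, and counts cache misses. */
  static int countMisses(int capacity, int distinctKeys, int rounds) {
    LruCache<Integer, String> cache = new LruCache<>(capacity);
    int misses = 0;
    for (int r = 0; r < rounds; r++) {
      for (int key = 0; key < distinctKeys; key++) {
        if (cache.get(key) == null) {
          misses++; // stands in for a read from the metadata table
          cache.put(key, "file-group-" + key);
        }
      }
    }
    return misses;
  }

  public static void main(String[] args) {
    System.out.println("cache fits working set: " + countMisses(100, 100, 3) + " misses");
    System.out.println("cache half the size:    " + countMisses(50, 100, 3) + " misses");
  }
}
```

When the cache holds the whole working set, only the first pass misses. With half the capacity and this access pattern, every lookup in every pass misses, which is the repeated-read behavior described above.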

The skip is particularly dangerous because the query still returns correct results, only with significantly worse performance. Nothing fails, so the problem is hard to diagnose unless the warning logs are monitored.

Code Evidence

Silent skip on key count from RecordLevelIndex.java:151-156:

int maxKeyNum = conf.get(FlinkOptions.READ_DATA_SKIPPING_RLI_KEYS_MAX_NUM);
if (hoodieKeysFromFilter.size() > maxKeyNum) {
  LOG.warn("The number of keys from query predicate: {} exceeds the upper threshold: {}, "
      + "skipping the rli pruning, the keys: {}",
      hoodieKeysFromFilter.size(), maxKeyNum, hoodieKeysFromFilter);
  return Option.empty();
}
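
The decision in the excerpt above can be mirrored in a self-contained sketch. The class and method names here are hypothetical, java.util.Optional stands in for Hudi's Option type, and the logging call is left out. The sketch shows that the comparison is strict, so a predicate with exactly the configured maximum number of keys is still eligible for pruning.

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;

/** Hypothetical mirror of the threshold check; it is not Hudi code. */
final class RliPruningSketch {

  /**
   * Returns the keys to look up in the index, or empty when the key count
   * exceeds the threshold and RLI pruning is skipped.
   */
  static Optional<List<String>> keysForRliLookup(List<String> keysFromFilter, int maxKeyNum) {
    if (keysFromFilter.size() > maxKeyNum) {
      // The original code logs a warning here; the query itself still runs.
      return Optional.empty();
    }
    return Optional.of(keysFromFilter);
  }

  public static void main(String[] args) {
    int maxKeyNum = 8; // default of read.data.skipping.rli.keys.max.num
    for (int n : new int[] {8, 9}) {
      List<String> keys = Collections.nCopies(n, "key");
      boolean pruned = keysForRliLookup(keys, maxKeyNum).isPresent();
      System.out.println(n + " keys, pruning applies: " + pruned);
    }
  }
}
```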

Minibatch configuration from FlinkOptions.java:301-309:

public static final ConfigOption<Integer> INDEX_RLI_LOOKUP_MINIBATCH_SIZE = ConfigOptions
    .key("index.rli.lookup.minibatch.size")
    .intType()
    .defaultValue(1000)
    .withDescription("...Default value is 1000, which is also the minimum value for the "
        + "minibatch size, when the configured size is less than 1000, the default value "
        + "will be used.");
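
The excerpt above shows only the option's description, not the code that applies the floor. A minimal sketch of the behavior the description implies is given below; the class MinibatchSizeSketch and its method are hypothetical, and the real implementation may apply the floor differently.

```java
/** Hypothetical sketch of the floor implied by the option description. */
final class MinibatchSizeSketch {

  /** Default and minimum value of index.rli.lookup.minibatch.size. */
  static final int MIN_MINIBATCH_SIZE = 1000;

  /** Returns the size that takes effect for a configured value. */
  static int effectiveMinibatchSize(int configured) {
    // A value below the minimum is replaced by the default, which equals the minimum.
    return configured < MIN_MINIBATCH_SIZE ? MIN_MINIBATCH_SIZE : configured;
  }

  public static void main(String[] args) {
    for (int configured : new int[] {200, 1000, 5000}) {
      System.out.println(configured + " -> " + effectiveMinibatchSize(configured));
    }
  }
}
```

Because the replacement produces no error, a configured value such as 200 looks accepted while 1000 is what takes effect.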

Index prerequisite validation from HoodieTableFactory.java:191-196:

if (indexType == HoodieIndex.IndexType.GLOBAL_RECORD_LEVEL_INDEX) {
  ValidationUtils.checkArgument(conf.get(FlinkOptions.METADATA_ENABLED),
      "Metadata table should be enabled when index.type is GLOBAL_RECORD_LEVEL_INDEX.");
  ValidationUtils.checkArgument(conf.get(FlinkOptions.INDEX_GLOBAL_ENABLED),
      "Partition level index updating is not supported for GLOBAL_RECORD_LEVEL_INDEX, "
      + "please set 'index.global.enabled' = 'true'.");
}
