Principle:Apache Hudi Query Type Definition

Knowledge Sources	Apache Hudi
Domains	Data_Lake, Stream_Processing
Last Updated	2026-02-08 00:00 GMT

Overview

Translating a declarative read request into an executable source operator topology that matches the query semantics.

Description

Once the read environment has been classified (snapshot, incremental, or streaming), the system must materialize that classification into a concrete execution plan. The query type definition step takes the abstract query intent and produces a scan runtime provider -- the physical operator graph that will actually read data from the lakehouse table.

This principle addresses several concerns:

Operator selection: Different query types require fundamentally different operator topologies. A bounded batch read can use a simple input format source. An incremental read requires split enumeration with commit range awareness. A streaming read requires a continuously monitoring source that tails the timeline for new commits.
Projection and filter pushdown: The scan context carries information about which columns are needed (projection pushdown) and which rows can be skipped (filter pushdown). These optimizations must be wired into the source operator at plan time, not at execution time.
API versioning: Modern stream processing frameworks offer multiple source API generations (e.g., legacy SourceFunction vs. FLIP-27 Source interface). The query type definition step selects the appropriate API surface based on configuration, ensuring backward compatibility while enabling new features.

The output of this step is a ScanRuntimeProvider -- an opaque factory that, when invoked by the framework, produces the actual DataStream of records. This indirection allows the planner to inspect properties of the provider (e.g., whether it is bounded) without eagerly constructing the execution graph.

Usage

Use this technique whenever a table source must support multiple read strategies through a single entry point. It is the natural bridge between the SQL/Table API layer (which speaks in terms of scans and filters) and the DataStream layer (which speaks in terms of operators and splits). Specific scenarios include:

Flink SQL queries such as SELECT * FROM hudi_table /*+ OPTIONS('read.start-commit'='...') */
Table API programs that read from a Hudi catalog table
Any connector that must present both bounded and unbounded behavior depending on configuration

Theoretical Basis

The query type definition follows the Abstract Factory pattern. The table source acts as a factory that produces a runtime provider, and the specific provider implementation depends on the query type:

function getScanRuntimeProvider(scanContext):
    queryType = resolveQueryType(configuration)
    projectedFields = scanContext.projectedFields
    filterPredicates = scanContext.filterPredicates

    provider = new ScanRuntimeProvider:
        isBounded = (queryType != STREAMING)

        produceDataStream(env):
            if configuration.useSourceV2:
                return buildFLIP27Source(env, queryType, projectedFields, filterPredicates)
            else:
                if queryType == STREAMING:
                    monitorFunc = new TimelineMonitorFunction(configuration)
                    readerOp   = new SplitReaderOperator(inputFormat)
                    return env.addSource(monitorFunc).transform(readerOp)
                else:
                    inputFormat = buildInputFormat(projectedFields, filterPredicates)
                    return env.addSource(inputFormat)

    return provider

The key theoretical insight is deferred execution: the provider does not build the execution graph immediately. Instead, it encapsulates the recipe for building the graph, allowing the framework's planner to make additional optimizations (e.g., parallelism inference, chaining decisions) before the graph is materialized.

This pattern also demonstrates strategy separation: the V1 (legacy) and V2 (FLIP-27) source implementations are selected via configuration without changing the external contract. Each strategy internally manages its own split discovery, enumeration, and reading lifecycle.

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment