Principle:Apache Paimon Format Table

From Leeroopedia


Knowledge Sources
Domains File_Format, Integration
Last Updated 2026-02-08 00:00 GMT

Overview

Format tables provide a unified abstraction for reading and writing external file formats as if they were native tables, enabling interoperability with diverse data ecosystems.

Description

The format table principle addresses the challenge of integrating with external data sources and sinks that use different file formats than the native table format. Rather than requiring data to be converted into a specific format before processing, the system provides a table abstraction that can read and write formats like Parquet, ORC, Avro, CSV, or JSON directly. This abstraction presents external format files as queryable tables while preserving format-specific features and optimizations.

Format tables decouple the logical table interface from physical storage encoding. The table abstraction defines operations like scan, read, and write, while format-specific implementations handle the details of file parsing, schema mapping, and data encoding. This separation allows query engines to work with a consistent table API regardless of the underlying format. Schema evolution challenges are addressed through explicit mapping layers that translate between the table's logical schema and the format's physical schema, handling cases where field names, types, or nesting structures differ.

Read operations leverage format-specific optimizations like predicate pushdown, column pruning, and vectorized decoding. The format table scan generates splits based on file boundaries, similar to native tables, enabling parallel reading. Write operations coordinate format-specific encoders to generate files that comply with the target format's specification. Statistics collection during writes produces metadata compatible with the format's conventions, ensuring that generated files can be efficiently queried by external systems.

Usage

Apply format tables when building data lake interoperability layers, migrating between storage formats, or exposing external datasets through a unified query interface. This pattern is essential when you need to query heterogeneous data sources without ETL conversion overhead.

Theoretical Basis

The format table pattern provides format-agnostic table operations:

Abstract Format Table Interface

interface FormatTable:
    function newScan() -> FormatTableScan
    function newRead() -> FormatTableRead
    function newWrite() -> FormatTableWrite
    function schema() -> Schema

interface FormatTableScan:
    function plan() -> list<Split>
    function withFilter(predicate) -> FormatTableScan
    function withProjection(fields) -> FormatTableScan

interface FormatTableRead:
    function createReader(split) -> RecordReader
    function readBatch(split) -> RecordBatch

interface FormatTableWrite:
    function createWriter(partition) -> RecordWriter
    function write(records) -> DataFile
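The interfaces above can be mirrored in executable form. The following is a minimal Python sketch (Paimon itself is Java; these class and field names follow the pseudocode, not any real Paimon API), with a toy in-memory table to show the scan contract in action:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Split:
    """A unit of parallel reading, typically one file or file range."""
    file_path: str

class FormatTableScan(ABC):
    @abstractmethod
    def plan(self):
        """Return the list of splits to read in parallel."""

class FormatTable(ABC):
    @abstractmethod
    def new_scan(self):
        """Create a scan over the table's files."""

# Toy implementation: a "table" backed by a fixed list of file names.
class ListScan(FormatTableScan):
    def __init__(self, files):
        self.files = files

    def plan(self):
        return [Split(f) for f in self.files]

class ListTable(FormatTable):
    def __init__(self, files):
        self.files = files

    def new_scan(self):
        return ListScan(self.files)

splits = ListTable(["a.parquet", "b.parquet"]).new_scan().plan()
```

The query engine only ever sees `FormatTable` and `Split`; swapping `ListTable` for a Parquet- or ORC-backed implementation requires no engine changes.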

Format-Specific Implementation Pattern

class ParquetFormatTable implements FormatTable:
    fileSystem: FileSystem
    basePath: string
    schema: Schema
    parquetOptions: ParquetOptions

    function newScan():
        return ParquetFormatTableScan(
            fileSystem, basePath, schema, parquetOptions
        )

    function newRead():
        return ParquetFormatTableRead(
            fileSystem, schema, parquetOptions
        )

    function newWrite():
        return ParquetFormatTableWrite(
            fileSystem, basePath, schema, parquetOptions
        )

class ParquetFormatTableScan:
    function plan():
        files = fileSystem.listFiles(basePath, "*.parquet")

        splits = []
        for each file in files:
            // Read Parquet footer for statistics
            footer = readParquetFooter(file)

            // Apply predicate pushdown using Parquet statistics
            if filterMatchesStatistics(filter, footer.statistics):
                splits.add(createSplit(file, footer))

        return splits
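The pruning step in `plan()` can be sketched concretely. This example assumes each file's footer exposes per-column min/max (as Parquet row-group statistics do); the file names and the range-predicate shape are invented for illustration:

```python
# Fake footers: per-column (min, max) statistics for each file.
footers = {
    "part-0.parquet": {"id": (1, 100)},
    "part-1.parquet": {"id": (101, 200)},
    "part-2.parquet": {"id": (201, 300)},
}

def filter_matches_statistics(predicate, stats):
    """predicate is a (column, low, high) range filter."""
    col, lo, hi = predicate
    file_min, file_max = stats[col]
    # Keep the file only if its [min, max] range can overlap the predicate.
    return file_max >= lo and file_min <= hi

def plan(predicate):
    return [path for path, stats in footers.items()
            if filter_matches_statistics(predicate, stats)]

# id BETWEEN 150 AND 250: part-0 is pruned without reading its data pages.
splits = plan(("id", 150, 250))
```

Note the check is conservative: overlapping statistics only mean a file *may* contain matching rows, so the reader still re-applies the predicate per record.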

Schema Mapping

function mapSchemaToFormat(logicalSchema, targetFormat):
    if targetFormat == "parquet":
        return mapToParquetSchema(logicalSchema)
    else if targetFormat == "avro":
        return mapToAvroSchema(logicalSchema)
    else if targetFormat == "orc":
        return mapToOrcSchema(logicalSchema)
    else:
        throw UnsupportedFormatException(targetFormat)

function mapToParquetSchema(logicalSchema):
    parquetFields = []

    for each field in logicalSchema.fields:
        parquetType = mapLogicalTypeToParquetType(field.type)
        parquetFields.add(ParquetField(
            name: field.name,
            type: parquetType,
            repetition: field.nullable ? OPTIONAL : REQUIRED
        ))

    return ParquetSchema(parquetFields)
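A runnable sketch of the logical-to-Parquet mapping follows. The type table is deliberately tiny and illustrative; it is not Paimon's actual mapping rules:

```python
from dataclasses import dataclass

@dataclass
class Field:
    name: str
    type: str
    nullable: bool

# Illustrative logical-type -> Parquet physical-type table (assumed).
TYPE_MAP = {"INT": "INT32", "BIGINT": "INT64", "STRING": "BYTE_ARRAY (UTF8)"}

def map_to_parquet_schema(fields):
    out = []
    for f in fields:
        out.append({
            "name": f.name,
            "type": TYPE_MAP[f.type],
            # Nullable logical fields become OPTIONAL, others REQUIRED.
            "repetition": "OPTIONAL" if f.nullable else "REQUIRED",
        })
    return out

schema = map_to_parquet_schema([Field("id", "BIGINT", False),
                                Field("name", "STRING", True)])
```

The key point is that nullability maps to Parquet's repetition level rather than to a separate flag, which is exactly the kind of structural difference the mapping layer exists to absorb.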

Read with Format-Specific Optimizations

class ParquetFormatTableRead:
    function createReader(split):
        parquetFile = openParquetFile(split.filePath)

        // Enable Parquet-specific optimizations
        reader = parquetFile.createReader()

        if projection != null:
            // Column pruning
            reader.setProjection(projection)

        if filter != null:
            // Parquet dictionary filtering
            reader.setFilter(translateToDictionaryFilter(filter))

        // Enable vectorized reading if supported
        reader.setVectorizedReading(true)

        return reader
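Column pruning, the effect of `reader.setProjection` above, can be sketched with a reader that materializes only the projected columns (rows and column names here are invented; a real columnar reader would skip the pruned columns' pages on disk rather than filter decoded dicts):

```python
rows = [{"id": 1, "name": "a", "payload": "x" * 10},
        {"id": 2, "name": "b", "payload": "y" * 10}]

def create_reader(rows, projection=None):
    """Yield rows, restricted to the projected columns if any."""
    for row in rows:
        if projection is None:
            yield row
        else:
            # Only the requested columns are materialized.
            yield {col: row[col] for col in projection}

pruned = list(create_reader(rows, projection=["id", "name"]))
```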

Write with Format-Specific Features

class ParquetFormatTableWrite:
    function write(records):
        writer = createParquetWriter(
            schema: schema,
            compression: parquetOptions.compression,
            pageSize: parquetOptions.pageSize,
            rowGroupSize: parquetOptions.rowGroupSize
        )

        statistics = new Statistics()

        for each record in records:
            // Convert logical row to Parquet format
            parquetRecord = convertToParquetRecord(record)
            writer.write(parquetRecord)

            // Collect statistics
            statistics.update(record)

        dataFile = writer.close()

        return DataFile(
            path: dataFile.path,
            size: dataFile.size,
            rowCount: records.length,
            statistics: statistics
        )
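The `Statistics` object above can be sketched as per-column min/max and null-count accumulation. Everything here is in-memory and the column names are invented; a real writer would also emit these values into the Parquet footer:

```python
class Statistics:
    def __init__(self):
        self.min = {}
        self.max = {}
        self.null_count = {}

    def update(self, record):
        """Fold one record (a column -> value dict) into the accumulators."""
        for col, value in record.items():
            if value is None:
                self.null_count[col] = self.null_count.get(col, 0) + 1
                continue
            if col not in self.min or value < self.min[col]:
                self.min[col] = value
            if col not in self.max or value > self.max[col]:
                self.max[col] = value

stats = Statistics()
for row in [{"id": 3, "name": "a"},
            {"id": 1, "name": None},
            {"id": 7, "name": "z"}]:
    stats.update(row)
```

Because these values match the format's own statistics conventions, external engines can use the resulting files for pruning without knowing anything about the system that wrote them.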

Format Discovery and Registry

interface FormatFactory:
    function createFormatTable(options) -> FormatTable

class FormatRegistry:
    factories: map<string, FormatFactory>

    function register(formatName, factory):
        factories.put(formatName, factory)

    function createTable(formatName, options):
        factory = factories.get(formatName)
        if factory == null:
            throw UnsupportedFormatException(formatName)

        return factory.createFormatTable(options)
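A minimal executable version of the registry follows; plain factory functions stand in for `FormatFactory` implementations, and the returned dict stands in for a real `FormatTable`:

```python
class UnsupportedFormatException(Exception):
    pass

class FormatRegistry:
    def __init__(self):
        self._factories = {}

    def register(self, format_name, factory):
        self._factories[format_name] = factory

    def create_table(self, format_name, options):
        factory = self._factories.get(format_name)
        if factory is None:
            raise UnsupportedFormatException(format_name)
        return factory(options)

registry = FormatRegistry()
# A stub factory; a real one would construct e.g. a ParquetFormatTable.
registry.register("parquet", lambda opts: {"format": "parquet", **opts})
table = registry.create_table("parquet", {"path": "/data"})
```

Registering by name keeps format support pluggable: adding ORC or Avro means registering another factory, with no changes to the code that looks tables up.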
