Principle: Apache Paimon Format Table
| Knowledge Sources | |
|---|---|
| Domains | File_Format, Integration |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Format tables provide a unified abstraction for reading and writing external file formats as if they were native tables, enabling interoperability with diverse data ecosystems.
Description
The format table principle addresses the challenge of integrating with external data sources and sinks that use file formats other than the native table format. Rather than requiring data to be converted into the native format before processing, the system provides a table abstraction that can read and write formats like Parquet, ORC, Avro, CSV, or JSON directly. This abstraction presents external format files as queryable tables while preserving format-specific features and optimizations.
Format tables decouple the logical table interface from physical storage encoding. The table abstraction defines operations like scan, read, and write, while format-specific implementations handle the details of file parsing, schema mapping, and data encoding. This separation allows query engines to work with a consistent table API regardless of the underlying format. Schema evolution challenges are addressed through explicit mapping layers that translate between the table's logical schema and the format's physical schema, handling cases where field names, types, or nesting structures differ.
Read operations leverage format-specific optimizations like predicate pushdown, column pruning, and vectorized decoding. The format table scan generates splits based on file boundaries, similar to native tables, enabling parallel reading. Write operations coordinate format-specific encoders to generate files that comply with the target format's specification. Statistics collection during writes produces metadata compatible with the format's conventions, ensuring that generated files can be efficiently queried by external systems.
Usage
Apply format tables when building data lake interoperability layers, migrating between storage formats, or exposing external datasets through a unified query interface. This pattern is essential when you need to query heterogeneous data sources without ETL conversion overhead.
Theoretical Basis
The format table pattern provides format-agnostic table operations:
Abstract Format Table Interface
```
interface FormatTable:
    function newScan() -> FormatTableScan
    function newRead() -> FormatTableRead
    function newWrite() -> FormatTableWrite
    function schema() -> Schema

interface FormatTableScan:
    function plan() -> list<Split>
    function withFilter(predicate) -> FormatTableScan
    function withProjection(fields) -> FormatTableScan

interface FormatTableRead:
    function createReader(split) -> RecordReader
    function readBatch(split) -> RecordBatch

interface FormatTableWrite:
    function createWriter(partition) -> RecordWriter
    function write(records) -> DataFile
```
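A minimal Python sketch of these interfaces, following the method names in the pseudocode above. This is an illustration of the abstraction, not Paimon's actual Java API; the `Split` and `DataFile` types are placeholders.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterable, List


class FormatTableScan(ABC):
    """Plans splits; filter and projection are pushed down before planning."""

    @abstractmethod
    def plan(self) -> List["Split"]: ...

    @abstractmethod
    def with_filter(self, predicate: Callable[[dict], bool]) -> "FormatTableScan": ...

    @abstractmethod
    def with_projection(self, fields: List[str]) -> "FormatTableScan": ...


class FormatTableRead(ABC):
    """Turns a split into an iterator of records."""

    @abstractmethod
    def create_reader(self, split: "Split") -> Iterable[dict]: ...


class FormatTableWrite(ABC):
    """Encodes records into a format-specific data file."""

    @abstractmethod
    def write(self, records: List[dict]) -> "DataFile": ...


class FormatTable(ABC):
    """Entry point: a query engine sees only this interface, never the format."""

    @abstractmethod
    def new_scan(self) -> FormatTableScan: ...

    @abstractmethod
    def new_read(self) -> FormatTableRead: ...

    @abstractmethod
    def new_write(self) -> FormatTableWrite: ...
```

A concrete format (Parquet, CSV, JSON) subclasses each of these; the engine dispatches through the abstract base, keeping query logic format-agnostic.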
Format-Specific Implementation Pattern
```
class ParquetFormatTable implements FormatTable:
    fileSystem: FileSystem
    basePath: string
    schema: Schema
    parquetOptions: ParquetOptions

    function newScan():
        return ParquetFormatTableScan(
            fileSystem, basePath, schema, parquetOptions
        )

    function newRead():
        return ParquetFormatTableRead(
            fileSystem, schema, parquetOptions
        )

    function newWrite():
        return ParquetFormatTableWrite(
            fileSystem, basePath, schema, parquetOptions
        )

class ParquetFormatTableScan:
    fileSystem: FileSystem
    basePath: string
    filter: Predicate   // set via withFilter

    function plan():
        files = fileSystem.listFiles(basePath, "*.parquet")
        splits = []
        for each file in files:
            // Read the Parquet footer for row-group statistics
            footer = readParquetFooter(file)
            // Predicate pushdown: skip files whose statistics cannot match
            if filterMatchesStatistics(filter, footer.statistics):
                splits.add(createSplit(file, footer))
        return splits
```
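The pruning step in `plan()` can be sketched in Python. To keep the example self-contained and runnable, the file statistics are passed in as a precomputed map rather than read from a real Parquet footer; `plan_splits` and `Split` are illustrative names, not Paimon API.

```python
import os
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Split:
    path: str
    length: int


def plan_splits(
    base_path: str,
    suffix: str,
    file_stats: Dict[str, Tuple[int, int]],       # path -> (min, max) of the filter column
    predicate: Optional[Tuple[int, int]] = None,  # requested (lo, hi) value range
) -> List[Split]:
    """List data files and prune those whose statistics cannot match the filter."""
    splits = []
    for name in sorted(os.listdir(base_path)):
        if not name.endswith(suffix):
            continue
        path = os.path.join(base_path, name)
        if predicate is not None and path in file_stats:
            lo, hi = predicate
            fmin, fmax = file_stats[path]
            # Skip the file when its [min, max] range cannot overlap the predicate.
            if fmax < lo or fmin > hi:
                continue
        splits.append(Split(path, os.path.getsize(path)))
    return splits
```

The key property is that pruning happens at planning time: files that cannot contain matching rows never become splits, so they are never opened during the read phase.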
Schema Mapping
```
function mapSchemaToFormat(logicalSchema, targetFormat):
    if targetFormat == "parquet":
        return mapToParquetSchema(logicalSchema)
    else if targetFormat == "avro":
        return mapToAvroSchema(logicalSchema)
    else if targetFormat == "orc":
        return mapToOrcSchema(logicalSchema)
    else:
        throw UnsupportedFormatException(targetFormat)

function mapToParquetSchema(logicalSchema):
    parquetFields = []
    for each field in logicalSchema.fields:
        parquetType = mapLogicalTypeToParquetType(field.type)
        parquetFields.add(ParquetField(
            name: field.name,
            type: parquetType,
            repetition: field.nullable ? OPTIONAL : REQUIRED
        ))
    return ParquetSchema(parquetFields)
```
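A Python sketch of the logical-to-Parquet mapping. The type table here is a deliberately tiny, hypothetical subset; real mappings cover decimals, timestamps, and nested types, and the class names are illustrative.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical logical-to-Parquet physical type table (small illustrative subset).
_PARQUET_TYPES = {
    "int": "INT32",
    "long": "INT64",
    "string": "BYTE_ARRAY",
    "double": "DOUBLE",
}


@dataclass
class Field:
    name: str
    type: str
    nullable: bool = True


@dataclass
class ParquetField:
    name: str
    type: str
    repetition: str  # "OPTIONAL" (nullable) or "REQUIRED"


def map_to_parquet_schema(fields: List[Field]) -> List[ParquetField]:
    """Translate each logical field to its Parquet type and repetition level."""
    out = []
    for f in fields:
        if f.type not in _PARQUET_TYPES:
            raise ValueError(f"unsupported logical type: {f.type}")
        out.append(
            ParquetField(
                f.name,
                _PARQUET_TYPES[f.type],
                "OPTIONAL" if f.nullable else "REQUIRED",
            )
        )
    return out
```

Note how nullability maps to Parquet's repetition level rather than a separate flag: this is exactly the kind of structural difference the mapping layer absorbs.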
Read with Format-Specific Optimizations
```
class ParquetFormatTableRead:
    projection: list<Field>   // set from the scan's projection
    filter: Predicate         // set from the scan's filter

    function createReader(split):
        parquetFile = openParquetFile(split.filePath)
        reader = parquetFile.createReader()
        if projection != null:
            // Column pruning: decode only the projected columns
            reader.setProjection(projection)
        if filter != null:
            // Parquet dictionary filtering
            reader.setFilter(translateToDictionaryFilter(filter))
        // Enable vectorized (batch) decoding if the format supports it
        reader.setVectorizedReading(true)
        return reader
```
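The same read-path shape in runnable Python, using CSV from the standard library as a stand-in format so the sketch needs no Parquet dependency. `read_split` is an illustrative function, not Paimon API; the point is that filtering and pruning happen during decoding, not after the rows are materialized.

```python
import csv
from typing import Callable, Iterator, List, Optional


def read_split(
    path: str,
    projection: Optional[List[str]] = None,
    predicate: Optional[Callable[[dict], bool]] = None,
) -> Iterator[dict]:
    """Read one split, applying row filtering and column pruning while scanning."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Filter rows as they are decoded, before any downstream processing.
            if predicate is not None and not predicate(row):
                continue
            # Column pruning: emit only the projected fields.
            if projection is not None:
                row = {k: row[k] for k in projection}
            yield row
```

A real Parquet reader prunes columns before decoding them at all, which is where most of the saving comes from; the CSV stand-in can only drop fields after parsing, but the interface contract is the same.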
Write with Format-Specific Features
```
class ParquetFormatTableWrite:
    function write(records):
        writer = createParquetWriter(
            schema: schema,
            compression: parquetOptions.compression,
            pageSize: parquetOptions.pageSize,
            rowGroupSize: parquetOptions.rowGroupSize
        )
        statistics = new Statistics()
        for each record in records:
            // Convert logical row to the Parquet record layout
            parquetRecord = convertToParquetRecord(record)
            writer.write(parquetRecord)
            // Collect statistics
            statistics.update(record)
        dataFile = writer.close()
        return DataFile(
            path: dataFile.path,
            size: dataFile.size,
            rowCount: records.length,
            statistics: statistics
        )
```
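The write path can be sketched in Python with JSON lines as the stand-in encoding, again so the example is self-contained. `write_data_file` and `DataFile` are illustrative names; the part that mirrors the pseudocode is collecting per-column min/max statistics inline with encoding, so the resulting metadata enables the pruning shown in the scan phase.

```python
import json
import os
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class DataFile:
    path: str
    size: int
    row_count: int
    statistics: Dict[str, Tuple[int, int]]  # column -> (min, max)


def write_data_file(path: str, records: List[dict], stat_columns: List[str]) -> DataFile:
    """Encode records (here: JSON lines) and collect min/max statistics as we go."""
    stats: Dict[str, Tuple[int, int]] = {}
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
            # Update running min/max per tracked column, one pass, no buffering.
            for col in stat_columns:
                v = rec[col]
                lo, hi = stats.get(col, (v, v))
                stats[col] = (min(lo, v), max(hi, v))
    return DataFile(path, os.path.getsize(path), len(records), stats)
```

Because statistics are gathered during the single encoding pass, writes stay streaming-friendly: no second scan over the data is needed to produce the file-level metadata.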
Format Discovery and Registry
```
interface FormatFactory:
    function createFormatTable(options) -> FormatTable

class FormatRegistry:
    factories: map<string, FormatFactory>

    function register(formatName, factory):
        factories.put(formatName, factory)

    function createTable(formatName, options):
        factory = factories.get(formatName)
        if factory == null:
            throw UnsupportedFormatException(formatName)
        return factory.createFormatTable(options)
```
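A direct Python translation of the registry, using plain callables as factories. The names mirror the pseudocode above; this is a sketch of the pattern, not Paimon's actual factory SPI (which discovers factories via Java service loading).

```python
from typing import Any, Callable, Dict


class UnsupportedFormatException(Exception):
    """Raised when no factory is registered for the requested format name."""


class FormatRegistry:
    """Maps format names to factory callables that build format tables."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[dict], Any]] = {}

    def register(self, format_name: str, factory: Callable[[dict], Any]) -> None:
        self._factories[format_name] = factory

    def create_table(self, format_name: str, options: dict) -> Any:
        factory = self._factories.get(format_name)
        if factory is None:
            raise UnsupportedFormatException(format_name)
        return factory(options)
```

Keeping creation behind a registry means new formats plug in without touching engine code: registering a factory under a name is the entire integration surface.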