Implementation:Apache Paimon FormatTable
| Knowledge Sources | |
|---|---|
| Domains | Format Tables, Table Abstraction |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
FormatTable represents a table stored in a specific file format (Parquet, ORC, CSV, JSON, Text) without Paimon's LSM-tree structure.
Description
The FormatTable class provides a table abstraction for data stored in standard file formats. Unlike a FileStoreTable, which uses Paimon's snapshot-based LSM-tree architecture, a FormatTable reads and writes files directly in their native format, with optional partitioning.
The class implements the Table interface and supports multiple file formats defined by the Format enum: ORC, PARQUET, CSV, TEXT, and JSON. It maintains table schema, file format, location, and configuration options while providing access to field definitions, partition keys, and table metadata.
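To make the parsing behaviour concrete, here is a minimal standalone sketch of a string-backed enum with a `parse` classmethod. It mirrors the shape of the Format enum described above but is not the pypaimon implementation: the class name `FileFormatSketch`, the case-insensitive matching, the whitespace trimming, and the `ValueError` on unknown input are all assumptions.

```python
from enum import Enum


class FileFormatSketch(str, Enum):
    """Standalone stand-in for the Format enum (not the pypaimon class)."""

    ORC = "orc"
    PARQUET = "parquet"
    CSV = "csv"
    TEXT = "text"
    JSON = "json"

    @classmethod
    def parse(cls, file_format: str) -> "FileFormatSketch":
        # Assumption: matching ignores case and surrounding whitespace.
        normalized = file_format.strip().lower()
        for member in cls:
            if member.value == normalized:
                return member
        # Assumption: unknown formats raise ValueError.
        supported = ", ".join(m.value for m in cls)
        raise ValueError(
            f"Unsupported file format {file_format!r}; expected one of: {supported}"
        )


fmt = FileFormatSketch.parse("ORC")
print(fmt is FileFormatSketch.ORC)  # True
print(fmt == "orc")                 # True, because members are also str instances
```

Deriving from both `str` and `Enum` lets a member be compared with, or passed where, a plain string is expected, which is convenient when the format name comes from table options.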
FormatTable has no primary keys, because it does not support key-based operations, but it does support partition keys for organizing data. It provides factory methods for creating read builders and batch write builders. Stream writes are not supported: new_stream_write_builder raises NotImplementedError.
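Code that handles several table types may want to guard against the missing stream-write support. The following is a minimal sketch that relies only on the documented behaviour that `new_stream_write_builder` raises `NotImplementedError`. `StubFormatTable` and `pick_write_builder` are hypothetical names, introduced so the example runs without a catalog or pypaimon installed.

```python
class StubFormatTable:
    """Hypothetical stand-in exposing the two write-builder factories."""

    def new_batch_write_builder(self):
        return "batch-write-builder"  # placeholder for a real builder object

    def new_stream_write_builder(self):
        raise NotImplementedError("FormatTable does not support stream write")


def pick_write_builder(table, prefer_stream: bool = False):
    """Return (mode, builder), falling back to batch when streaming is unsupported."""
    if prefer_stream:
        try:
            return "stream", table.new_stream_write_builder()
        except NotImplementedError:
            # Stream write is unavailable for this table type; use batch instead.
            pass
    return "batch", table.new_batch_write_builder()


mode, builder = pick_write_builder(StubFormatTable(), prefer_stream=True)
print(mode)  # batch
```

Catching the exception at the call site keeps the fallback explicit, rather than checking the table's concrete type.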
Usage
Use FormatTable when working with external data in standard formats, when importing data into or exporting data from Paimon, or when you need simple file-based storage without versioning or merge capabilities.
Code Reference
Source Location
- Repository: Apache_Paimon
- File: paimon-python/pypaimon/table/format/format_table.py
Signature
class Format(str, Enum):
    """Supported file formats."""
    ORC = "orc"
    PARQUET = "parquet"
    CSV = "csv"
    TEXT = "text"
    JSON = "json"

    @classmethod
    def parse(cls, file_format: str) -> "Format":
        """Parse file format string."""

class FormatTable(Table):
    """Table stored in a specific file format."""

    def __init__(
        self,
        file_io: FileIO,
        identifier: Identifier,
        table_schema: TableSchema,
        location: str,
        format: Format,
        options: Optional[Dict[str, str]] = None,
        comment: Optional[str] = None,
    ):
        """Initialize with table metadata and format."""

    def name(self) -> str:
        """Get table name."""

    def full_name(self) -> str:
        """Get full table name (database.table)."""

    def location(self) -> str:
        """Get table location."""

    def format(self) -> Format:
        """Get file format."""

    def options(self) -> Dict[str, str]:
        """Get table options."""

    def new_read_builder(self):
        """Create a new read builder."""

    def new_batch_write_builder(self):
        """Create a new batch write builder."""

    def new_stream_write_builder(self):
        """Raise NotImplementedError - stream write not supported."""
Import
from pypaimon.table.format.format_table import FormatTable, Format
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| file_io | FileIO | Yes | File I/O handler |
| identifier | Identifier | Yes | Table identifier |
| table_schema | TableSchema | Yes | Table schema definition |
| location | str | Yes | Root location for table data |
| format | Format | Yes | File format (Parquet, ORC, CSV, JSON, Text) |
| options | Optional[Dict[str, str]] | No | Table configuration options (defaults to None) |
| comment | Optional[str] | No | Table comment/description (defaults to None) |
Outputs
| Name | Type | Description |
|---|---|---|
| table | FormatTable | Format table instance |
| name | str | Table name |
| location | str | Table root location |
| format | Format | File format enum value |
Usage Examples
import pandas as pd

from pypaimon.table.format.format_table import FormatTable, Format
from pypaimon.schema.table_schema import TableSchema
from pypaimon.schema.data_types import DataField, AtomicType
from pypaimon.common.identifier import Identifier  # module path assumed

# Create schema
schema = TableSchema(
    version=3,
    id=1,
    fields=[
        DataField(0, "id", AtomicType("BIGINT")),
        DataField(1, "name", AtomicType("STRING")),
        DataField(2, "date", AtomicType("STRING")),
    ],
    highest_field_id=2,
    partition_keys=["date"],
    primary_keys=[],
    options={},
)

# Create Parquet format table
# file_io is assumed to be an existing FileIO instance created elsewhere
table = FormatTable(
    file_io=file_io,
    identifier=Identifier.create("my_db", "my_table"),
    table_schema=schema,
    location="/path/to/data",
    format=Format.PARQUET,
    options={"format-table.partition-path-only-value": "false"},
)

# Read from format table, projecting two columns
read_builder = table.new_read_builder()
read_builder = read_builder.with_projection(["id", "name"])
scan = read_builder.new_scan()
read = read_builder.new_read()
splits = scan.plan().splits()
df = read.to_pandas(splits)

# Write to format table. The frame must contain every schema column,
# including the partition key, so the projected df above is not reused.
new_rows = pd.DataFrame({"id": [1], "name": ["alice"], "date": ["2024-01-01"]})
write_builder = table.new_batch_write_builder()
writer = write_builder.new_write()
writer.write_pandas(new_rows)
commit_messages = writer.prepare_commit()

# Parse format from string
fmt = Format.parse("orc")  # Returns Format.ORC

# Inspect table metadata
print(f"Table: {table.full_name()}")
print(f"Location: {table.location()}")
print(f"Format: {table.format().value}")
print(f"Partition keys: {table.partition_keys}")
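The `format-table.partition-path-only-value` option passed to the table above affects how partition directories are named. The sketch below illustrates the two layouts as this option is commonly described: Hive-style `key=value` directories when it is `false`, and bare values when it is `true`. The helper `partition_dir` is hypothetical and not part of pypaimon. Treat the layout as an assumption and check the Paimon configuration reference for the authoritative behaviour.

```python
from typing import Dict, List


def partition_dir(
    partition_keys: List[str], row: Dict[str, str], only_value: bool = False
) -> str:
    """Build a relative partition directory for one row (illustrative helper)."""
    parts = []
    for key in partition_keys:
        value = str(row[key])
        # only_value=False gives "date=2024-01-01"; only_value=True gives "2024-01-01"
        parts.append(value if only_value else f"{key}={value}")
    return "/".join(parts)


row = {"id": "1", "name": "alice", "date": "2024-01-01"}
print(partition_dir(["date"], row))                   # date=2024-01-01
print(partition_dir(["date"], row, only_value=True))  # 2024-01-01
```

With the `key=value` layout, external engines that follow Hive conventions can usually discover partition columns from the directory names, whereas the value-only layout requires the reader to already know the partition key order.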