Implementation:Apache Paimon FormatTableScan
| Knowledge Sources | |
|---|---|
| Domains | Format Tables, File Scanning |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
FormatTableScan scans format table directories to discover data files and create data splits with optional partition filtering.
Description
The FormatTableScan class provides functionality for scanning file-based format tables to discover data files. It recursively walks through the directory structure, identifies data files based on naming conventions, and constructs FormatDataSplit objects with file metadata and partition information.
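The recursive walk can be pictured with a small stand-alone sketch. It is not the FormatTableScan implementation: it assumes a local filesystem and the standard library only, and `FileEntry`, `discover_files`, and the `.parquet` suffix check are hypothetical names and choices made for illustration.

```python
import os
from dataclasses import dataclass
from typing import List


@dataclass
class FileEntry:
    """Hypothetical stand-in for the file metadata carried by a split."""
    file_path: str
    file_size: int


def discover_files(root: str, extension: str = ".parquet") -> List[FileEntry]:
    """Recursively collect files under root whose names end with extension."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Sorting in place makes os.walk visit subdirectories in a stable order.
        dirnames.sort()
        for name in sorted(filenames):
            if name.endswith(extension):
                path = os.path.join(dirpath, name)
                entries.append(FileEntry(path, os.path.getsize(path)))
    return entries
```

Each returned entry pairs a path with its size, which is the minimum a reader needs to open and budget a split; partition information is derived separately from the directory names.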
The scanner follows Hive-style partitioning conventions (one key=value directory per partition column) and supports both standard partitioning and partition-path-only-value mode, in which each directory carries only the value. It skips reserved directories (names starting with "." or "_", and directories named "schema") as well as hidden and temporary files.
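The path conventions described above can be sketched as two small helpers. This is an illustration under stated assumptions rather than the library's code: `is_reserved` and `parse_partition` are hypothetical names, the partition keys are assumed to be known in directory order, and escaped characters in directory names are not handled.

```python
from typing import Dict, List, Optional


def is_reserved(name: str) -> bool:
    """True for names the description says the scanner skips."""
    return name.startswith(".") or name.startswith("_") or name == "schema"


def parse_partition(
    relative_dir: str,
    partition_keys: List[str],
    only_value: bool = False,
) -> Optional[Dict[str, str]]:
    """Map directory segments to partition values; None if the path does not fit."""
    segments = [s for s in relative_dir.split("/") if s]
    if len(segments) != len(partition_keys):
        return None
    partition = {}
    for key, segment in zip(partition_keys, segments):
        if only_value:
            # Only-value mode: the directory name is the value itself.
            partition[key] = segment
        else:
            # Standard mode: the directory name must be "<key>=<value>".
            name, sep, value = segment.partition("=")
            if sep != "=" or name != key:
                return None
            partition[key] = value
    return partition
```

For example, `parse_partition("date=2024-01-01/region=us", ["date", "region"])` yields a dictionary with both values, while the same keys with `only_value=True` accept a path such as `2024-01-01/us`.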
Partition filtering is supported through equality-based partition specifications. When a partition filter is provided, only splits matching all specified partition values are included in the plan. The scanner collects file sizes and paths to create complete split information for the reader.
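Equality-based pruning amounts to a dictionary comparison. The sketch below assumes partitions and filters are plain string dictionaries; `matches_filter` is a hypothetical helper, not a function exposed by pypaimon.

```python
from typing import Dict, Optional


def matches_filter(
    partition: Dict[str, str],
    partition_filter: Optional[Dict[str, str]],
) -> bool:
    """Keep a split only if every filtered key has exactly the requested value."""
    if not partition_filter:
        # No filter (None or empty) means every split is kept.
        return True
    return all(partition.get(k) == v for k, v in partition_filter.items())
```

Keys that are absent from the filter are unconstrained, so a filter on `date` alone keeps every region under that date.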
Usage
Use FormatTableScan when you need to discover files in format tables, implement directory-based table readers, or build scan plans with partition pruning for format tables.
Code Reference
Source Location
- Repository: Apache_Paimon
- File: paimon-python/pypaimon/table/format/format_table_scan.py
Signature
class FormatTableScan:
    """Scanner for format table directories."""

    def __init__(
        self,
        table: FormatTable,
        partition_filter: Optional[Dict[str, str]] = None,
    ):
        """Initialize with table and optional partition filter."""

    def plan(self) -> Plan:
        """Scan directory and create execution plan with data splits."""
Import
from pypaimon.table.format.format_table_scan import FormatTableScan
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| table | FormatTable | Yes | Format table to scan |
| partition_filter | Dict[str, str] | No | Partition equality filter (e.g., {"date": "2024-01-01"}) |
Outputs
| Name | Type | Description |
|---|---|---|
| plan | Plan | Execution plan containing data splits |
Usage Examples
from pypaimon.table.format.format_table_scan import FormatTableScan
# Create scanner without filter
scanner = FormatTableScan(table=format_table)
plan = scanner.plan()
splits = plan.splits()
print(f"Found {len(splits)} data files")
# Scan with partition filter
scanner = FormatTableScan(
    table=format_table,
    partition_filter={"date": "2024-01-01", "region": "us-west"},
)
plan = scanner.plan()
filtered_splits = plan.splits()
# Process splits
for split in filtered_splits:
    print(f"File: {split.file_path}")
    print(f"Size: {split.file_size}")
    print(f"Partition: {split.partition}")
# Use with read builder
read_builder = format_table.new_read_builder()
read_builder = read_builder.with_partition_filter({"date": "2024-01-01"})
scanner = read_builder.new_scan()
plan = scanner.plan()
# Example directory structure:
# /data/
#   date=2024-01-01/
#     region=us/
#       data-001.parquet
#       data-002.parquet
#   date=2024-01-02/
#     region=us/
#       data-003.parquet
#
# Scanner discovers all parquet files and extracts partition values
# With partition-only-value mode
table.options["format-table.partition-path-only-value"] = "true"
# Directory structure: /data/2024-01-01/us/data-001.parquet
scanner = FormatTableScan(table=table)
plan = scanner.plan()