Implementation:Apache Paimon FormatTableScan

Knowledge Sources	Apache_Paimon
Domains	Format Tables, File Scanning
Last Updated	2026-02-08 00:00 GMT

Overview

FormatTableScan scans format table directories to discover data files and create data splits with optional partition filtering.

Description

The FormatTableScan class scans file-based format tables to discover data files. It walks the table directory recursively, identifies data files by naming convention, and constructs FormatDataSplit objects that carry file metadata and partition information.
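The recursive walk can be pictured with the standard library alone. The sketch below is only an illustration, not pypaimon's implementation: `SimpleSplit` and `discover_files` are hypothetical names, the `.parquet` suffix is an assumed naming convention, and the real scanner presumably goes through the table's own file I/O layer rather than `os.walk`.

```python
import os
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleSplit:
    """Hypothetical stand-in for FormatDataSplit: one data file plus its size."""
    file_path: str
    file_size: int


def discover_files(root: str, suffix: str = ".parquet") -> List[SimpleSplit]:
    """Walk `root` recursively and return one split per file ending in `suffix`."""
    splits = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort makes the traversal order deterministic
        for name in sorted(filenames):
            if name.endswith(suffix):
                path = os.path.join(dirpath, name)
                splits.append(SimpleSplit(path, os.path.getsize(path)))
    return splits
```

Calling `discover_files("/data")` on a local directory would return one `SimpleSplit` per matching file, with the size read from the file system.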

The scanner follows Hive-style partitioning (key=value directory names) and also supports partition-path-only-value mode, in which each partition directory is named by the value alone. It skips reserved directories (names starting with "." or "_", and "schema" directories) as well as hidden and temporary files.
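To make the two partition layouts and the reserved-name rule concrete, here is a small sketch. The helper names `is_reserved` and `parse_partition` are hypothetical, and the strict checks (directory depth must equal the number of partition keys, keys must appear in declared order, "schema" is matched by exact name) are assumptions made for the example rather than confirmed pypaimon behavior.

```python
from typing import Dict, List, Optional


def is_reserved(name: str) -> bool:
    """True for names the sketch skips: "." or "_" prefixes, or exactly "schema"."""
    return name.startswith(".") or name.startswith("_") or name == "schema"


def parse_partition(
    dir_parts: List[str],
    partition_keys: List[str],
    only_value: bool = False,
) -> Optional[Dict[str, str]]:
    """Map directory names to partition values; return None if they do not fit."""
    if len(dir_parts) != len(partition_keys):
        return None
    if only_value:
        # Directory names are bare values, matched to keys by position.
        return dict(zip(partition_keys, dir_parts))
    partition = {}
    for key, part in zip(partition_keys, dir_parts):
        name, sep, value = part.partition("=")
        if sep != "=" or name != key:
            return None  # not a key=value directory for the expected key
        partition[key] = value
    return partition


# Hive-style layout: date=2024-01-01/region=us
hive = parse_partition(["date=2024-01-01", "region=us"], ["date", "region"])
# Value-only layout: 2024-01-01/us
bare = parse_partition(["2024-01-01", "us"], ["date", "region"], only_value=True)
```

Both calls produce the same mapping, `{"date": "2024-01-01", "region": "us"}`; the only difference is whether the key is read from the directory name or supplied by position.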

Partition filtering uses an equality-based partition specification. When a partition filter is provided, only splits whose partition matches every specified value are included in the plan. The scanner also records each file's path and size, so every split carries the information the reader needs.
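The equality rule amounts to a short predicate. This is a minimal sketch under the assumption that partitions and filters are plain string dictionaries; `matches_filter` is a hypothetical helper, not a pypaimon function.

```python
from typing import Dict, Optional


def matches_filter(
    partition: Dict[str, str],
    partition_filter: Optional[Dict[str, str]],
) -> bool:
    """True when every filter key is present in `partition` with an equal value."""
    if not partition_filter:
        return True  # no filter: keep every split
    return all(
        key in partition and partition[key] == value
        for key, value in partition_filter.items()
    )


partitions = [
    {"date": "2024-01-01", "region": "us"},
    {"date": "2024-01-02", "region": "us"},
]
kept = [p for p in partitions if matches_filter(p, {"date": "2024-01-01"})]
# kept == [{"date": "2024-01-01", "region": "us"}]
```

A filter may name only some of the partition keys; unnamed keys are left unconstrained, which is why the second key (`region`) does not affect the result here.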

Usage

Use FormatTableScan when you need to discover files in format tables, implement directory-based table readers, or build scan plans with partition pruning for format tables.

Code Reference

Source Location

Repository: Apache_Paimon
File: paimon-python/pypaimon/table/format/format_table_scan.py

Signature

class FormatTableScan:
    """Scanner for format table directories."""

    def __init__(
        self,
        table: FormatTable,
        partition_filter: Optional[Dict[str, str]] = None,
    ):
        """Initialize with table and optional partition filter."""

    def plan(self) -> Plan:
        """Scan directory and create execution plan with data splits."""

Import

from pypaimon.table.format.format_table_scan import FormatTableScan

I/O Contract

Inputs

Name	Type	Required	Description
table	FormatTable	Yes	Format table to scan
partition_filter	Dict[str, str]	No	Partition equality filter (e.g., {"date": "2024-01-01"})

Outputs

Name	Type	Description
plan	Plan	Execution plan containing data splits

Usage Examples

from pypaimon.table.format.format_table_scan import FormatTableScan

# Create scanner without filter (format_table is an existing FormatTable instance)
scanner = FormatTableScan(table=format_table)
plan = scanner.plan()
splits = plan.splits()
print(f"Found {len(splits)} data files")

# Scan with partition filter
scanner = FormatTableScan(
    table=format_table,
    partition_filter={"date": "2024-01-01", "region": "us-west"}
)
plan = scanner.plan()
filtered_splits = plan.splits()

# Process splits
for split in filtered_splits:
    print(f"File: {split.file_path}")
    print(f"Size: {split.file_size}")
    print(f"Partition: {split.partition}")

# Use with read builder
read_builder = format_table.new_read_builder()
read_builder = read_builder.with_partition_filter({"date": "2024-01-01"})
scanner = read_builder.new_scan()
plan = scanner.plan()

# Example directory structure:
# /data/
#   date=2024-01-01/
#     region=us/
#       data-001.parquet
#       data-002.parquet
#   date=2024-01-02/
#     region=us/
#       data-003.parquet
#
# Scanner discovers all parquet files and extracts partition values

# With partition-path-only-value mode
format_table.options["format-table.partition-path-only-value"] = "true"
# Directory structure: /data/2024-01-01/us/data-001.parquet
scanner = FormatTableScan(table=format_table)
plan = scanner.plan()
