Implementation:Apache Paimon FormatTableScan

From Leeroopedia


Knowledge Sources
Domains: Format Tables, File Scanning
Last Updated: 2026-02-08 00:00 GMT

Overview

FormatTableScan scans format table directories to discover data files and create data splits with optional partition filtering.

Description

The FormatTableScan class scans file-based format tables to discover their data files. It walks the table's directory structure recursively, identifies data files by their naming conventions, and constructs FormatDataSplit objects that carry file metadata and partition information.
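The recursive walk described above can be illustrated with a minimal, standalone sketch. It is not the pypaimon implementation: `SplitSketch` and `discover_files` are hypothetical names, the local filesystem stands in for the table's file IO, and matching on a file extension is an assumed stand-in for the real naming conventions.

```python
import os
from dataclasses import dataclass
from typing import List


@dataclass
class SplitSketch:
    """Hypothetical stand-in for a data split: one file plus its size."""
    file_path: str
    file_size: int


def discover_files(root: str, extension: str = ".parquet") -> List[SplitSketch]:
    """Recursively collect files under root whose names end with extension.

    Assumption: data files are recognised by extension only; the real
    scanner may apply different naming rules.
    """
    found: List[SplitSketch] = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a deterministic order
        for name in sorted(filenames):
            if name.endswith(extension):
                path = os.path.join(dirpath, name)
                found.append(SplitSketch(path, os.path.getsize(path)))
    return found
```

Each returned record pairs a path with a size, mirroring the file metadata that the text says a split carries.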

The scanner follows Hive-style partitioning conventions (key=value directory names) and supports both standard partitioning and partition-path-only-value mode, in which directory names carry only the partition values (for example, 2024-01-01/us rather than date=2024-01-01/region=us). It skips reserved directories (names starting with "." or "_", and "schema" directories) as well as hidden and temporary files.
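As a rough illustration of these conventions, the sketch below maps a relative directory path to partition values and flags reserved names. `is_reserved` and `parse_partition` are hypothetical helpers, not pypaimon functions. Two assumptions are made: "schema" is matched as an exact name, and in value-only mode the path segments line up positionally with the table's partition keys.

```python
from typing import Dict, Sequence


def is_reserved(name: str) -> bool:
    """Return True for names the scan should skip (assumed rule)."""
    return name.startswith(".") or name.startswith("_") or name == "schema"


def parse_partition(
    relative_dir: str,
    partition_keys: Sequence[str],
    only_value: bool = False,
) -> Dict[str, str]:
    """Turn a relative directory path into a partition dict."""
    segments = [s for s in relative_dir.split("/") if s]
    if only_value:
        # Value-only mode: segments are positional values for partition_keys.
        return dict(zip(partition_keys, segments))
    partition: Dict[str, str] = {}
    for segment in segments:
        key, sep, value = segment.partition("=")
        if sep:  # keep only well-formed key=value segments
            partition[key] = value
    return partition
```

With keys `["date", "region"]`, both `date=2024-01-01/region=us` and, in value-only mode, `2024-01-01/us` resolve to the same partition dict.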

Partition filtering is supported through equality-based partition specifications. When a partition filter is provided, only splits matching all specified partition values are included in the plan. The scanner collects file sizes and paths to create complete split information for the reader.
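The equality-based matching described here amounts to checking that every key in the filter has the same value in the split's partition. A minimal sketch follows, with `matches_filter` as a hypothetical helper name and partitions assumed to be plain string dicts:

```python
from typing import Dict, Optional


def matches_filter(
    partition: Dict[str, str],
    partition_filter: Optional[Dict[str, str]],
) -> bool:
    """Return True when the partition satisfies every equality in the filter."""
    if not partition_filter:
        return True  # no filter: every split is kept
    return all(partition.get(key) == value for key, value in partition_filter.items())
```

A filter that names only some of the partition keys still matches, because unspecified keys are left unconstrained.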

Usage

Use FormatTableScan when you need to discover files in format tables, implement directory-based table readers, or build scan plans with partition pruning for format tables.

Code Reference

Source Location

Signature

class FormatTableScan:
    """Scanner for format table directories."""

    def __init__(
        self,
        table: FormatTable,
        partition_filter: Optional[Dict[str, str]] = None,
    ):
        """Initialize with table and optional partition filter."""

    def plan(self) -> Plan:
        """Scan directory and create execution plan with data splits."""

Import

from pypaimon.table.format.format_table_scan import FormatTableScan

I/O Contract

Inputs

Name | Type | Required | Description
table | FormatTable | Yes | Format table to scan
partition_filter | Dict[str, str] | No | Partition equality filter (e.g., {"date": "2024-01-01"})

Outputs

Name | Type | Description
plan | Plan | Execution plan containing data splits

Usage Examples

from pypaimon.table.format.format_table_scan import FormatTableScan

# Create scanner without filter
scanner = FormatTableScan(table=format_table)
plan = scanner.plan()
splits = plan.splits()
print(f"Found {len(splits)} data files")

# Scan with partition filter
scanner = FormatTableScan(
    table=format_table,
    partition_filter={"date": "2024-01-01", "region": "us-west"}
)
plan = scanner.plan()
filtered_splits = plan.splits()

# Process splits
for split in filtered_splits:
    print(f"File: {split.file_path}")
    print(f"Size: {split.file_size}")
    print(f"Partition: {split.partition}")

# Use with read builder
read_builder = format_table.new_read_builder()
read_builder = read_builder.with_partition_filter({"date": "2024-01-01"})
scanner = read_builder.new_scan()
plan = scanner.plan()

# Example directory structure:
# /data/
#   date=2024-01-01/
#     region=us/
#       data-001.parquet
#       data-002.parquet
#   date=2024-01-02/
#     region=us/
#       data-003.parquet
#
# Scanner discovers all parquet files and extracts partition values

# With partition-path-only-value mode
format_table.options["format-table.partition-path-only-value"] = "true"
# Directory structure: /data/2024-01-01/us/data-001.parquet
scanner = FormatTableScan(table=format_table)
plan = scanner.plan()
