Implementation:Apache Paimon PaimonVirtualFileSystem
| Knowledge Sources | |
|---|---|
| Domains | Filesystem Abstraction, Storage Integration |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
PaimonVirtualFileSystem (PVFS) is an fsspec-compatible virtual filesystem for the "pvfs://" protocol. It maps Paimon catalog paths to actual storage locations through the REST API, giving transparent access to tables across different storage backends.
Description
PaimonVirtualFileSystem extends `fsspec.AbstractFileSystem` under the "pvfs" protocol and presents Paimon tables through hierarchical paths of the form `pvfs://catalog/database/table/subpath`. It parses each path into an identifier for its level (`PVFSCatalogIdentifier`, `PVFSDatabaseIdentifier`, or `PVFSTableIdentifier`) and dispatches the filesystem operation accordingly. Catalog- and database-level operations (ls, exists, mkdir, rm) call the REST API, for example to list, create, or drop databases. Table-level operations resolve the table's actual storage location through the REST API call `get_table()`, determine the `StorageType` (LOCAL or OSS), create the matching underlying filesystem (LocalFileSystem, or OSSFileSystem via ossfs), and delegate the actual I/O to it.
Caching combines TTL and LRU strategies: table metadata uses a TTLCache (default 300 s), REST API clients use an LRUCache per catalog/endpoint, and filesystem instances use an LRUCache with refresh logic for temporary credentials. All caches are protected by read-write locks for thread safety.
Path translation converts virtual pvfs:// paths to and from actual storage paths by tracking storage location prefixes. The implementation supports database/table CRUD, file I/O (open, cat_file, get_file), directory operations (mkdir, makedirs, rm, rmdir), metadata queries (info, created, modified), and file operations (cp_file, mv). External paths are supported through `ExternalPathProvider` for object table scenarios.
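The path-parsing step described above can be illustrated with a minimal, standard-library sketch. `ParsedPath` and `parse_pvfs_path` are hypothetical names used only for illustration; they stand in for the `PVFS*Identifier` classes and are not taken from pypaimon, whose actual parsing rules may differ.

```python
from dataclasses import dataclass
from typing import Optional

PROTOCOL_PREFIX = "pvfs://"


@dataclass(frozen=True)
class ParsedPath:
    """Hypothetical stand-in for the PVFS*Identifier classes."""
    catalog: str
    database: Optional[str] = None
    table: Optional[str] = None
    sub_path: Optional[str] = None

    @property
    def level(self) -> str:
        # The deepest component present decides which kind of
        # operation (catalog, database, or table) the path targets.
        if self.table is not None:
            return "table"
        if self.database is not None:
            return "database"
        return "catalog"


def parse_pvfs_path(path: str) -> ParsedPath:
    """Split pvfs://catalog/database/table/subpath into its components."""
    if not path.startswith(PROTOCOL_PREFIX):
        raise ValueError(f"not a pvfs path: {path!r}")
    # Drop empty segments so trailing or doubled slashes are tolerated.
    parts = [p for p in path[len(PROTOCOL_PREFIX):].split("/") if p]
    if not parts:
        raise ValueError(f"missing catalog name: {path!r}")
    catalog = parts[0]
    database = parts[1] if len(parts) > 1 else None
    table = parts[2] if len(parts) > 2 else None
    sub_path = "/".join(parts[3:]) if len(parts) > 3 else None
    return ParsedPath(catalog, database, table, sub_path)
```

A dispatcher could then branch on `level`: catalog and database paths would go to the REST API, while table paths would be resolved to a storage location first.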
This abstraction enables tools like pandas, PyArrow, and other fsspec-compatible libraries to access Paimon tables using virtual paths without knowledge of underlying storage systems, credentials, or catalog APIs.
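To make the caching behaviour in the Description more concrete, here is a small sketch of a thread-safe cache that combines time-based expiry with least-recently-used eviction. It is an illustration under stated assumptions, not the PVFS code: the class name `TTLLRUCache` is hypothetical, a plain `threading.Lock` stands in for the read-write lock mentioned above (the standard library has none), and the clock is injectable only so that expiry can be exercised without waiting.

```python
import threading
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class TTLLRUCache:
    """Hypothetical cache: entries expire after `ttl` seconds and the
    least recently used entry is evicted once `maxsize` is exceeded."""

    def __init__(self, maxsize: int, ttl: float,
                 clock: Callable[[], float] = time.monotonic):
        self._maxsize = maxsize
        self._ttl = ttl
        self._clock = clock
        # Simplification: one mutex instead of a read-write lock.
        self._lock = threading.Lock()
        # Maps key -> (expiry time, value), ordered oldest-used first.
        self._entries: "OrderedDict[Hashable, tuple]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                return None
            expires_at, value = entry
            if self._clock() >= expires_at:
                # Stale entry: drop it so the caller refetches.
                del self._entries[key]
                return None
            self._entries.move_to_end(key)  # mark as recently used
            return value

    def put(self, key: Hashable, value: Any) -> None:
        with self._lock:
            self._entries[key] = (self._clock() + self._ttl, value)
            self._entries.move_to_end(key)
            while len(self._entries) > self._maxsize:
                self._entries.popitem(last=False)  # evict the LRU entry
```

With `ttl=300`, a table's metadata would be looked up through the REST API at most once every five minutes per key, at the cost of serving data that may be up to that old.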
Usage
PVFS is registered as an fsspec protocol and can be used with any fsspec-compatible library by providing pvfs:// URIs with appropriate options.
Code Reference
Source Location
- Repository: Apache_Paimon
- File: paimon-python/pypaimon/filesystem/pvfs.py
Signature
class PaimonVirtualFileSystem(fsspec.AbstractFileSystem):
    protocol = "pvfs"

    def __init__(self, options: Union[Options, Dict[str, str]] = None, **kwargs): ...
    def ls(self, path, detail=True, **kwargs): ...
    def info(self, path, **kwargs): ...
    def exists(self, path, **kwargs): ...
    def open(self, path, mode="rb", **kwargs): ...
    def mkdir(self, path, create_parents=True, **kwargs): ...
    def rm(self, path, recursive=False, maxdepth=None): ...
    def mv(self, path1, path2, recursive=False, maxdepth=None, **kwargs): ...
    def cp_file(self, path1, path2, **kwargs): ...
    def cat_file(self, path, start=None, end=None, **kwargs): ...
    def created(self, path): ...
    def modified(self, path): ...
Import
from pypaimon.filesystem.pvfs import PaimonVirtualFileSystem
# Or register and use via fsspec
import fsspec
fs = fsspec.filesystem("pvfs", options={...})
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| options | Options/Dict | yes | Configuration including URI, warehouse, authentication |
| path | str | yes | Virtual path in pvfs://catalog/database/table/subpath format |
Outputs
| Name | Type | Description |
|---|---|---|
| File listing | List[Dict/str] | Directory contents with details or paths |
| File info | Dict | File metadata (name, size, type, mtime) |
| File content | bytes/File object | Data from open(), cat_file(), get_file() |
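The `path` input is virtual, while the underlying filesystem works on real storage paths, and listings have to be mapped back before they are returned. The prefix-based translation mentioned in the Description might look like the sketch below. Both function names and the example storage location are hypothetical, and the real implementation may handle schemes and edge cases differently.

```python
def to_actual_path(virtual_path: str, virtual_prefix: str,
                   storage_location: str) -> str:
    """Replace the table's virtual prefix with its storage location."""
    prefix = virtual_prefix.rstrip("/")
    base = storage_location.rstrip("/")
    # Require a whole-segment match so "users2" is not taken for "users".
    if virtual_path != prefix and not virtual_path.startswith(prefix + "/"):
        raise ValueError(f"{virtual_path!r} is not under {prefix!r}")
    return base + virtual_path[len(prefix):]


def to_virtual_path(actual_path: str, virtual_prefix: str,
                    storage_location: str) -> str:
    """Inverse mapping, used when returning listings to the caller."""
    prefix = virtual_prefix.rstrip("/")
    base = storage_location.rstrip("/")
    if actual_path != base and not actual_path.startswith(base + "/"):
        raise ValueError(f"{actual_path!r} is not under {base!r}")
    return prefix + actual_path[len(base):]


# Hypothetical table prefix and storage location, for illustration only.
table_prefix = "pvfs://my_catalog/default/users"
location = "/tmp/warehouse/default.db/users"
actual = to_actual_path(table_prefix + "/bucket-0/data-001.parquet",
                        table_prefix, location)
```

Keeping the two functions symmetric means a path that is translated to storage and back comes out unchanged, which is what lets `ls` return pvfs:// paths even though the listing came from the underlying filesystem.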
Usage Examples
List Databases and Tables
from pypaimon.filesystem.pvfs import PaimonVirtualFileSystem
# Initialize PVFS
fs = PaimonVirtualFileSystem(options={
    "uri": "http://localhost:8080",
    "warehouse": "my_warehouse"
})
# List databases (detail=False returns plain paths; the default detail=True returns info dicts)
databases = fs.ls("pvfs://my_catalog", detail=False)
# e.g. ['pvfs://my_catalog/default', 'pvfs://my_catalog/test_db']
# List tables
tables = fs.ls("pvfs://my_catalog/default", detail=False)
# e.g. ['pvfs://my_catalog/default/users', 'pvfs://my_catalog/default/orders']
# List table files
files = fs.ls("pvfs://my_catalog/default/users", detail=False)
# e.g. ['pvfs://my_catalog/default/users/manifest', 'pvfs://my_catalog/default/users/snapshot', ...]
Read/Write Files
# Read a manifest file
with fs.open("pvfs://my_catalog/default/users/manifest/manifest-123", "rb") as f:
    data = f.read()
# Write a file (for object tables)
with fs.open("pvfs://my_catalog/default/documents/doc1.txt", "wb") as f:
    f.write(b"Hello, Paimon!")
# Read file content
content = fs.cat_file("pvfs://my_catalog/default/documents/doc1.txt")
Database Operations
# Create database
fs.mkdir("pvfs://my_catalog/new_db")
# Drop database
fs.rm("pvfs://my_catalog/new_db", recursive=True)
# Check if database exists
exists = fs.exists("pvfs://my_catalog/default") # True
Table Operations
# Create object table
fs.mkdir("pvfs://my_catalog/default/documents")
# Move/rename table
fs.mv(
    "pvfs://my_catalog/default/old_table",
    "pvfs://my_catalog/default/new_table"
)
# Drop table
fs.rm("pvfs://my_catalog/default/documents", recursive=True)
Integration with Pandas
import pandas as pd
# Read a Parquet data file from a table (the filesystem argument requires pandas 2.1 or later)
df = pd.read_parquet(
    "pvfs://my_catalog/default/users/bucket-0/data-001.parquet",
    filesystem=fs
)