Environment:Apache Paimon Optional Extensions
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Computing, Vector_Search |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Optional extension packages (Ray, FAISS, PyTorch, Lance, DuckDB) that enable distributed processing, vector search, ML integration, and SQL query capabilities in PyPaimon.
Description
This environment defines the optional dependencies that extend PyPaimon beyond basic table read/write. Ray enables distributed dataset reading and writing. FAISS provides vector similarity search via global indexes. PyTorch enables reading Paimon tables as `Dataset` or `IterableDataset`. Lance provides an alternative columnar format with predicate pushdown support. DuckDB enables SQL queries over Paimon table data. Each extension is imported dynamically at runtime and is only required when its specific feature is used.
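Because each extension is only needed when its feature is used, it can help to check up front which ones are present. The following is a minimal sketch using only the standard library; the helper name `available_extensions` is hypothetical and not part of PyPaimon. Note that the `pylance` distribution is imported as `lance`, and `faiss-cpu` as `faiss`.

```python
import importlib.util

# Import names of the optional extensions listed on this page.
# `pylance` is imported as `lance`; `faiss-cpu` is imported as `faiss`.
OPTIONAL_MODULES = ["ray", "faiss", "torch", "lance", "duckdb"]


def available_extensions(modules=OPTIONAL_MODULES):
    """Hypothetical helper: map each top-level module name to whether it can be found.

    find_spec() locates a top-level module without importing it, so heavy
    packages such as torch or ray are not loaded by this check.
    """
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, present in available_extensions().items():
        print(f"{name}: {'installed' if present else 'missing'}")
```

Running the script prints one line per extension, which is a quick way to confirm an environment before exercising a Ray, FAISS, PyTorch, Lance, or DuckDB code path.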
Usage
Use this environment for distributed processing (Ray), vector similarity search (FAISS), ML training pipelines (PyTorch), Lance format tables (Lance), or SQL analytics (DuckDB). Install only the extensions you need.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Python | >= 3.7 for Ray, >= 3.6 for FAISS | Ray does not support Python 3.6 |
| Python | >= 3.9 for pylance 0.20+ | Lance format requires pylance |
| Hardware | CPU sufficient for most | FAISS-GPU optional for faster vector search |
Dependencies
Ray (Distributed Computing)
- `ray` >= 2.10, < 3 (Python 3.7+)
- Ray 2.48.0+: Schema moved from BlockMetadata to ReadTask (API change)
- Ray 2.52.0+: `per_task_row_limit` parameter introduced
FAISS (Vector Search)
- `faiss-cpu` == 1.7.2 (Python 3.6)
- `faiss-cpu` == 1.7.4 (Python 3.7 - 3.11)
- `faiss-cpu` >= 1.10, < 2 (Python 3.12+)
PyTorch (ML Integration)
- `torch` (no version constraint)
Lance (Columnar Format)
- `pylance` >= 0.20, < 1 (Python 3.9+)
- `pylance` >= 0.10, < 1 (Python 3.8)
- `lance` (the module provided by `pylance`; imported dynamically at runtime)
DuckDB (SQL Analytics)
- `duckdb` (test dependency; version 1.3.2 tested)
Credentials
No additional credentials required. Storage credentials are handled by Environment:Apache_Paimon_Cloud_Storage_Credentials.
Quick Install
# Install Ray extension
pip install "pypaimon[ray]"
# Install FAISS extension
pip install "pypaimon[faiss]"
# Install PyTorch extension
pip install "pypaimon[torch]"
# Install all optional extensions
pip install "pypaimon[all]"
# Install Lance format support
pip install "pylance>=0.20"
# Install DuckDB for SQL queries
pip install duckdb
Code Evidence
Ray version compatibility constants from `pypaimon/read/datasource/ray_datasource.py:38-40`:
RAY_VERSION_SCHEMA_IN_READ_TASK = "2.48.0" # Schema moved from BlockMetadata to ReadTask
RAY_VERSION_PER_TASK_ROW_LIMIT = "2.52.0" # per_task_row_limit parameter introduced
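One way such version constants can gate behavior is sketched below. Only the two constant values come from the excerpt above; `parse_version` and `ray_features` are hypothetical helpers written for illustration, and the actual comparison logic in `ray_datasource.py` may differ.

```python
RAY_VERSION_SCHEMA_IN_READ_TASK = "2.48.0"
RAY_VERSION_PER_TASK_ROW_LIMIT = "2.52.0"


def parse_version(text):
    """Hypothetical helper: turn '2.48.0' into (2, 48, 0).

    Only the leading digits of each of the first three components are used,
    so a suffix such as 'rc1' is ignored.
    """
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def ray_features(installed_version):
    """Hypothetical helper: report which version-dependent behaviors apply."""
    v = parse_version(installed_version)
    return {
        "schema_in_read_task": v >= parse_version(RAY_VERSION_SCHEMA_IN_READ_TASK),
        "per_task_row_limit": v >= parse_version(RAY_VERSION_PER_TASK_ROW_LIMIT),
    }
```

In practice the installed version string would come from `ray.__version__`; comparing parsed tuples rather than raw strings avoids the pitfall that `"2.9.0" > "2.48.0"` as plain text.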
Ray parallelism auto-reduction from `pypaimon/read/datasource/ray_datasource.py:116-118`:
if parallelism > len(self.splits):
    parallelism = len(self.splits)
    logger.warning(f"Reducing the parallelism to {parallelism}, as that is the number of splits")
FAISS version pinning from `setup.py:60-64`:
# faiss-cpu: optional for vector ANN index. 1.7.x has no wheel for 3.12+; 3.12+ use 1.10+.
'faiss-cpu==1.7.2; python_version >= "3.6" and python_version < "3.7"',
'faiss-cpu==1.7.4; python_version >= "3.7" and python_version < "3.12"',
'faiss-cpu>=1.10,<2; python_version >= "3.12"',
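To see which of these pins applies to a given interpreter, the environment markers can be mirrored in plain Python. This is an illustrative sketch only: `faiss_requirement` is a hypothetical function, and pip evaluates the real markers itself during installation.

```python
import sys


def faiss_requirement(version_info=None):
    """Hypothetical helper mirroring the environment markers in setup.py.

    Returns the faiss-cpu requirement string for a (major, minor) Python
    version, or None when no marker matches (Python older than 3.6).
    """
    if version_info is None:
        version_info = sys.version_info
    py = tuple(version_info[:2])
    if py < (3, 6):
        return None
    if py < (3, 7):
        return "faiss-cpu==1.7.2"
    if py < (3, 12):
        return "faiss-cpu==1.7.4"
    return "faiss-cpu>=1.10,<2"


if __name__ == "__main__":
    print(faiss_requirement())
```

Calling `faiss_requirement((3, 11))` selects the 1.7.4 pin, while `faiss_requirement((3, 12))` selects the `>=1.10,<2` range, matching the comment that the 1.7.x series has no wheel for 3.12+.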
Dynamic Lance import from `pypaimon/read/reader/format_lance_reader.py:38`:
import lance
Dynamic DuckDB import from `pypaimon/read/table_read.py:125`:
import duckdb
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ModuleNotFoundError: No module named 'ray'` | Ray not installed | `pip install "pypaimon[ray]"` |
| `ModuleNotFoundError: No module named 'faiss'` | FAISS not installed | `pip install "pypaimon[faiss]"` |
| `ModuleNotFoundError: No module named 'torch'` | PyTorch not installed | `pip install "pypaimon[torch]"` |
| `ModuleNotFoundError: No module named 'lance'` | Lance not installed | `pip install "pylance>=0.20"` |
| `ModuleNotFoundError: No module named 'duckdb'` | DuckDB not installed | `pip install duckdb` |
| Ray parallelism warning | Requested parallelism exceeds the number of splits | Parallelism is auto-reduced to the split count; not an error |
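The table above can also be applied programmatically by wrapping the dynamic import so that a missing extension produces the matching install command. This is a sketch under stated assumptions: `require` and `INSTALL_HINTS` are hypothetical names and not part of PyPaimon, which raises the plain `ModuleNotFoundError` shown in the table.

```python
import importlib

# Install commands taken from the Common Errors table, keyed by import name.
INSTALL_HINTS = {
    "ray": 'pip install "pypaimon[ray]"',
    "faiss": 'pip install "pypaimon[faiss]"',
    "torch": 'pip install "pypaimon[torch]"',
    "lance": 'pip install "pylance>=0.20"',
    "duckdb": "pip install duckdb",
}


def require(module_name):
    """Hypothetical helper: import a module, adding an install hint on failure."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        hint = INSTALL_HINTS.get(module_name)
        # Re-raise unchanged if there is no hint, or if the missing module is
        # a transitive dependency rather than the requested module itself.
        if hint is None or exc.name != module_name:
            raise
        raise ModuleNotFoundError(
            f"No module named {module_name!r}. Install it with: {hint}",
            name=module_name,
        ) from exc
```

A call such as `duckdb = require("duckdb")` placed inside the function that needs it keeps the import lazy while making the failure message actionable.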
Compatibility Notes
- Ray on Python 3.6: Not supported. Ray requires Python 3.7+.
- Ray 2.48+: Breaking API change. The schema moved from `BlockMetadata` to `ReadTask`. PyPaimon handles this transparently.
- Ray 2.52+: New `per_task_row_limit` parameter available for controlling task granularity.
- FAISS on Python 3.12+: The 1.7.x series has no binary wheels for Python 3.12+. Must use `faiss-cpu` >= 1.10.
- Lance format: Requires `pylance` >= 0.10 on Python 3.8, >= 0.20 on Python 3.9+. The `lance` package is imported dynamically only when reading Lance-format files.
- DuckDB: Used via `TableRead.to_duckdb()` method. Imported dynamically only when this method is called.
Related Pages
- Implementation:Apache_Paimon_Ray_Init
- Implementation:Apache_Paimon_CatalogFactory_Create_for_Ray
- Implementation:Apache_Paimon_TableRead_To_Ray
- Implementation:Apache_Paimon_Ray_Dataset_Operations
- Implementation:Apache_Paimon_Ray_Dataset_To_Pandas
- Implementation:Apache_Paimon_Ray_Data_Read_Json
- Implementation:Apache_Paimon_BatchTableWrite_Write_Ray
- Implementation:Apache_Paimon_TableRead_To_Ray_Lance
- Implementation:Apache_Paimon_FaissVectorIndexOptions_Configuration
- Implementation:Apache_Paimon_GlobalIndexScanBuilder_Build
- Implementation:Apache_Paimon_VectorSearch_Construction
- Implementation:Apache_Paimon_GlobalIndexEvaluator_Evaluate
- Implementation:Apache_Paimon_IndexedSplit_Result_Retrieval
- Implementation:Apache_Paimon_FormatLanceReader_Predicate_Pushdown