Environment:Apache Paimon Optional Extensions

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Distributed_Computing, Vector_Search
Last Updated: 2026-02-08 00:00 GMT

Overview

Optional extension packages (Ray, FAISS, PyTorch, Lance, DuckDB) that enable distributed processing, vector search, ML integration, and SQL query capabilities in PyPaimon.

Description

This environment defines the optional dependencies that extend PyPaimon beyond basic table read/write. Ray enables distributed dataset reading and writing. FAISS provides vector similarity search via global indexes. PyTorch enables reading Paimon tables as `Dataset` or `IterableDataset`. Lance provides an alternative columnar format with predicate pushdown support. DuckDB enables SQL queries over Paimon table data. Each extension is imported dynamically at runtime and is only required when its specific feature is used.
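
Because each extension is optional, it can help to check which ones are present before enabling a feature. The following is a minimal sketch rather than anything PyPaimon ships: the `OPTIONAL_MODULES` mapping and the `available_extensions` helper are hypothetical names, and the check relies only on the standard library's `importlib.util.find_spec`, which locates a module without importing it.

```python
import importlib.util

# Hypothetical mapping of feature name -> Python import name.
# Note that the import name can differ from the pip package name
# (faiss-cpu installs "faiss", pylance installs "lance").
OPTIONAL_MODULES = {
    "ray": "ray",
    "faiss": "faiss",
    "torch": "torch",
    "lance": "lance",
    "duckdb": "duckdb",
}


def available_extensions(modules=OPTIONAL_MODULES):
    """Return {feature: bool} by locating each module without importing it."""
    return {
        feature: importlib.util.find_spec(module) is not None
        for feature, module in modules.items()
    }


if __name__ == "__main__":
    for feature, present in available_extensions().items():
        print(f"{feature}: {'installed' if present else 'missing'}")
```

Locating a module instead of importing it keeps the check cheap, which matters for heavy packages such as `torch` and `ray`.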

Usage

Use this environment when a workload needs distributed processing (Ray), vector similarity search (FAISS), ML training pipelines (PyTorch), Lance-format tables (Lance), or SQL analytics (DuckDB). Install only the extensions you need.

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| Python | >= 3.7 for Ray, >= 3.6 for FAISS | Ray does not support Python 3.6 |
| Python | >= 3.9 for pylance 0.20+ | Lance format requires pylance |
| Hardware | CPU sufficient for most uses | FAISS-GPU optional for faster vector search |

Dependencies

Ray (Distributed Computing)

  • `ray` >= 2.10, < 3 (Python 3.7+)
    • Ray 2.48.0+: Schema moved from BlockMetadata to ReadTask (API change)
    • Ray 2.52.0+: `per_task_row_limit` parameter introduced

FAISS (Vector Search)

  • `faiss-cpu` == 1.7.2 (Python 3.6)
  • `faiss-cpu` == 1.7.4 (Python 3.7 - 3.11)
  • `faiss-cpu` >= 1.10, < 2 (Python 3.12+)
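
The version pins above can be read as a simple lookup keyed on the interpreter version. The sketch below mirrors those three rules; `faiss_requirement` is a hypothetical helper written for illustration, not part of PyPaimon, and in practice pip resolves the pins from the environment markers in `setup.py`.

```python
import sys


def faiss_requirement(version_info=sys.version_info):
    """Return the faiss-cpu requirement string for a (major, minor) version.

    Mirrors the pins listed above; returns None below Python 3.6,
    where no pin is defined.
    """
    major_minor = tuple(version_info[:2])
    if major_minor < (3, 6):
        return None
    if major_minor < (3, 7):
        return "faiss-cpu==1.7.2"
    if major_minor < (3, 12):
        return "faiss-cpu==1.7.4"
    # The 1.7.x series has no wheels for Python 3.12+.
    return "faiss-cpu>=1.10,<2"


if __name__ == "__main__":
    print(faiss_requirement())
```

For example, `faiss_requirement((3, 12))` returns `"faiss-cpu>=1.10,<2"`.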

PyTorch (ML Integration)

  • `torch` (no version constraint)

Lance (Columnar Format)

  • `pylance` >= 0.20, < 1 (Python 3.9+)
  • `pylance` >= 0.10, < 1 (Python 3.8)
  • `lance` (the import name provided by `pylance`; imported dynamically at runtime)

DuckDB (SQL Analytics)

  • `duckdb` (test dependency; version 1.3.2 tested)

Credentials

No additional credentials required. Storage credentials are handled by Environment:Apache_Paimon_Cloud_Storage_Credentials.

Quick Install

# Install Ray extension
pip install "pypaimon[ray]"

# Install FAISS extension
pip install "pypaimon[faiss]"

# Install PyTorch extension
pip install "pypaimon[torch]"

# Install all optional extensions
pip install "pypaimon[all]"

# Install Lance format support
pip install "pylance>=0.20"

# Install DuckDB for SQL queries
pip install duckdb

Code Evidence

Ray version compatibility constants from `pypaimon/read/datasource/ray_datasource.py:38-40`:

RAY_VERSION_SCHEMA_IN_READ_TASK = "2.48.0"  # Schema moved from BlockMetadata to ReadTask
RAY_VERSION_PER_TASK_ROW_LIMIT = "2.52.0"   # per_task_row_limit parameter introduced

Ray parallelism auto-reduction from `pypaimon/read/datasource/ray_datasource.py:116-118`:

if parallelism > len(self.splits):
    parallelism = len(self.splits)
    logger.warning(f"Reducing the parallelism to {parallelism}, as that is the number of splits")

FAISS version pinning from `setup.py:60-64`:

# faiss-cpu: optional for vector ANN index. 1.7.x has no wheel for 3.12+; 3.12+ use 1.10+.
'faiss-cpu==1.7.2; python_version >= "3.6" and python_version < "3.7"',
'faiss-cpu==1.7.4; python_version >= "3.7" and python_version < "3.12"',
'faiss-cpu>=1.10,<2; python_version >= "3.12"',

Dynamic Lance import from `pypaimon/read/reader/format_lance_reader.py:38`:

import lance

Dynamic DuckDB import from `pypaimon/read/table_read.py:125`:

import duckdb

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `ModuleNotFoundError: No module named 'ray'` | Ray not installed | `pip install "pypaimon[ray]"` |
| `ModuleNotFoundError: No module named 'faiss'` | FAISS not installed | `pip install "pypaimon[faiss]"` |
| `ModuleNotFoundError: No module named 'torch'` | PyTorch not installed | `pip install "pypaimon[torch]"` |
| `ModuleNotFoundError: No module named 'lance'` | Lance not installed | `pip install "pylance>=0.20"` |
| `ModuleNotFoundError: No module named 'duckdb'` | DuckDB not installed | `pip install duckdb` |
| Ray parallelism warning | More Ray tasks requested than splits | Parallelism is auto-reduced; not an error |
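
Application code can turn these `ModuleNotFoundError` cases into actionable messages by wrapping the import. This is a sketch under stated assumptions: `require_extension` and `INSTALL_HINTS` are hypothetical names that are not part of PyPaimon, and the hints simply restate the install commands from this page.

```python
import importlib

# Hypothetical lookup: import name -> install command from this page.
INSTALL_HINTS = {
    "ray": 'pip install "pypaimon[ray]"',
    "faiss": 'pip install "pypaimon[faiss]"',
    "torch": 'pip install "pypaimon[torch]"',
    "lance": 'pip install "pylance>=0.20"',
    "duckdb": "pip install duckdb",
}


def require_extension(module_name, hints=INSTALL_HINTS):
    """Import an optional module, or re-raise with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        hint = hints.get(module_name, f"pip install {module_name}")
        raise ModuleNotFoundError(
            f"{exc}. Install it with: {hint}", name=module_name
        ) from exc
```

Calling `require_extension("duckdb")` at the point of use, rather than at module import time, keeps the dependency optional for callers that never touch the feature.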

Compatibility Notes

  • Ray on Python 3.6: Not supported. Ray requires Python 3.7+.
  • Ray 2.48+: Breaking API change. The schema moved from `BlockMetadata` to `ReadTask`. PyPaimon handles this transparently.
  • Ray 2.52+: New `per_task_row_limit` parameter available for controlling task granularity.
  • FAISS on Python 3.12+: The 1.7.x series has no binary wheels for Python 3.12+, so `faiss-cpu` >= 1.10 is required.
  • Lance format: Requires `pylance` >= 0.10 on Python 3.8, >= 0.20 on Python 3.9+. The `lance` package is imported dynamically only when reading Lance-format files.
  • DuckDB: Used via the `TableRead.to_duckdb()` method and imported dynamically only when that method is called.
