
Environment:Apache Paimon Cloud Storage Credentials

From Leeroopedia


Knowledge Sources
Domains Infrastructure, Cloud_Storage
Last Updated 2026-02-08 00:00 GMT

Overview

Credentials and environment variables required for accessing remote storage systems (S3, OSS, HDFS) and the Alibaba Cloud DLF catalog service.

Description

This environment defines the storage credentials, endpoints, and environment variables needed to connect PyPaimon to remote storage backends. PyPaimon supports three storage schemes: Alibaba Cloud OSS (`oss://`), Amazon S3 (`s3://`, `s3a://`, `s3n://`), and Hadoop HDFS (`hdfs://`, `viewfs://`). Each scheme requires specific access keys or environment variables. The Alibaba Cloud DLF (Data Lake Formation) catalog additionally requires its own authentication configuration with support for multiple signing algorithms and token loading strategies.

Usage

Use this environment when configuring a PyPaimon catalog with a remote warehouse path (OSS, S3, or HDFS) or when using the REST catalog with DLF authentication. Not required for local filesystem catalogs.
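
As an illustration of the rule above, the helper below (not part of the PyPaimon API; names are invented for this sketch) maps a warehouse URI's scheme to the credential option keys this page lists for it, returning an empty list for local and HDFS warehouses:

```python
from urllib.parse import urlparse

# Illustrative helper (NOT part of the PyPaimon API): map a warehouse URI's
# scheme to the credential option keys this page lists for it.
_REQUIRED_KEYS = {
    "oss": ["fs.oss.accessKeyId", "fs.oss.accessKeySecret", "fs.oss.endpoint"],
    "s3": ["fs.s3.accessKeyId", "fs.s3.accessKeySecret", "fs.s3.endpoint"],
    "s3a": ["fs.s3.accessKeyId", "fs.s3.accessKeySecret", "fs.s3.endpoint"],
    "s3n": ["fs.s3.accessKeyId", "fs.s3.accessKeySecret", "fs.s3.endpoint"],
    "hdfs": [],    # credentials come from HADOOP_* environment variables
    "viewfs": [],
}

def required_credentials(warehouse_uri: str) -> list:
    scheme = urlparse(warehouse_uri).scheme
    if scheme in ("", "file"):
        return []  # local filesystem catalogs need no credentials
    if scheme not in _REQUIRED_KEYS:
        raise ValueError(f"Unrecognized filesystem type in URI: {scheme}")
    return _REQUIRED_KEYS[scheme]
```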

System Requirements

Category   Requirement           Notes
Network    Internet access       Required for cloud storage endpoints
OS         Linux (for HDFS)      HDFS requires Hadoop native libraries on Linux
Hadoop     Hadoop installation   Required only for HDFS access

Dependencies

System Packages (HDFS only)

  • Hadoop distribution with `bin/hadoop` executable
  • Hadoop native libraries (`lib/native/`)

Python Packages

  • `pyarrow` (for S3FileSystem and HadoopFileSystem)
  • `ossfs` (for PVFS OSS access via fsspec)

Credentials

S3 Storage:

  • `fs.s3.accessKeyId`: AWS access key ID
  • `fs.s3.accessKeySecret`: AWS secret access key
  • `fs.s3.securityToken`: AWS session token (for temporary credentials)
  • `fs.s3.endpoint`: S3-compatible endpoint URL
  • `fs.s3.region`: AWS region
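
A hedged sketch of how these keys are typically passed as catalog options. The key names come from `S3Options` in `pypaimon/common/options/config.py`; the bucket, endpoint, and credential values are placeholders, and the factory entry point (commented out) may differ by PyPaimon version:

```python
# Hedged sketch: catalog options for an S3 warehouse. Key names are from
# S3Options; all values below are placeholders.
s3_options = {
    "warehouse": "s3://my-bucket/warehouse",          # hypothetical bucket
    "fs.s3.accessKeyId": "AKIA...",                   # AWS access key ID
    "fs.s3.accessKeySecret": "secret",                # AWS secret access key
    "fs.s3.endpoint": "s3.us-east-1.amazonaws.com",
    "fs.s3.region": "us-east-1",
    # "fs.s3.securityToken": "...",  # only for temporary (STS) credentials
}
# from pypaimon import CatalogFactory   # entry point may vary by version
# catalog = CatalogFactory.create(s3_options)
```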

OSS Storage:

  • `fs.oss.accessKeyId`: Alibaba Cloud access key ID
  • `fs.oss.accessKeySecret`: Alibaba Cloud access key secret
  • `fs.oss.securityToken`: STS security token (for temporary credentials)
  • `fs.oss.endpoint`: OSS endpoint URL
  • `fs.oss.region`: OSS region
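
For PVFS access, these PyPaimon option keys are translated into the keyword arguments the `ossfs` fsspec backend takes. The sketch below shows one plausible mapping; the `ossfs` parameter names (`key`/`secret`/`token`/`endpoint`) should be checked against your installed `ossfs` version, and the construction itself is left commented out:

```python
# Hedged sketch: mapping PyPaimon OSS option keys to ossfs-style keyword
# arguments. Values are placeholders; verify parameter names against the
# ossfs documentation for your version.
paimon_opts = {
    "fs.oss.accessKeyId": "LTAI...",
    "fs.oss.accessKeySecret": "secret",
    "fs.oss.endpoint": "oss-cn-hangzhou.aliyuncs.com",
}
ossfs_kwargs = {
    "key": paimon_opts["fs.oss.accessKeyId"],
    "secret": paimon_opts["fs.oss.accessKeySecret"],
    "endpoint": paimon_opts["fs.oss.endpoint"],
    # "token": paimon_opts.get("fs.oss.securityToken"),  # STS only
}
# import ossfs
# fs = ossfs.OSSFileSystem(**ossfs_kwargs)
```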

DLF Catalog Authentication:

  • `dlf.access-key-id`: DLF access key ID
  • `dlf.access-key-secret`: DLF access key secret
  • `dlf.security-token`: DLF security token
  • `dlf.oss-endpoint`: DLF OSS endpoint
  • `dlf.region`: DLF service region
  • `dlf.signing-algorithm`: Signing algorithm (`default` for VPC, `openapi` for DlfNext)

HDFS Environment Variables:

  • `HADOOP_HOME`: Path to Hadoop installation (required, raises RuntimeError if missing)
  • `HADOOP_CONF_DIR`: Path to Hadoop configuration directory (required, raises RuntimeError if missing)
  • `HADOOP_USER_NAME`: HDFS user name (optional, defaults to `hadoop`)
  • `LD_LIBRARY_PATH`: Automatically updated with Hadoop native library path
  • `CLASSPATH`: Automatically populated from `hadoop classpath --glob`
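
The environment handling above can be mirrored in a small, self-contained function. This is a sketch of the checks shown in the Code Evidence section below, operating on a plain dict instead of `os.environ`; the real code additionally shells out to `hadoop classpath --glob` to populate `CLASSPATH`, which is omitted here:

```python
def prepare_hadoop_env(env: dict) -> dict:
    """Sketch of the HDFS environment checks: fail fast on missing
    variables, then prepend the native-library path to LD_LIBRARY_PATH.
    (The real code also runs `hadoop classpath --glob` to fill CLASSPATH,
    omitted so this sketch stays self-contained.)"""
    if "HADOOP_HOME" not in env:
        raise RuntimeError("HADOOP_HOME environment variable is not set.")
    if "HADOOP_CONF_DIR" not in env:
        raise RuntimeError("HADOOP_CONF_DIR environment variable is not set.")
    native = f"{env['HADOOP_HOME']}/lib/native"
    env["LD_LIBRARY_PATH"] = f"{native}:{env.get('LD_LIBRARY_PATH', '')}"
    env.setdefault("HADOOP_USER_NAME", "hadoop")  # optional; defaults to hadoop
    return env
```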

REST Catalog:

  • `token`: Bearer authentication token
  • `uri`: REST catalog server URI
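
A hedged sketch of REST catalog options combining the `uri`/`token` keys with DLF authentication. All values are placeholders, and the factory entry point that would consume this dict may differ by PyPaimon version:

```python
# Hedged sketch: options for a REST catalog with DLF authentication.
# Key names follow the lists above; all values are placeholders.
rest_options = {
    "uri": "https://dlf.cn-hangzhou.aliyuncs.com",   # hypothetical endpoint
    "token": "my-bearer-token",
    "dlf.access-key-id": "LTAI...",
    "dlf.access-key-secret": "secret",
    "dlf.region": "cn-hangzhou",
    # "dlf.signing-algorithm": "openapi",  # auto-selected from endpoint if unset
}
```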

Quick Install

# For S3 access (quote version specifiers so the shell does not treat >= as redirection)
pip install pypaimon "pyarrow>=16"

# For OSS access
pip install pypaimon "pyarrow>=16" "ossfs>=2023"

# For HDFS access (requires Hadoop installation)
export HADOOP_HOME=/path/to/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
pip install pypaimon "pyarrow>=16"

Code Evidence

S3 credential configuration from `pypaimon/common/options/config.py:32-40`:

class S3Options:
    S3_ACCESS_KEY_ID = ConfigOptions.key("fs.s3.accessKeyId").string_type().no_default_value()
    S3_ACCESS_KEY_SECRET = ConfigOptions.key("fs.s3.accessKeySecret").string_type().no_default_value()
    S3_SECURITY_TOKEN = ConfigOptions.key("fs.s3.securityToken").string_type().no_default_value()
    S3_ENDPOINT = ConfigOptions.key("fs.s3.endpoint").string_type().no_default_value()
    S3_REGION = ConfigOptions.key("fs.s3.region").string_type().no_default_value()

HDFS environment checks from `pypaimon/filesystem/pyarrow_file_io.py:155-170`:

if 'HADOOP_HOME' not in os.environ:
    raise RuntimeError("HADOOP_HOME environment variable is not set.")
if 'HADOOP_CONF_DIR' not in os.environ:
    raise RuntimeError("HADOOP_CONF_DIR environment variable is not set.")

hadoop_home = os.environ.get("HADOOP_HOME")
native_lib_path = f"{hadoop_home}/lib/native"
os.environ['LD_LIBRARY_PATH'] = f"{native_lib_path}:{os.environ.get('LD_LIBRARY_PATH', '')}"

class_paths = subprocess.run(
    [f'{hadoop_home}/bin/hadoop', 'classpath', '--glob'],
    capture_output=True, text=True, check=True
)
os.environ['CLASSPATH'] = class_paths.stdout.strip()

OSS validation from `pypaimon/filesystem/pvfs.py:859-870`:

def _get_oss_filesystem(options: Options) -> AbstractFileSystem:
    access_key_id = options.get(OssOptions.OSS_ACCESS_KEY_ID)
    if access_key_id is None:
        raise ValueError("OSS access key id is not found in the options.")
    access_key_secret = options.get(OssOptions.OSS_ACCESS_KEY_SECRET)
    if access_key_secret is None:
        raise ValueError("OSS access key secret is not found in the options.")

Common Errors

  • `RuntimeError: HADOOP_HOME environment variable is not set.`
    Cause: HDFS access without a Hadoop installation. Solution: set `HADOOP_HOME` to the Hadoop installation path.
  • `RuntimeError: HADOOP_CONF_DIR environment variable is not set.`
    Cause: HDFS access without a configuration directory. Solution: set `HADOOP_CONF_DIR` to the Hadoop configuration directory.
  • `ValueError: OSS access key id is not found in the options.`
    Cause: missing OSS credentials. Solution: pass `fs.oss.accessKeyId` in the catalog options.
  • `ValueError: Unrecognized filesystem type in URI: xxx`
    Cause: unsupported storage scheme. Solution: use `oss://`, `s3://`, `s3a://`, `s3n://`, `hdfs://`, or `viewfs://`.

Compatibility Notes

  • PyArrow < 7.0: OSS requires bucket-prefixed endpoint URL (e.g., `mybucket.oss-cn-hangzhou.aliyuncs.com`). PyArrow 7+ uses virtual addressing with separate bucket parameter.
  • PyArrow < 8.0: S3 retry strategy (`AwsStandardS3RetryStrategy`) is not available. No automatic retry on transient S3 failures.
  • PyArrow 8.0+: S3 retries use standard strategy with defaults: max_attempts=10, request_timeout=60s, connect_timeout=60s.
  • DLF Signing: The `dlf.signing-algorithm` option supports `default` (for VPC endpoints) and `openapi` (for DlfNext/2026-01-18 API). If not set, it is automatically selected based on the endpoint host.
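
The PyArrow version gate for the S3 retry strategy can be expressed as a small check. The version-parsing helper below is illustrative (not PyPaimon code); the `retry_strategy` parameter and `AwsStandardS3RetryStrategy` class are real `pyarrow.fs` features from 8.0 onward:

```python
# Hedged sketch: decide whether the installed PyArrow supports the S3
# retry strategy described above. The helper is illustrative; in
# pyarrow >= 8 one can pass
#   retry_strategy=pyarrow.fs.AwsStandardS3RetryStrategy(max_attempts=10)
# when constructing pyarrow.fs.S3FileSystem.
def supports_s3_retry(pyarrow_version: str) -> bool:
    major = int(pyarrow_version.split(".")[0])
    return major >= 8
```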
