Environment: Apache Paimon Cloud Storage Credentials
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Cloud_Storage |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Credentials and environment variables required for accessing remote storage systems (S3, OSS, HDFS) and the Alibaba Cloud DLF catalog service.
Description
This environment defines the storage credentials, endpoints, and environment variables needed to connect PyPaimon to remote storage backends. PyPaimon supports three backends, each addressed by its URI scheme: Alibaba Cloud OSS (`oss://`), Amazon S3 (`s3://`, `s3a://`, `s3n://`), and Hadoop HDFS (`hdfs://`, `viewfs://`). Each backend requires specific access keys or environment variables. The Alibaba Cloud DLF (Data Lake Formation) catalog additionally requires its own authentication configuration, with support for multiple signing algorithms and token loading strategies.
Usage
Use this environment when configuring a PyPaimon catalog with a remote warehouse path (OSS, S3, or HDFS) or when using the REST catalog with DLF authentication. Not required for local filesystem catalogs.
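For illustration, a minimal sketch of an OSS-backed catalog, assuming `CatalogFactory.create` accepts a plain options dict; the bucket name, endpoint, and credential values are placeholders:
from pypaimon import CatalogFactory

# Hypothetical bucket and endpoint; the fs.oss.* keys are documented under Credentials below.
catalog = CatalogFactory.create({
    'warehouse': 'oss://my-bucket/warehouse',
    'fs.oss.endpoint': 'oss-cn-hangzhou.aliyuncs.com',
    'fs.oss.accessKeyId': '<access-key-id>',
    'fs.oss.accessKeySecret': '<access-key-secret>',
})
The same pattern applies to S3 and HDFS warehouses; only the scheme and credential keys change.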
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Network | Internet access | Required for cloud storage endpoints |
| OS | Linux (for HDFS) | HDFS requires Hadoop native libraries on Linux |
| Hadoop | Hadoop installation | Required only for HDFS access |
Dependencies
System Packages (HDFS only)
- Hadoop distribution with `bin/hadoop` executable
- Hadoop native libraries (`lib/native/`)
Python Packages
- `pyarrow` (for S3FileSystem and HadoopFileSystem)
- `ossfs` (for PVFS OSS access via fsspec)
Credentials
S3 Storage:
- `fs.s3.accessKeyId`: AWS access key ID
- `fs.s3.accessKeySecret`: AWS secret access key
- `fs.s3.securityToken`: AWS session token (for temporary credentials)
- `fs.s3.endpoint`: S3-compatible endpoint URL
- `fs.s3.region`: AWS region
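A hedged example of these keys as catalog options (all values are placeholders; `fs.s3.securityToken` is only needed for temporary credentials):
# Sketch only: S3 credential options using the keys listed above.
s3_options = {
    'warehouse': 's3://my-bucket/warehouse',         # hypothetical bucket
    'fs.s3.endpoint': 's3.eu-west-1.amazonaws.com',
    'fs.s3.region': 'eu-west-1',
    'fs.s3.accessKeyId': '<access-key-id>',
    'fs.s3.accessKeySecret': '<secret-access-key>',
    # 'fs.s3.securityToken': '<session-token>',      # uncomment for STS credentials
}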
OSS Storage:
- `fs.oss.accessKeyId`: Alibaba Cloud access key ID
- `fs.oss.accessKeySecret`: Alibaba Cloud access key secret
- `fs.oss.securityToken`: STS security token (for temporary credentials)
- `fs.oss.endpoint`: OSS endpoint URL
- `fs.oss.region`: OSS region
DLF Catalog Authentication:
- `dlf.access-key-id`: DLF access key ID
- `dlf.access-key-secret`: DLF access key secret
- `dlf.security-token`: DLF security token
- `dlf.oss-endpoint`: DLF OSS endpoint
- `dlf.region`: DLF service region
- `dlf.signing-algorithm`: Signing algorithm (`default` for VPC, `openapi` for DlfNext)
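A sketch of DLF authentication options for the REST catalog; the `uri` and warehouse values are placeholders, and any companion keys beyond the `dlf.*` options listed above may vary by deployment:
# Sketch only: DLF auth keys from the list above, with placeholder values.
dlf_options = {
    'uri': 'https://dlf.cn-hangzhou.aliyuncs.com',   # hypothetical DLF endpoint
    'warehouse': 'my_catalog',                       # hypothetical catalog name
    'dlf.region': 'cn-hangzhou',
    'dlf.access-key-id': '<access-key-id>',
    'dlf.access-key-secret': '<access-key-secret>',
    # 'dlf.security-token': '<sts-token>',           # temporary credentials only
    # 'dlf.signing-algorithm': 'openapi',            # auto-selected from the host if unset
}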
HDFS Environment Variables:
- `HADOOP_HOME`: Path to Hadoop installation (required, raises RuntimeError if missing)
- `HADOOP_CONF_DIR`: Path to Hadoop configuration directory (required, raises RuntimeError if missing)
- `HADOOP_USER_NAME`: HDFS user name (optional, defaults to `hadoop`)
- `LD_LIBRARY_PATH`: Automatically updated with Hadoop native library path
- `CLASSPATH`: Automatically populated from `hadoop classpath --glob`
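Before opening an `hdfs://` path, the required variables can be exported in the shell (see Quick Install) or set from Python; the paths below are hypothetical:
import os

# Hypothetical install paths; both variables must point at a real Hadoop
# installation or PyPaimon raises RuntimeError at filesystem creation.
os.environ.setdefault('HADOOP_HOME', '/opt/hadoop')
os.environ.setdefault('HADOOP_CONF_DIR', '/opt/hadoop/etc/hadoop')
os.environ.setdefault('HADOOP_USER_NAME', 'hadoop')  # optional; 'hadoop' is the default
# LD_LIBRARY_PATH and CLASSPATH are populated automatically, as shown in Code Evidence.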
REST Catalog:
- `token`: Bearer authentication token
- `uri`: REST catalog server URI
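A minimal sketch of bearer-token options for the REST catalog, using only the two keys documented here (placeholder values):
# Sketch only: static bearer-token authentication for the REST catalog.
rest_options = {
    'uri': 'https://paimon-rest.example.com',  # hypothetical server
    'token': '<bearer-token>',
}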
Quick Install
# For S3 access
pip install pypaimon "pyarrow>=16"
# For OSS access
pip install pypaimon "pyarrow>=16" "ossfs>=2023"
# For HDFS access (requires Hadoop installation)
export HADOOP_HOME=/path/to/hadoop
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
pip install pypaimon "pyarrow>=16"
Code Evidence
S3 credential configuration from `pypaimon/common/options/config.py:32-40`:
class S3Options:
    S3_ACCESS_KEY_ID = ConfigOptions.key("fs.s3.accessKeyId").string_type().no_default_value()
    S3_ACCESS_KEY_SECRET = ConfigOptions.key("fs.s3.accessKeySecret").string_type().no_default_value()
    S3_SECURITY_TOKEN = ConfigOptions.key("fs.s3.securityToken").string_type().no_default_value()
    S3_ENDPOINT = ConfigOptions.key("fs.s3.endpoint").string_type().no_default_value()
    S3_REGION = ConfigOptions.key("fs.s3.region").string_type().no_default_value()
HDFS environment checks from `pypaimon/filesystem/pyarrow_file_io.py:155-170`:
if 'HADOOP_HOME' not in os.environ:
    raise RuntimeError("HADOOP_HOME environment variable is not set.")
if 'HADOOP_CONF_DIR' not in os.environ:
    raise RuntimeError("HADOOP_CONF_DIR environment variable is not set.")
hadoop_home = os.environ.get("HADOOP_HOME")
native_lib_path = f"{hadoop_home}/lib/native"
os.environ['LD_LIBRARY_PATH'] = f"{native_lib_path}:{os.environ.get('LD_LIBRARY_PATH', '')}"
class_paths = subprocess.run(
    [f'{hadoop_home}/bin/hadoop', 'classpath', '--glob'],
    capture_output=True, text=True, check=True
)
os.environ['CLASSPATH'] = class_paths.stdout.strip()
OSS validation from `pypaimon/filesystem/pvfs.py:859-870`:
def _get_oss_filesystem(options: Options) -> AbstractFileSystem:
    access_key_id = options.get(OssOptions.OSS_ACCESS_KEY_ID)
    if access_key_id is None:
        raise ValueError("OSS access key id is not found in the options.")
    access_key_secret = options.get(OssOptions.OSS_ACCESS_KEY_SECRET)
    if access_key_secret is None:
        raise ValueError("OSS access key secret is not found in the options.")
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: HADOOP_HOME environment variable is not set.` | HDFS access without Hadoop | Set `HADOOP_HOME` to Hadoop installation path |
| `RuntimeError: HADOOP_CONF_DIR environment variable is not set.` | HDFS access without config | Set `HADOOP_CONF_DIR` to Hadoop config directory |
| `ValueError: OSS access key id is not found in the options.` | Missing OSS credentials | Pass `fs.oss.accessKeyId` in catalog options |
| `ValueError: Unrecognized filesystem type in URI: xxx` | Unsupported storage scheme | Use `oss://`, `s3://`, `s3a://`, `s3n://`, `hdfs://`, or `viewfs://` |
Compatibility Notes
- PyArrow < 7.0: OSS requires bucket-prefixed endpoint URL (e.g., `mybucket.oss-cn-hangzhou.aliyuncs.com`). PyArrow 7+ uses virtual addressing with separate bucket parameter.
- PyArrow < 8.0: S3 retry strategy (`AwsStandardS3RetryStrategy`) is not available. No automatic retry on transient S3 failures.
- PyArrow 8.0+: S3 retries use the standard strategy with defaults: max_attempts=10, request_timeout=60s, connect_timeout=60s (see the sketch after this list).
- DLF Signing: The `dlf.signing-algorithm` option supports `default` (for VPC endpoints) and `openapi` (for DlfNext/2026-01-18 API). If not set, it is automatically selected based on the endpoint host.
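For illustration only, a version-gated construction of PyArrow's `S3FileSystem` that mirrors the retry defaults above; the credential values are placeholders and this is a sketch, not PyPaimon's actual code path:
import pyarrow
from pyarrow import fs

# Gate on PyArrow >= 8, where AwsStandardS3RetryStrategy is available per the note above.
major = int(pyarrow.__version__.split('.')[0])
kwargs = dict(
    access_key='<access-key-id>',      # placeholder
    secret_key='<secret-access-key>',  # placeholder
    region='eu-west-1',                # hypothetical region
    request_timeout=60,
    connect_timeout=60,
)
if major >= 8:
    kwargs['retry_strategy'] = fs.AwsStandardS3RetryStrategy(max_attempts=10)
s3 = fs.S3FileSystem(**kwargs)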