Environment:Eventual Inc Daft Cloud Storage Credentials

From Leeroopedia


Knowledge Sources
Domains Cloud Storage, Authentication, AWS S3, Azure Blob, GCS, HuggingFace
Last Updated 2026-02-08 15:30 GMT

Overview

The Cloud_Storage_Credentials environment defines the credential chains, environment variables, and optional dependencies that Daft needs to authenticate with cloud storage providers (AWS S3, Azure Blob Storage, Google Cloud Storage), with the HuggingFace data platform, and with AI providers (OpenAI, OpenRouter).

Description

Daft integrates with multiple cloud storage backends through a combination of Rust-native I/O clients (for S3 and Azure) and Python filesystem abstractions via fsspec and PyArrow. Each provider has its own credential chain with specific environment variables, configuration objects, and fallback behaviors.

The credential resolution follows a layered approach for each provider:

  • AWS S3 -- Uses the standard boto3/botocore credential chain. If no credentials are found, Daft falls back to anonymous access and logs a warning. This allows seamless access to public S3 buckets without configuration.
  • Azure Blob Storage -- Resolves credentials through explicit configuration, then environment variables, then DefaultAzureCredential, and finally falls back to anonymous access.
  • Google Cloud Storage -- Leverages PyArrow's GcsFileSystem and the standard GOOGLE_APPLICATION_CREDENTIALS service account mechanism.
  • HuggingFace -- Uses the HF_TOKEN environment variable for authenticated access to private datasets and models.
  • AI Providers -- API keys for OpenAI (OPENAI_API_KEY) and OpenRouter (OPENROUTER_API_KEY) are resolved from environment variables.
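The layered lookup shared by these providers can be sketched as a lazy fallback chain: try each source in order and stop at the first hit. The helper below is purely illustrative (neither `resolve_credential` nor `EXAMPLE_STORAGE_KEY` is part of Daft's API):

```python
import os
from typing import Callable, Optional

def resolve_credential(*providers: Callable[[], Optional[str]]) -> Optional[str]:
    """Return the first non-None value from a chain of lazy credential providers."""
    for provider in providers:
        value = provider()
        if value is not None:
            return value
    return None  # caller would fall back to anonymous access

# Example: an unset explicit config falls through to an environment variable.
explicit = {"access_key": None}
os.environ["EXAMPLE_STORAGE_KEY"] = "key-from-env"

key = resolve_credential(
    lambda: explicit["access_key"],                 # 1. explicit configuration
    lambda: os.environ.get("EXAMPLE_STORAGE_KEY"),  # 2. environment variable
)
print(key)  # key-from-env
```

Making each step a lazy callable matters for sources with side effects (e.g. probing a metadata server), which should only be touched when everything earlier in the chain came up empty.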

Additionally, Daft's Rust-based S3 client supports custom retry error message patterns via DAFT_S3_RETRY_ERROR_MSGS, allowing users to add application-specific retry logic for transient errors.
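The Rust client's handling of DAFT_S3_RETRY_ERROR_MSGS amounts to splitting the variable on commas and substring-matching error messages against each pattern. A minimal Python mirror of that logic (the helper names are ours, not Daft's):

```python
import os

def custom_retry_patterns() -> list[str]:
    """Parse DAFT_S3_RETRY_ERROR_MSGS into a list of substrings (empty if unset)."""
    raw = os.environ.get("DAFT_S3_RETRY_ERROR_MSGS", "")
    return [p for p in raw.split(",") if p]

def should_retry(error_msg: str, patterns: list[str]) -> bool:
    """Retry when the error message contains any configured pattern."""
    return any(p in error_msg for p in patterns)

os.environ["DAFT_S3_RETRY_ERROR_MSGS"] = "SlowDown,connection reset"
patterns = custom_retry_patterns()
print(should_retry("Please reduce your request rate (SlowDown)", patterns))  # True
print(should_retry("403 Forbidden", patterns))  # False
```

Because matching is by substring, patterns should be specific enough not to catch permanent errors (retrying a 403 forever would only mask a permissions problem).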

Usage

Configure this environment when:

  • Reading or writing data from/to cloud storage (S3, Azure, GCS)
  • Accessing private HuggingFace datasets or models
  • Using AI integrations that require API keys (OpenAI, OpenRouter)
  • Customizing S3 retry behavior for specific error patterns

System Requirements

Category Requirement Notes
Python >= 3.10 Inherited from the core environment
PyArrow >= 9.0 (for GCS) GcsFileSystem requires PyArrow 9.0 or later
Network Internet access Required for all cloud storage operations
Operating System Linux, macOS, Windows All platforms supported

Dependencies

System Packages

  • All core environment system packages

Python Packages

  • All core environment Python packages (pyarrow >= 8.0.0, fsspec, etc.)
  • boto3 < 1.43.0 -- AWS SDK for Python; install via pip install daft[aws]
  • huggingface-hub < 1.2.0 -- HuggingFace Hub client; install via pip install daft[huggingface]
  • datasets < 4.5.0 -- HuggingFace Datasets library; install via pip install daft[huggingface]
  • adlfs -- Azure Data Lake Storage filesystem for fsspec (dev/optional)
  • gcsfs -- Google Cloud Storage filesystem for fsspec (dev/optional)

Credentials

AWS S3

Variable Description Required
AWS_ACCESS_KEY_ID AWS access key ID No (falls back to boto3 credential chain)
AWS_SECRET_ACCESS_KEY AWS secret access key No (falls back to boto3 credential chain)
AWS_SESSION_TOKEN AWS session token for temporary credentials No (optional, for STS)
AWS_DEFAULT_REGION Default AWS region No (defaults to boto3 configuration)
DAFT_S3_RETRY_ERROR_MSGS Comma-separated list of custom error message patterns to trigger retries No (default: empty)

Fallback behavior: If no AWS credentials are found via botocore, Daft enables anonymous access (anon=True) and logs a warning. This allows reading from public S3 buckets without any credential configuration.

Azure Blob Storage

Variable Description Required
AZURE_STORAGE_ACCOUNT Azure Storage account name Yes (if not set in AzureConfig)
AZURE_STORAGE_KEY Azure Storage account access key No (one of key/SAS/token required for private data)
AZURE_STORAGE_SAS_TOKEN Azure Shared Access Signature token No (alternative to access key)
AZURE_STORAGE_TOKEN Azure bearer token (OAuth) No (alternative to access key)
AZURE_ENDPOINT_URL Custom Azure Blob endpoint URL No (for custom endpoints or emulators)

Fallback behavior: If no explicit credentials are provided, Daft attempts DefaultAzureCredential (which tries managed identity, Azure CLI, environment variables, etc.). If that also fails, it falls back to anonymous access.
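Mirroring the Rust resolution order above, the Azure settings fall through from explicit configuration to the AZURE_* environment variables. A sketch of that first layer (the AzureSettings class is ours for illustration, not Daft's AzureConfig API):

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class AzureSettings:
    storage_account: Optional[str] = None
    access_key: Optional[str] = None
    sas_token: Optional[str] = None
    bearer_token: Optional[str] = None
    endpoint_url: Optional[str] = None

def resolve_azure(config: AzureSettings) -> AzureSettings:
    """Fill unset fields from the standard AZURE_* environment variables."""
    env = os.environ.get
    return AzureSettings(
        storage_account=config.storage_account or env("AZURE_STORAGE_ACCOUNT"),
        access_key=config.access_key or env("AZURE_STORAGE_KEY"),
        sas_token=config.sas_token or env("AZURE_STORAGE_SAS_TOKEN"),
        bearer_token=config.bearer_token or env("AZURE_STORAGE_TOKEN"),
        endpoint_url=config.endpoint_url or env("AZURE_ENDPOINT_URL"),
    )

os.environ["AZURE_STORAGE_ACCOUNT"] = "myaccount"
resolved = resolve_azure(AzureSettings(access_key="explicit-key"))
print(resolved.storage_account)  # myaccount
print(resolved.access_key)       # explicit-key
```

If, after this layer, no key/SAS/token is available, the real client proceeds to DefaultAzureCredential and finally anonymous access, as described above; only the storage account itself is a hard requirement.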

Google Cloud Storage

Variable Description Required
GOOGLE_APPLICATION_CREDENTIALS Path to a service account JSON key file No (falls back to default application credentials)

Fallback behavior: Uses the standard Google Cloud credential chain (application default credentials, compute engine metadata, etc.).

HuggingFace

Variable Description Required
HF_TOKEN HuggingFace authentication token No (required only for private repos)

AI Providers

Variable Description Required
OPENAI_API_KEY OpenAI API key for LLM/embedding endpoints Yes (when using OpenAI provider)
OPENROUTER_API_KEY OpenRouter API key for multi-provider LLM access Yes (when using OpenRouter provider)
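Because these keys are hard requirements once their provider is selected, it is worth failing fast with a clear message instead of surfacing an opaque HTTP 401 mid-job. A hedged sketch (the helper name is ours, not a Daft API):

```python
import os

def require_api_key(var: str) -> str:
    """Fetch a required API key from the environment or fail with a clear error."""
    key = os.environ.get(var)
    if not key:
        raise EnvironmentError(f"{var} must be set to use this AI provider")
    return key

os.environ["OPENROUTER_API_KEY"] = "sk-or-example"
print(require_api_key("OPENROUTER_API_KEY"))  # sk-or-example
```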

Quick Install

# AWS S3 support
pip install "daft[aws]"

# Azure Blob Storage support (no extra Python deps, uses Rust native client)
pip install daft

# Google Cloud Storage support (no extra Python deps, uses PyArrow GcsFileSystem)
pip install daft

# HuggingFace support
pip install "daft[huggingface]"

# All cloud providers
pip install "daft[aws,huggingface]"

Code Evidence

S3 anonymous fallback from daft/filesystem.py lines 52-69:

def get_filesystem(protocol: str, **kwargs: Any) -> fsspec.AbstractFileSystem:
    if protocol == "s3" or protocol == "s3a":
        try:
            import botocore.session
        except ImportError:
            logger.error(
                "Error when importing botocore. install daft[aws] for the required "
                "3rd party dependencies to interact with AWS S3"
            )
            raise

        s3fs_kwargs = {}

        credentials_available = botocore.session.get_session().get_credentials() is not None
        if not credentials_available:
            logger.warning(
                "AWS credentials not found - using anonymous access to S3 which will "
                "fail if the bucket you are accessing is not a public bucket."
            )
            s3fs_kwargs["anon"] = True

Azure credential resolution from src/daft-io/src/azure_blob.rs lines 177-242:

// Storage account from config or environment
} else if let Ok(storage_account) = std::env::var("AZURE_STORAGE_ACCOUNT") {
    storage_account
}

// Access key from config or environment
let access_key = config.access_key.clone().or_else(|| {
    std::env::var("AZURE_STORAGE_KEY").ok().map(std::convert::Into::into)
});

// SAS token from config or environment
.or_else(|| std::env::var("AZURE_STORAGE_SAS_TOKEN").ok());

// Bearer token from config or environment
.or_else(|| std::env::var("AZURE_STORAGE_TOKEN").ok());

// Endpoint URL from config or environment
std::env::var("AZURE_ENDPOINT_URL").ok()

S3 custom retry error messages from src/daft-io/src/s3_like.rs line 412:

let retry_error_msgs = std::env::var("DAFT_S3_RETRY_ERROR_MSGS")
    .map(|s| s.split(',').map(|s| s.to_string()).collect::<Vec<_>>())
    .unwrap_or_default();

Common Errors

Error Message Cause Solution
Error when importing botocore. install daft[aws] boto3/botocore not installed. Run pip install "daft[aws]".
AWS credentials not found - using anonymous access No AWS credentials detected via botocore. Configure AWS credentials using aws configure, environment variables, or IAM roles. This is a warning; public buckets will still work.
Azure Storage Account not set and is required Neither AzureConfig.storage_account nor AZURE_STORAGE_ACCOUNT is set. Set the AZURE_STORAGE_ACCOUNT environment variable or pass the account name via configuration.
403 Forbidden (S3/Azure/GCS) Credentials are present but lack the required permissions. Verify that the credential has read (and optionally write) access to the target bucket/container.
HF_TOKEN is required for private repositories Attempting to access a private HuggingFace dataset without a token. Set the HF_TOKEN environment variable with a valid HuggingFace access token.

Compatibility Notes

  • AWS S3: The daft[aws] extra installs boto3 < 1.43.0. Daft's Rust-native S3 client handles the actual I/O, while boto3 is used for credential resolution.
  • Azure Blob Storage: The daft[azure] extra currently has no additional Python dependencies; Azure I/O is handled entirely by the Rust-native client. For fsspec-based access, adlfs is available as a dev dependency.
  • GCS: Requires pyarrow >= 9.0 for the GcsFileSystem integration. The daft[gcp] extra currently has no additional Python dependencies.
  • HuggingFace: The daft[huggingface] extra installs both huggingface-hub and datasets libraries.
  • S3 retry customization: The DAFT_S3_RETRY_ERROR_MSGS environment variable accepts a comma-separated list of error message substrings. When an S3 error message matches any of these patterns, Daft will retry the request.
