
Environment: ArroyoSystems Arroyo Object Storage

From Leeroopedia


Domains: Infrastructure, Storage
Last Updated: 2026-02-08 08:00 GMT

Overview

Multi-backend object storage environment supporting S3, GCS, Azure Blob, Cloudflare R2, and local filesystem for checkpoint and artifact persistence.

Description

Arroyo uses a unified `StorageProvider` abstraction (built on the `object_store` crate) to read and write checkpoint data, state snapshots, and compiled UDF artifacts. The storage backend is determined by the URL scheme of the configured `checkpoint-url`. Supported backends include AWS S3 (including custom endpoints), Google Cloud Storage, Azure Blob Storage, Cloudflare R2, and the local filesystem. Each backend supports multiple URL formats for flexibility. S3 connections use custom retry and timeout settings optimized for streaming workloads.
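The scheme-based dispatch can be sketched as follows. This is a minimal illustration, not Arroyo's actual `StorageProvider` code: the `Backend` enum and `choose_backend` function are hypothetical, the `azure://`/`abfs://` schemes are an assumption (this page does not list Azure URL formats), and the real implementation uses the regex patterns shown under Code Evidence.

```rust
// Hypothetical sketch: pick a storage backend from the URL scheme.
// The azure://, abfs:// schemes are assumed; all other schemes below
// appear in this page's Code Evidence section.
#[derive(Debug, PartialEq)]
enum Backend {
    S3,
    Gcs,
    AzureBlob,
    R2,
    LocalFs,
}

fn choose_backend(url: &str) -> Option<Backend> {
    let lower = url.to_ascii_lowercase();
    if lower.starts_with("s3://") || lower.starts_with("s3a://") || lower.starts_with("s3::") {
        Some(Backend::S3)
    } else if lower.starts_with("gs://") {
        Some(Backend::Gcs)
    } else if lower.starts_with("azure://") || lower.starts_with("abfs://") {
        Some(Backend::AzureBlob)
    } else if lower.starts_with("r2://") {
        Some(Backend::R2)
    } else if lower.starts_with("file://") || lower.starts_with('/') {
        Some(Backend::LocalFs)
    } else {
        None
    }
}

fn main() {
    assert_eq!(choose_backend("s3://my-bucket/checkpoints"), Some(Backend::S3));
    assert_eq!(choose_backend("/tmp/arroyo/checkpoints"), Some(Backend::LocalFs));
    println!("scheme dispatch ok");
}
```

Because the backend is chosen purely from the URL, switching a pipeline from local to cloud storage is a one-line configuration change.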

Usage

Use this environment for all checkpoint and state storage in Arroyo. Every pipeline writes its checkpoint data (Parquet state files) to this backend, and the compiler service stores compiled UDF artifacts there as well. The local filesystem is the default for development; cloud storage is recommended for production.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| Storage | Any supported backend | S3, GCS, Azure Blob, R2, or local filesystem |
| Network | HTTPS access to cloud storage | Unless using local filesystem or a custom HTTP endpoint |
| Disk (local) | 10GB+ SSD | Default path: `/tmp/arroyo/checkpoints` |

Dependencies

Rust Crate Dependencies

  • `object_store` = 0.12.3 (unified storage abstraction)
  • `aws-config` = 1.5.13 (AWS SDK configuration)
  • `aws-credential-types` = 1.2.0 (AWS credential management)

Credentials

AWS S3:

  • `AWS_DEFAULT_REGION`: S3 region
  • `AWS_ENDPOINT`: Custom S3-compatible endpoint URL
  • `AWS_ACCESS_KEY_ID`: AWS access key
  • `AWS_SECRET_ACCESS_KEY`: AWS secret key
  • `AWS_SESSION_TOKEN`: Optional session token for temporary credentials

Cloudflare R2:

  • `CLOUDFLARE_ACCOUNT_ID`: Cloudflare account ID (if not in URL)
  • `R2_ACCESS_KEY_ID`: R2 access key (falls back to `AWS_ACCESS_KEY_ID`)
  • `R2_SECRET_ACCESS_KEY`: R2 secret key (falls back to `AWS_SECRET_ACCESS_KEY`)
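The R2-to-AWS fallback described above can be sketched with a small helper. The `first_set` and `r2_access_key` functions are hypothetical names for illustration; only the precedence (R2-specific variable first, AWS equivalent second) is taken from this page.

```rust
use std::env;

// Hypothetical sketch of the documented precedence:
// an R2-specific credential wins; otherwise fall back to the AWS one.
fn first_set(primary: Option<String>, fallback: Option<String>) -> Option<String> {
    primary.or(fallback)
}

fn r2_access_key() -> Option<String> {
    first_set(
        env::var("R2_ACCESS_KEY_ID").ok(),
        env::var("AWS_ACCESS_KEY_ID").ok(),
    )
}

fn main() {
    // Pure-value checks of the precedence rule.
    assert_eq!(
        first_set(Some("r2-key".into()), Some("aws-key".into())),
        Some("r2-key".to_string())
    );
    assert_eq!(
        first_set(None, Some("aws-key".into())),
        Some("aws-key".to_string())
    );
    // Environment-dependent; just exercise the lookup path.
    let _ = r2_access_key();
    println!("fallback ok");
}
```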

Google Cloud Storage:

  • `GOOGLE_SERVICE_ACCOUNT_KEY`: GCS service account JSON key

Azure Blob Storage:

  • Standard Azure SDK environment variables (e.g., `AZURE_STORAGE_ACCOUNT`, `AZURE_STORAGE_KEY`)

Configuration:

  • `ARROYO__CHECKPOINT_URL`: Checkpoint storage URL (default: `/tmp/arroyo/checkpoints`)
  • `ARROYO__COMPILER__ARTIFACT_URL`: Compiled artifact URL (default: `/tmp/arroyo/artifacts`)
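The naming convention visible in these two variables (and in the `default.toml` excerpt under Code Evidence) is that `ARROYO__` prefixes a config key, `__` separates TOML sections, and single underscores become hyphens. The sketch below illustrates that inferred mapping; `env_to_toml_key` is a hypothetical function, not part of Arroyo's config code.

```rust
// Inferred from the two examples on this page:
// ARROYO__CHECKPOINT_URL          -> checkpoint-url
// ARROYO__COMPILER__ARTIFACT_URL  -> [compiler] artifact-url
fn env_to_toml_key(var: &str) -> Option<String> {
    let rest = var.strip_prefix("ARROYO__")?;
    let key = rest
        .split("__") // double underscore separates TOML sections
        .map(|part| part.to_ascii_lowercase().replace('_', "-"))
        .collect::<Vec<_>>()
        .join(".");
    Some(key)
}

fn main() {
    assert_eq!(
        env_to_toml_key("ARROYO__CHECKPOINT_URL").as_deref(),
        Some("checkpoint-url")
    );
    assert_eq!(
        env_to_toml_key("ARROYO__COMPILER__ARTIFACT_URL").as_deref(),
        Some("compiler.artifact-url")
    );
    println!("mapping ok");
}
```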

Quick Install

# Local filesystem (default, no setup needed)
# Checkpoints stored at /tmp/arroyo/checkpoints

# AWS S3
export ARROYO__CHECKPOINT_URL=s3://my-bucket/arroyo/checkpoints
export AWS_DEFAULT_REGION=us-east-1
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret

# Google Cloud Storage
export ARROYO__CHECKPOINT_URL=gs://my-bucket/arroyo/checkpoints
export GOOGLE_SERVICE_ACCOUNT_KEY='{"type":"service_account",...}'

# MinIO / S3-compatible
export ARROYO__CHECKPOINT_URL="s3::http://localhost:9000/my-bucket/checkpoints"

Code Evidence

URL parsing patterns from `lib.rs:48-80`:

// S3 URL formats
const S3_PATH: &str =
    r"^https://s3\.(?P<region>[\w\-]+)\.amazonaws\.com/(?P<bucket>[a-z0-9\-\.]+)(/(?P<key>.+))?$";
const S3_VIRTUAL: &str =
    r"^https://(?P<bucket>[a-z0-9\-\.]+)\.s3\.(?P<region>[\w\-]+)\.amazonaws\.com(/(?P<key>.+))?$";
const S3_URL: &str = r"^[sS]3[aA]?://(?P<bucket>[a-z0-9\-\.]+)(/(?P<key>.+))?$";
const S3_ENDPOINT_URL: &str = r"^[sS]3[aA]?::(?<protocol>https?)://...";

// GCS URL formats
const GCS_VIRTUAL: &str =
    r"^https://(?P<bucket>[a-z\d\-_\.]+)\.storage\.googleapis\.com(/(?P<key>.+))?$";
const GCS_URL: &str = r"^[gG][sS]://(?P<bucket>[a-z0-9\-\.]+)(/(?P<key>.+))?$";

// Cloudflare R2 URL formats
const R2_URL: &str =
    r"^[rR]2://((?P<account_id>[a-zA-Z0-9]+)@)?(?P<bucket>[a-z0-9\-\.]+)(/(?P<key>.+))?$";

// Local filesystem
const FILE_URI: &str = r"^file://(?P<path>.*)$";
const FILE_PATH: &str = r"^/(?P<path>.*)$";
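The behavior of the `S3_URL` pattern above (case-insensitive `s3`/`s3a` scheme, required bucket, optional key) can be mirrored with stdlib string handling alone. This is a sketch for the plain `s3://` form only; the real code applies the `regex` crate patterns shown above, and `parse_s3_url` is a hypothetical name.

```rust
// Stdlib-only sketch of the S3_URL pattern: [sS]3[aA]?://bucket(/key)?
// Does not handle the s3:: custom-endpoint form or the https:// forms.
fn parse_s3_url(url: &str) -> Option<(&str, Option<&str>)> {
    let idx = url.find("://")?;
    let (scheme, rest) = url.split_at(idx);
    let rest = &rest[3..]; // skip "://"
    let scheme = scheme.to_ascii_lowercase();
    if scheme != "s3" && scheme != "s3a" {
        return None;
    }
    let mut parts = rest.splitn(2, '/');
    let bucket = parts.next()?;
    if bucket.is_empty() {
        return None;
    }
    // Key is optional, as in the (/(?P<key>.+))? group.
    let key = parts.next().filter(|k| !k.is_empty());
    Some((bucket, key))
}

fn main() {
    assert_eq!(
        parse_s3_url("s3://my-bucket/arroyo/checkpoints"),
        Some(("my-bucket", Some("arroyo/checkpoints")))
    );
    assert_eq!(parse_s3_url("S3://my-bucket"), Some(("my-bucket", None)));
    assert_eq!(parse_s3_url("gs://my-bucket"), None);
    println!("s3 url parsing ok");
}
```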

Default configuration from `default.toml:1,37-38`:

checkpoint-url = "/tmp/arroyo/checkpoints"

[compiler]
artifact-url = "/tmp/arroyo/artifacts"
build-dir = "/tmp/arroyo/build-dir"

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `NoSuchBucket` | S3 bucket does not exist | Create the bucket or check the URL |
| `AccessDenied` | Insufficient S3/GCS/Azure permissions | Verify credentials and IAM permissions |
| `No such file or directory` | Local filesystem path does not exist | Arroyo auto-creates local paths; check parent directory permissions |
| `CLOUDFLARE_ACCOUNT_ID not set` | R2 account ID not configured | Set `CLOUDFLARE_ACCOUNT_ID` env var or include the account ID in the URL |

Compatibility Notes

  • S3-compatible services: MinIO, Ceph, and other S3-compatible services are supported via the `s3::http://endpoint:port/bucket` URL format. Set `AWS_ENDPOINT` for custom endpoints.
  • S3 retry strategy: Arroyo sets `max_retries=0` on the object_store S3 client because it handles retries at a higher level.
  • S3 timeout: `operation_timeout=60s`, `operation_attempt_timeout=5s`
  • Virtual hosted-style: Enabled by default for S3; disabled when using custom endpoints.
  • Local filesystem: Auto-creates directories on first write. Default path `/tmp/arroyo/checkpoints` is ephemeral; use persistent storage for production.
