Environment: Huggingface Datatrove S3 Storage Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Cloud_Storage |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Cloud storage environment for reading and writing Datatrove pipeline data to Amazon S3 and other S3-compatible object storage backends.
Description
This environment enables Datatrove pipelines to read input data from and write output data to Amazon S3 (or S3-compatible) storage. It uses `s3fs`, built on `fsspec`, for transparent filesystem access, so S3 paths can be used throughout a pipeline for data folders, logging directories, and intermediate results. The Rust-based fast MinHash step 3 tool (`fast_mh3`) also has native S3 support and can read from and write to S3 directly, without going through Python.
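As a minimal illustration of that transparent access, any fsspec-aware code can open an `s3://` path directly once `s3fs` is installed. This is only a sketch: the bucket and key are hypothetical placeholders, and valid AWS credentials must already be configured (see Credentials below).

```python
# Sketch of transparent fsspec/s3fs access; bucket and key are placeholders.
import fsspec

with fsspec.open("s3://my-bucket/data/sample.jsonl", "rt") as f:
    first_line = f.readline()
print(first_line)
```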
Usage
Use this environment when your pipeline data resides on S3 or when you want to store pipeline outputs (logs, statistics, results) on S3. Common scenarios include processing Common Crawl data (stored on S3), storing deduplication intermediate results on S3, and writing tokenized output to S3 for training.
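A sketch of a typical setup, assuming hypothetical bucket names and the standard `JsonlReader`/`JsonlWriter` components, with every folder argument pointing at S3:

```python
# Sketch: read JSONL documents from S3, write results and logs back to S3.
# Bucket names and prefixes are hypothetical placeholders.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("s3://my-input-bucket/commoncrawl/"),  # input data on S3
        JsonlWriter("s3://my-output-bucket/processed/"),   # output written to S3
    ],
    logging_dir="s3://my-output-bucket/logs/",             # logs and stats on S3
    tasks=4,
)
executor.run()
```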
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Network | Internet access to S3 endpoints | Or internal network access for S3-compatible services (MinIO, Ceph) |
| Permissions | S3 read/write permissions | IAM role or access keys for the target buckets |
Dependencies
Python Packages
- `s3fs` >= 2023.12.2 — S3 filesystem interface built on fsspec
Optional (for Rust fast_mh3 tool)
- Rust toolchain — For compiling the fast MinHash step 3 S3 backend
- `aws-sdk-s3` (Rust crate) — S3 access from the Rust binary
Credentials
The following environment variables (or equivalent AWS configuration) are required:
- `AWS_ACCESS_KEY_ID`: AWS access key (or use IAM role)
- `AWS_SECRET_ACCESS_KEY`: AWS secret key (or use IAM role)
- `AWS_DEFAULT_REGION`: AWS region (optional, defaults to provider config)
- AWS credentials can also be provided via a `~/.aws/credentials` file or an IAM instance profile, or passed to `s3fs` explicitly, as in the sketch below
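A minimal sketch of passing credentials explicitly to `s3fs` rather than relying on the environment; the key, secret, and bucket name are placeholders:

```python
# Sketch: explicit credentials via the s3fs constructor (placeholder values).
import s3fs

fs = s3fs.S3FileSystem(
    key="AKIA...",   # AWS_ACCESS_KEY_ID
    secret="...",    # AWS_SECRET_ACCESS_KEY
)
print(fs.ls("my-bucket"))  # quick read-permission check on the target bucket
```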
Quick Install
# Install datatrove with S3 support
pip install "datatrove[s3]"
# Or install s3fs directly
pip install "s3fs>=2023.12.2"
Code Evidence
S3 dependency group from `pyproject.toml:48-50`:
s3 = [
"s3fs>=2023.12.2",
]
S3 availability checks from `src/datatrove/utils/_import_utils.py:105-106`:
def is_s3fs_available():
return _is_package_available("s3fs")
S3 union-find implementation from `src/datatrove/tools/fast_mh3/src/s3_union_find.rs:1-10`:
// S3-based union-find implementation for MinHash step 3
// Reads bucket files from S3 and writes cluster results back to S3
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: Please install s3fs` | s3fs not installed | `pip install "datatrove[s3]"` |
| `NoCredentialsError` | AWS credentials not configured | Set `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY` or configure IAM role |
| `ClientError: Access Denied` | Insufficient S3 permissions | Ensure the IAM policy allows `s3:GetObject`, `s3:PutObject`, and `s3:ListBucket` on the target buckets |
Compatibility Notes
- s3fs >= 2023.12.2: The minimum version is aligned with the `fsspec` core dependency to keep the two packages compatible.
- S3-compatible services: Works with MinIO, Ceph, and other S3-compatible storage backends by configuring the endpoint URL in fsspec/s3fs, as in the sketch below.
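For example, a minimal sketch pointing `s3fs` at a local MinIO instance; the endpoint URL, credentials, and bucket name are placeholders:

```python
# Sketch: use an S3-compatible service by overriding the endpoint URL.
import s3fs

fs = s3fs.S3FileSystem(
    key="minioadmin",
    secret="minioadmin",
    client_kwargs={"endpoint_url": "http://localhost:9000"},
)
print(fs.ls("my-bucket"))
```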