Environment: Huggingface Datatrove S3 Storage Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Cloud_Storage |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Cloud storage environment for reading and writing Datatrove pipeline data to Amazon S3 and other S3-compatible object storage backends.
Description
This environment enables Datatrove pipelines to read input data from and write output data to Amazon S3 (or S3-compatible) storage. It uses `s3fs`, built on `fsspec`, for transparent filesystem access, so S3 paths can be used throughout a pipeline for data folders, logging directories, and intermediate results. The Rust-based fast MinHash step 3 tool (`fast_mh3`) also has native S3 support and can read from and write to S3 directly, without going through Python.
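As a minimal illustration of that transparent access, any fsspec-aware code can open an `s3://` path directly once `s3fs` is installed. This is only a sketch: the bucket and key are hypothetical placeholders, and valid AWS credentials must already be configured (see Credentials below).

```python
# Sketch of transparent fsspec/s3fs access; bucket and key are placeholders.
import fsspec

with fsspec.open("s3://my-bucket/data/sample.jsonl", "rt") as f:
    first_line = f.readline()
print(first_line)
```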
Usage
Use this environment when your pipeline data resides on S3 or when you want to store pipeline outputs (logs, statistics, results) on S3. Common scenarios include processing Common Crawl data (stored on S3), storing deduplication intermediate results on S3, and writing tokenized output to S3 for training.
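A sketch of a typical setup, assuming hypothetical bucket names and the standard `JsonlReader`/`JsonlWriter` components, with every folder argument pointing at S3:

```python
# Sketch: read JSONL documents from S3, write results and logs back to S3.
# Bucket names and prefixes are hypothetical placeholders.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

executor = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("s3://my-input-bucket/commoncrawl/"),  # input data on S3
        JsonlWriter("s3://my-output-bucket/processed/"),   # output written to S3
    ],
    logging_dir="s3://my-output-bucket/logs/",             # logs and stats on S3
    tasks=4,
)
executor.run()
```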
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Network | Internet access to S3 endpoints | Or internal network access for S3-compatible services (MinIO, Ceph) |
| Permissions | S3 read/write permissions | IAM role or access keys for the target buckets |
Dependencies
Python Packages
- `s3fs` >= 2023.12.2 — S3 filesystem interface built on fsspec
Optional (for Rust fast_mh3 tool)
- Rust toolchain — For compiling the fast MinHash step 3 S3 backend
- `aws-sdk-s3` (Rust crate) — S3 access from the Rust binary
Credentials
The following environment variables (or equivalent AWS configuration) are required:
- `AWS_ACCESS_KEY_ID`: AWS access key (or use IAM role)
- `AWS_SECRET_ACCESS_KEY`: AWS secret key (or use IAM role)
- `AWS_DEFAULT_REGION`: AWS region (optional, defaults to provider config)
- AWS credentials can also be provided via a `~/.aws/credentials` file or an IAM instance profile, or passed to `s3fs` explicitly, as in the sketch below
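A minimal sketch of passing credentials explicitly to `s3fs` rather than relying on the environment; the key, secret, and bucket name are placeholders:

```python
# Sketch: explicit credentials via the s3fs constructor (placeholder values).
import s3fs

fs = s3fs.S3FileSystem(
    key="AKIA...",   # AWS_ACCESS_KEY_ID
    secret="...",    # AWS_SECRET_ACCESS_KEY
)
print(fs.ls("my-bucket"))  # quick read-permission check on the target bucket
```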
Quick Install
# Install datatrove with S3 support
pip install "datatrove[s3]"
# Or install s3fs directly
pip install "s3fs>=2023.12.2"
Code Evidence
S3 dependency group from `pyproject.toml:48-50`:
s3 = [
"s3fs>=2023.12.2",
]
S3 availability checks from `src/datatrove/utils/_import_utils.py:105-106`:
def is_s3fs_available():
return _is_package_available("s3fs")
S3 union-find implementation from `src/datatrove/tools/fast_mh3/src/s3_union_find.rs:1-10`:
// S3-based union-find implementation for MinHash step 3
// Reads bucket files from S3 and writes cluster results back to S3
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: Please install s3fs` | s3fs not installed | `pip install "datatrove[s3]"` |
| `NoCredentialsError` | AWS credentials not configured | Set `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY` or configure IAM role |
| `ClientError: Access Denied` | Insufficient S3 permissions | Ensure the IAM policy allows `s3:GetObject`, `s3:PutObject`, and `s3:ListBucket` on the target buckets |
Compatibility Notes
- s3fs >= 2023.12.2: The minimum version is aligned with the `fsspec` core dependency to keep the two packages compatible.
- S3-compatible services: Works with MinIO, Ceph, and other S3-compatible storage backends by configuring the endpoint URL in fsspec/s3fs, as in the sketch below.
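For example, a minimal sketch pointing `s3fs` at a local MinIO instance; the endpoint URL, credentials, and bucket name are placeholders:

```python
# Sketch: use an S3-compatible service by overriding the endpoint URL.
import s3fs

fs = s3fs.S3FileSystem(
    key="minioadmin",
    secret="minioadmin",
    client_kwargs={"endpoint_url": "http://localhost:9000"},
)
print(fs.ls("my-bucket"))
```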