
Heuristic:Eventual Inc Daft Delta Lake S3 Locking

From Leeroopedia




Knowledge Sources
Domains Delta Lake, S3 Storage, Concurrency, Data Integrity
Last Updated 2026-02-08 15:30 GMT

Overview

When writing Delta Lake tables to S3 using Daft, concurrent writes from multiple processes require a DynamoDB-based locking mechanism; without it, Daft falls back to AWS_S3_ALLOW_UNSAFE_RENAME=true, which risks data corruption under concurrent access.

Description

Amazon S3 does not provide atomic rename operations. Delta Lake's transaction protocol relies on atomic renames to commit new versions of the transaction log (_delta_log/). On HDFS or local filesystems, renames are atomic, but on S3 they are not -- a rename is implemented as a copy-then-delete, which creates a window where two concurrent writers can both believe they have successfully committed.

The Default Behavior (Unsafe for Concurrent Writes)

When no locking provider is configured, Daft sets the environment variable AWS_S3_ALLOW_UNSAFE_RENAME=true before performing the write. This tells the Delta Lake library to proceed with non-atomic renames, effectively assuming that only a single writer is active at any time. This is explicitly not safe for concurrent writes and can lead to:

  • Lost transactions (one writer's commit overwrites another's).
  • Corrupt transaction logs.
  • Duplicate or missing data in the table.

The Safe Approach (DynamoDB Locking)

For production workloads with multiple writers, the write_deltalake() API accepts a dynamo_table_name parameter. When provided, Daft sets two environment variables:

  • AWS_S3_LOCKING_PROVIDER=dynamodb -- Activates the DynamoDB-based locking provider.
  • DELTA_DYNAMO_TABLE_NAME={name} -- Specifies the DynamoDB table to use for lock coordination.

This ensures that concurrent writers coordinate through DynamoDB, achieving the atomicity guarantees that S3 alone cannot provide.
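The two configurations can be sketched as a small helper that mirrors the environment variables Daft sets before the write (the function name `configure_s3_commit_safety` is illustrative, not part of the Daft API):

```python
import os
from typing import Optional

def configure_s3_commit_safety(dynamo_table_name: Optional[str] = None) -> None:
    """Mirror the environment variables Daft sets before a Delta Lake write to S3."""
    if dynamo_table_name is not None:
        # Safe path: coordinate commits through a DynamoDB lock table.
        os.environ["AWS_S3_LOCKING_PROVIDER"] = "dynamodb"
        os.environ["DELTA_DYNAMO_TABLE_NAME"] = dynamo_table_name
    else:
        # Default path: assume a single writer; renames are not atomic on S3.
        os.environ["AWS_S3_ALLOW_UNSAFE_RENAME"] = "true"
```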

Version Constraints

  • deltalake >= 0.14.0 is required for Delta Lake write support in Daft (enforced at daft/dataframe/dataframe.py:1276-1278).
  • deltalake < 1.3.0 is the upper bound when installing via pip install daft[deltalake].
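The supported window can be checked programmatically; a minimal sketch (the helper name is hypothetical, and it assumes a plain "X.Y.Z" version string with no pre-release handling):

```python
def deltalake_version_ok(version: str) -> bool:
    """True if `version` falls in the supported window:
    deltalake >= 0.14.0 (required by Daft) and < 1.3.0 (pip upper bound)."""
    parts = tuple(int(p) for p in version.split(".")[:3])
    return (0, 14, 0) <= parts < (1, 3, 0)
```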

API Surface

The write_deltalake() method on the Daft DataFrame accepts the following relevant parameters:

  • path -- The S3 URI (or local path) of the Delta Lake table.
  • mode -- Write mode ("append", "overwrite", etc.).
  • dynamo_table_name -- The name of the DynamoDB table to use for S3 locking.
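Putting the parameters together, a thin wrapper might look like this (the wrapper name `append_with_locking` and the bucket/table names in the commented call are illustrative; the write_deltalake() parameters are those listed above):

```python
def append_with_locking(df, path: str, dynamo_table_name: str):
    """Append `df` (a daft.DataFrame) to a Delta Lake table at `path`,
    coordinating the commit through the given DynamoDB lock table."""
    return df.write_deltalake(path, mode="append",
                              dynamo_table_name=dynamo_table_name)

# Typical call (requires AWS credentials and a provisioned DynamoDB table):
# import daft
# df = daft.read_parquet("s3://my-bucket/input/*.parquet")
# append_with_locking(df, "s3://my-bucket/events", "delta_lock_table")
```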

Usage

Apply this heuristic when:

  • Writing Delta Lake tables to S3 from Daft in any environment where multiple writers may be active.
  • Deploying Daft pipelines to production where data integrity is critical.
  • Running parallel Daft jobs (e.g., multiple Airflow tasks or Ray jobs) that write to the same Delta Lake table.
  • Evaluating whether the default Daft Delta Lake write configuration is safe for your use case.

The Insight (Rule of Thumb)

  • Action: Always provide the dynamo_table_name parameter to write_deltalake() when writing to S3 in production with multiple concurrent writers. For single-writer development or testing scenarios, the default (unsafe rename) is acceptable.
  • Value: DynamoDB locking guarantees transactional integrity for Delta Lake commits on S3, preventing data loss and log corruption that can occur with concurrent non-atomic renames.
  • Trade-off: DynamoDB locking adds latency to each commit (a round-trip to DynamoDB per transaction) and requires provisioning and maintaining a DynamoDB table. For single-writer workloads, this overhead is unnecessary.

Reasoning

The core issue is a mismatch between Delta Lake's transaction protocol and S3's consistency model. Delta Lake was originally designed for HDFS, where rename is an atomic operation provided by the filesystem. When a writer commits a new transaction, it writes the data files first, then atomically renames a temporary log entry (e.g., _delta_log/00000000000000000042.json.tmp to _delta_log/00000000000000000042.json). If two writers attempt to commit the same version simultaneously, only one rename succeeds, and the other writer retries with the next version number.
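The commit-and-retry protocol above can be illustrated on a local filesystem, where `open(..., O_CREAT | O_EXCL)` provides the same create-if-absent atomicity that Delta Lake relies on (a simplified sketch; a real commit also writes the data files before the log entry):

```python
import os
import tempfile

def commit(log_dir: str, version: int, payload: bytes) -> int:
    """Try to commit `version`; on collision, retry with the next version.
    O_CREAT | O_EXCL makes creation fail atomically if the entry exists."""
    while True:
        entry = os.path.join(log_dir, f"{version:020d}.json")
        try:
            fd = os.open(entry, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            version += 1  # another writer won this version; retry
            continue
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        return version
```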

On S3, rename is not atomic. It is implemented as:

  1. PUT (copy the object to the new key)
  2. DELETE (remove the old key)

Between steps 1 and 2, both the old and new keys exist. If two writers attempt to commit version 42 simultaneously, both may successfully "rename" their log entries, resulting in one overwriting the other. The losing writer's data files become orphaned, and its transaction is silently lost.
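The lost-update window can be simulated with a dictionary standing in for an S3 bucket: both writers' copy-then-delete "renames" succeed, and the second PUT silently overwrites the first (a toy model; real S3 races depend on timing):

```python
def unsafe_rename(bucket: dict, src: str, dst: str) -> None:
    # S3-style rename: copy then delete -- not atomic, no existence check.
    bucket[dst] = bucket[src]   # PUT (silently overwrites if dst exists)
    del bucket[src]             # DELETE

bucket = {
    "tmp/a-42.json": '{"writer": "a"}',
    "tmp/b-42.json": '{"writer": "b"}',
}
# Both writers "commit" version 42; neither sees an error.
unsafe_rename(bucket, "tmp/a-42.json", "_delta_log/00042.json")
unsafe_rename(bucket, "tmp/b-42.json", "_delta_log/00042.json")
```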

DynamoDB locking solves this by using DynamoDB's conditional writes as a distributed lock. Before committing, a writer acquires a lock in the DynamoDB table for the specific table and version. Only one writer can hold the lock at a time, so concurrent commits are serialized. This provides the same atomicity guarantee that HDFS rename provides natively.
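The conditional-write mechanism can be modeled with a put-if-absent table: only one writer's claim on a given version succeeds, and the loser must retry (an in-memory stand-in for DynamoDB's `attribute_not_exists` conditional put, not the actual delta-rs implementation):

```python
import threading

class LockTable:
    """In-memory stand-in for a DynamoDB lock table."""
    def __init__(self):
        self._items = {}
        self._mutex = threading.Lock()

    def put_if_absent(self, key: str, value: str) -> bool:
        # Models DynamoDB put_item with a ConditionExpression of
        # attribute_not_exists(key): only the first caller succeeds.
        with self._mutex:
            if key in self._items:
                return False
            self._items[key] = value
            return True

table = LockTable()
won_a = table.put_if_absent("s3://bucket/table#v42", "writer-a")
won_b = table.put_if_absent("s3://bucket/table#v42", "writer-b")
```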

Daft's decision to set AWS_S3_ALLOW_UNSAFE_RENAME=true as the default (rather than failing) is a pragmatic choice: it allows single-writer pipelines to work out of the box without requiring DynamoDB setup. However, this makes it the developer's responsibility to enable locking when concurrency is required.

Related Pages
