Principle:Spotify Luigi Cloud Data Targets


Overview

Cloud Data Targets are abstract representations of files stored in remote object-storage systems that a pipeline task can read from or write to, enabling distributed workflows to consume and produce data without coupling to a local file system.

Description

In pipeline orchestration, a target is the artifact that a task claims to produce. When pipelines operate at scale -- particularly when they involve distributed compute frameworks like Spark -- the data they process rarely resides on a single machine's local disk. Instead, it lives in cloud object stores such as Amazon S3, Google Cloud Storage, or Azure Blob Storage. A Cloud Data Target wraps the object-store path with a uniform interface that supports:

Existence checking -- The orchestrator queries the target to decide whether a task has already completed, preventing redundant recomputation.
Atomic writes -- Data is first written to a temporary local file and uploaded to the cloud store only after the write has finished, so partial results are never visible to downstream tasks.
Streaming reads -- The target provides a file-like object that streams bytes from the remote store without requiring a full local download first.
Client abstraction -- Authentication, region selection, role assumption, and multi-part upload are encapsulated inside a dedicated client class, keeping task code free of cloud SDK details.
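As a rough illustration of how a target can pair a path with a client and expose streaming reads, here is a minimal sketch. InMemoryClient, CloudTarget, and read_chunks are hypothetical names invented for this example; they are not Luigi's API (in Luigi the corresponding class is luigi.contrib.s3.S3Target), and the in-memory dictionary merely stands in for a real object store.

```python
import io
from typing import Dict, Iterator


class InMemoryClient:
    """Hypothetical stand-in for a cloud SDK client; keeps objects in a dict."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, path: str, data: bytes) -> None:
        self._objects[path] = data

    def exists(self, path: str) -> bool:
        return path in self._objects

    def open_stream(self, path: str) -> io.BytesIO:
        # A real client would return a streaming response body here.
        return io.BytesIO(self._objects[path])


class CloudTarget:
    """Illustrative target: a path plus the client that knows how to reach it."""

    def __init__(self, path: str, client: InMemoryClient) -> None:
        self.path = path
        self.client = client

    def exists(self) -> bool:
        return self.client.exists(self.path)

    def read_chunks(self, chunk_size: int = 4) -> Iterator[bytes]:
        # Yield the object piece by piece instead of loading it all at once.
        stream = self.client.open_stream(self.path)
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            yield chunk


client = InMemoryClient()
client.put("s3://example-bucket/data.txt", b"hello world")
target = CloudTarget("s3://example-bucket/data.txt", client)
chunks = list(target.read_chunks(chunk_size=4))
```

Because the task code only touches the target, swapping the in-memory client for one backed by a real SDK would not change how the task reads its input.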

This abstraction is critical for Spark processing pipelines because Spark jobs routinely read input datasets from and write output datasets to cloud storage. By representing those locations as targets within the orchestrator, the scheduler can automatically determine whether upstream data is available before launching expensive distributed compute.

Usage

Use Cloud Data Targets when:

Your pipeline reads from or writes to Amazon S3 (or a compatible object store).
You need the orchestrator to check for task completion by verifying the existence of a remote file.
You want atomic writes so that downstream tasks never see partially written output.
Your Spark jobs require their input and output paths to be expressed as s3:// URIs.

Theoretical Basis

Cloud Data Targets rely on two core design patterns:

1. The Target abstraction (Idempotent Output Marker)

Every task in a dependency-driven pipeline declares one or more output targets. Before running a task, the orchestrator calls exists() on each of them; if all outputs already exist, the task is treated as complete and skipped. This makes reruns idempotent -- running the pipeline again produces the same result without repeating finished work. For cloud targets, the existence check typically translates to a metadata lookup such as an HTTP HEAD request against the object store.
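The skip-if-exists behavior can be sketched in a few lines. This is a simplified model under stated assumptions, not the orchestrator's actual scheduling code: FakeTarget, run_if_needed, and the set standing in for the object store are all hypothetical.

```python
from typing import Callable, List, Set


class FakeTarget:
    """Hypothetical target whose 'remote store' is just a set of paths."""

    def __init__(self, path: str, store: Set[str]) -> None:
        self.path = path
        self.store = store

    def exists(self) -> bool:
        # A real cloud target would issue a metadata request here.
        return self.path in self.store

    def touch(self) -> None:
        self.store.add(self.path)


def run_if_needed(target: FakeTarget, work: Callable[[], None]) -> bool:
    """Run `work` only when the target is missing; return True if it ran."""
    if target.exists():
        return False
    work()
    return True


store: Set[str] = set()
target = FakeTarget("s3://example-bucket/out/part-0000", store)
runs: List[str] = []


def work() -> None:
    runs.append("ran")
    target.touch()  # producing the output is what marks the task complete


first = run_if_needed(target, work)   # target missing, so the work runs
second = run_if_needed(target, work)  # target present, so the work is skipped
```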

2. Atomic Local-then-Upload (Two-Phase Commit for Files)

Writing directly to a remote store is risky: a network interruption mid-write leaves a corrupt partial file that the scheduler would interpret as a completed task. The solution is a two-phase approach:

Write all output to a temporary local file.
Once the write completes successfully, upload the local file to the cloud store in a single (possibly multi-part) operation.
Delete the temporary local file.

If the local write or the upload fails, the remote object never appears, so the scheduler correctly considers the task incomplete on the next run.
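The local-then-upload steps above can be sketched with a context manager. This is a minimal illustration, assuming a dictionary named REMOTE in place of the object store and a hypothetical upload helper; a real implementation would call the cloud SDK's upload routine instead.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import IO, Dict, Iterator

# Hypothetical stand-in for the remote object store.
REMOTE: Dict[str, bytes] = {}


def upload(local_path: str, remote_path: str) -> None:
    """Hypothetical upload: copy the finished local file into REMOTE."""
    with open(local_path, "rb") as f:
        REMOTE[remote_path] = f.read()


@contextmanager
def atomic_remote_write(remote_path: str) -> Iterator[IO[bytes]]:
    fd, tmp_path = tempfile.mkstemp()
    try:
        # Phase 1: the caller writes everything to a temporary local file.
        with os.fdopen(fd, "wb") as f:
            yield f
        # Phase 2: reached only if the caller's block raised no exception.
        upload(tmp_path, remote_path)
    finally:
        # Clean up the temporary file whether or not the upload happened.
        os.remove(tmp_path)


# A successful write becomes visible remotely.
with atomic_remote_write("s3://example-bucket/ok.txt") as out:
    out.write(b"complete")

# A write that fails midway never reaches the remote store.
try:
    with atomic_remote_write("s3://example-bucket/bad.txt") as out:
        out.write(b"partial")
        raise RuntimeError("simulated failure mid-write")
except RuntimeError:
    pass
```

The sketch simulates a failure during the local write; a failure inside upload would propagate the same way and likewise leave no remote object in this model.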

Related Pages

Implementation:Spotify_Luigi_S3Target
