Principle: External Compute Orchestration (dagster-io/dagster)
| Field | Value |
|---|---|
| Principle Name | External Compute Orchestration |
| Category | Data Orchestration |
| Domains | Data_Engineering, Serverless, GPU_Computing |
| Repository | dagster-io/dagster |
Overview
Strategy for orchestrating computation in external processes (serverless functions, GPU workers, containers) while still reporting asset metadata back to Dagster.
Description
External compute orchestration allows Dagster to launch and monitor computations running outside the Dagster process. The Dagster Pipes protocol enables external processes to report asset materializations, metadata, and logs back to Dagster. This is essential for GPU workloads (Modal, SageMaker), containerized jobs (Docker, K8s), and serverless functions (AWS Lambda) where the compute environment is separate from Dagster's orchestrator.
The protocol defines a two-way communication channel:
- The orchestrator (Dagster) launches the external process and passes context (asset keys, partition keys, extras) via a context injector.
- The external process receives context, performs computation, and reports results (materializations, metadata, logs) back via a message reader.
- The transport layer (stdout, file, cloud storage) carries messages between the two sides, decoupling them from needing direct network connectivity.
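The channel described above can be sketched end-to-end using only the standard library. This is a minimal simulation, not the real Pipes wire format: the env-var names and message fields are hypothetical. The orchestrator injects context through a JSON file, the external process appends JSON-line messages to a second file, and the orchestrator reads them back, mirroring the injector/reader/transport split.

```python
import json
import os
import subprocess
import sys
import tempfile

# The external side: read injected context, do some work, and append a
# materialization message to the channel file.
EXTERNAL_SCRIPT = """
import json, os
with open(os.environ["PIPES_CONTEXT_PATH"]) as f:
    context = json.load(f)
message = {
    "type": "asset_materialization",
    "asset_key": context["asset_key"],
    "metadata": {"row_count": 100},
}
with open(os.environ["PIPES_MESSAGES_PATH"], "a") as f:
    f.write(json.dumps(message) + "\\n")
"""

def run_external(asset_key: str) -> list:
    """Orchestrator side: inject context, launch the process, read messages."""
    with tempfile.TemporaryDirectory() as tmp:
        context_path = os.path.join(tmp, "context.json")
        messages_path = os.path.join(tmp, "messages.jsonl")
        # Context injector: write context where the external process can find it.
        with open(context_path, "w") as f:
            json.dump({"asset_key": asset_key}, f)
        # Launch the external process; channel locations travel via environment.
        subprocess.run(
            [sys.executable, "-c", EXTERNAL_SCRIPT],
            env={
                **os.environ,
                "PIPES_CONTEXT_PATH": context_path,
                "PIPES_MESSAGES_PATH": messages_path,
            },
            check=True,
        )
        # Message reader: collect everything the external process reported back.
        with open(messages_path) as f:
            return [json.loads(line) for line in f if line.strip()]
```

Note that neither side holds a network connection to the other; the file is the transport, which is exactly why the pattern extends to cloud storage when the two processes run on different machines.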
Usage
Use when computation must run in an external environment (GPU clusters, serverless platforms, containers) but you need the results tracked in Dagster's asset graph with proper metadata, lineage, and observability. Common scenarios include:
- GPU workloads -- Machine learning training or inference on Modal, SageMaker, or dedicated GPU servers
- Containerized jobs -- Docker or Kubernetes jobs that run in isolated environments
- Serverless functions -- AWS Lambda, Google Cloud Functions, or Azure Functions triggered by Dagster
- Legacy systems -- Existing scripts or processes that cannot be modified to import Dagster directly
Theoretical Basis
Pipes implements the remote procedure call (RPC) pattern adapted for data orchestration. The protocol defines a message format (materialization events, metadata, logs) that flows from the external process back to Dagster through a transport layer (stdout, file, cloud storage). This decouples the execution environment from the orchestration plane, following the sidecar pattern common in microservices architectures.
Key theoretical properties:
- Separation of concerns -- The orchestration plane (scheduling, dependency management, observability) is decoupled from the execution plane (compute, data processing).
- Transport agnosticism -- The protocol is independent of the transport mechanism, allowing communication over stdout, files, S3, or any custom channel.
- Minimal external dependency -- The external process only needs the lightweight `dagster-pipes` package, not the full Dagster framework.
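Transport agnosticism can be made concrete with a small sketch (interface and class names here are hypothetical, not the Pipes API): the message payload stays fixed while the channel that carries it is swapped behind a writer interface, which is how stdout, file, and cloud-storage transports can coexist.

```python
import json
from typing import Protocol


class MessageWriter(Protocol):
    """Anything that can carry a protocol message back to the orchestrator."""

    def write(self, message: dict) -> None: ...


class StdoutMessageWriter:
    def write(self, message: dict) -> None:
        # Stdout transport: the orchestrator tails the process's output.
        print(json.dumps(message), flush=True)


class FileMessageWriter:
    def __init__(self, path: str) -> None:
        self.path = path

    def write(self, message: dict) -> None:
        # File transport: the orchestrator polls a shared file (or bucket key).
        with open(self.path, "a") as f:
            f.write(json.dumps(message) + "\n")


def report_materialization(writer: MessageWriter, asset_key: str, metadata: dict) -> None:
    # The message format is identical regardless of which transport carries it.
    writer.write({
        "type": "asset_materialization",
        "asset_key": asset_key,
        "metadata": metadata,
    })
```

Because the writer is injected, the external process never branches on where it is running; the orchestrator picks the transport that suits the environment.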