Principle:Datahub project Datahub Metadata Sink Verification
| Property | Value |
|---|---|
| Page Type | Principle |
| Workflow | Metadata_Ingestion_Pipeline |
| Concept | Verified delivery of metadata records to a destination system |
| Repository | https://github.com/datahub-project/datahub |
| Implemented By | Implementation:Datahub_project_Datahub_DatahubRestSink_Write_Record |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Description
The Metadata Sink Verification principle ensures that every metadata record emitted by the ingestion pipeline is delivered to its destination with confirmed success or a clearly reported failure. In a metadata platform, silent data loss is unacceptable: if a dataset's schema or ownership metadata fails to reach the metadata store, downstream consumers will operate on stale or incomplete information without knowing it.
DataHub addresses this through a write-ahead verification pattern implemented by the sink abstraction. Every call to write a metadata record includes a WriteCallback that is invoked with either a success or failure notification after the write completes (or fails). The pipeline uses these callbacks to maintain accurate counters of records written, records failed, and records with warnings, which are surfaced in the final pipeline summary.
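The callback contract described above can be sketched as follows. This is a simplified illustration of the pattern, not DataHub's exact API: the class names and method signatures (on_success, on_failure, the toy write_record function) are assumptions modeled on the description.

```python
class WriteCallback:
    """Callback contract: every write ends in exactly one of these calls."""

    def on_success(self, record, metadata):
        raise NotImplementedError

    def on_failure(self, record, error, metadata):
        raise NotImplementedError


class CountingCallback(WriteCallback):
    """Maintains the counters the pipeline surfaces in its final summary."""

    def __init__(self):
        self.written = 0
        self.failed = 0

    def on_success(self, record, metadata):
        self.written += 1

    def on_failure(self, record, error, metadata):
        self.failed += 1


def write_record(record, callback):
    """Toy sink write: rejects records without a 'urn' field, accepts the rest."""
    if "urn" in record:
        callback.on_success(record, {})
    else:
        callback.on_failure(record, ValueError("missing urn"), {})
```

Because every write path ends in exactly one callback invocation, the pipeline's counters always add up to the total number of records emitted; nothing can be silently dropped.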
The sink abstraction itself is pluggable: the framework defines a generic Sink interface, and concrete implementations (such as DatahubRestSink, DatahubKafkaSink, and FileSink) handle the specifics of delivering records to their respective backends. This pluggability means the verification contract is consistent regardless of the destination.
The REST sink in particular supports three delivery modes: SYNC, ASYNC, and ASYNC_BATCH. Each offers a different trade-off between throughput and ordering guarantees, but all provide the same callback-based verification contract.
Usage
Metadata Sink Verification is active in every pipeline run that writes to a sink (i.e., every non-dry-run execution):
- Standard ingestion: The pipeline writes each transformed metadata record to the sink via write_record_async(). The callback tracks success/failure per record.
- Batch mode: In ASYNC_BATCH mode, records are grouped by entity URN and emitted in batches to the ingestProposalBatch endpoint. Batch-level success or failure is reported back through the same callback mechanism.
- Error handling: When a write fails due to an OperationalError (e.g., a server-side validation rejection), the error details, including the entity URN, stack trace summary, and error message, are recorded in the sink report. Non-operational exceptions (e.g., network timeouts) are also captured and reported.
- Dead-letter queue: When the failure_log configuration is enabled, failed records are written to a file-based dead-letter queue via the DeadLetterQueueCallback, enabling later retry or analysis.
- Pipeline summary: After all records are processed, pretty_print_summary() reports the total records written, failure count, and warning count, with color-coded terminal output indicating overall success, warnings, or failures.
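The color-coded summary step can be sketched as a small function. This is a simplified stand-in for the behavior of pretty_print_summary(), not its actual implementation; the function name summarize and its exact output format are assumptions.

```python
def summarize(written: int, failures: int, warnings: int) -> str:
    """Sketch of a color-coded pipeline summary line (ANSI escape codes)."""
    if failures:
        status, color = "FAILED", "\033[31m"          # red
    elif warnings:
        status, color = "COMPLETED WITH WARNINGS", "\033[33m"  # yellow
    else:
        status, color = "SUCCEEDED", "\033[32m"       # green
    return (f"{color}Pipeline {status}\033[0m: "
            f"{written} written, {failures} failures, {warnings} warnings")
```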
Theoretical Basis
The Metadata Sink Verification principle is built on several established patterns:
Write-ahead verification pattern. Each write operation is paired with a callback that reports the outcome. This ensures that the pipeline has a complete accounting of every record's delivery status. Unlike fire-and-forget approaches, this pattern guarantees that no record is silently lost; every record is either confirmed written or confirmed failed.
Sink abstraction for pluggable destinations. The Sink[ConfigType, ReportType] generic base class defines the contract that all sinks must implement: write_record_async(), handle_work_unit_start(), handle_work_unit_end(), and close(). This abstraction allows the pipeline to treat all destinations uniformly, whether the records are going to a REST API, a Kafka topic, or a local file.
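The generic sink contract can be sketched as an abstract base class. The method names come from the contract listed above; the constructor shape, parameter names, and the toy in-memory FileSink implementation are simplifying assumptions, not DataHub's actual code.

```python
from abc import ABC, abstractmethod
from typing import Generic, TypeVar

ConfigType = TypeVar("ConfigType")
ReportType = TypeVar("ReportType")


class Sink(ABC, Generic[ConfigType, ReportType]):
    """Sketch of the generic sink contract: all destinations look the same
    to the pipeline."""

    def __init__(self, config: ConfigType, report: ReportType):
        self.config = config
        self.report = report

    def handle_work_unit_start(self, workunit) -> None: ...

    def handle_work_unit_end(self, workunit) -> None: ...

    @abstractmethod
    def write_record_async(self, record_envelope, write_callback) -> None: ...

    @abstractmethod
    def close(self) -> None: ...


class FileSink(Sink[dict, list]):
    """Toy implementation: appends records to an in-memory list and
    confirms each write through the callback."""

    def write_record_async(self, record_envelope, write_callback) -> None:
        self.report.append(record_envelope)
        write_callback(record_envelope, None)  # None error -> success

    def close(self) -> None:
        pass
```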
Asynchronous emission with bounded concurrency. The REST sink uses a PartitionExecutor or BatchPartitionExecutor to submit write operations to a thread pool. The max_pending_requests parameter bounds the number of in-flight operations, providing backpressure to the pipeline and preventing out-of-memory conditions when the sink is slower than the source. Each completed future triggers the _write_done_callback, which classifies the result and updates the report.
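The backpressure mechanism can be illustrated with a thread pool guarded by a semaphore. This is a generic sketch of bounded concurrency, not DataHub's PartitionExecutor; the class name BoundedExecutor is an assumption, and only max_pending_requests is taken from the description.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedExecutor:
    """Thread pool whose submit() blocks once max_pending_requests operations
    are in flight, giving the pipeline backpressure instead of unbounded
    memory growth when the sink is slower than the source."""

    def __init__(self, workers: int, max_pending_requests: int):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._slots = threading.Semaphore(max_pending_requests)

    def submit(self, fn, *args):
        self._slots.acquire()  # block the caller when too many writes are pending
        future = self._pool.submit(fn, *args)
        future.add_done_callback(lambda f: self._slots.release())
        return future

    def shutdown(self):
        self._pool.shutdown(wait=True)
```

In the same spirit, each future's done-callback is where a real sink would classify the result and update its report.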
Partitioned ordering guarantees. Records are partitioned by entity URN, ensuring that all metadata changes for the same entity are submitted in order to the same thread. This preserves causal ordering within an entity while allowing concurrent writes across different entities.
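Partitioning by entity URN reduces to a stable hash of the URN modulo the worker count, so that every record for a given entity lands on the same worker. A minimal sketch (the function name is illustrative):

```python
import hashlib


def partition_for(urn: str, num_workers: int) -> int:
    """Map an entity URN to a worker index. All records for the same URN hit
    the same worker, preserving per-entity ordering while spreading distinct
    entities across workers. Uses a stable hash so the mapping is consistent
    across processes (unlike Python's built-in hash())."""
    digest = hashlib.sha256(urn.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers
```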
Structured error classification. The sink distinguishes between OperationalError (server-side rejections that can be reported as warnings or failures depending on the work unit's treat_errors_as_warnings flag) and unexpected exceptions (always reported as failures). Error information is enriched with the entity URN and work unit ID for traceability, and stack traces are trimmed to avoid overwhelming the report.
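The classification rule reduces to a two-way branch. The sketch below assumes a stand-in OperationalError class and a classify_write_error helper; only the distinction itself and the treat_errors_as_warnings flag come from the description.

```python
class OperationalError(Exception):
    """Stand-in for a server-side rejection carrying structured error info."""

    def __init__(self, message, info=None):
        super().__init__(message)
        self.info = info or {}


def classify_write_error(error: Exception, treat_errors_as_warnings: bool) -> str:
    """Operational errors may be downgraded to warnings per work unit;
    anything unexpected is always a hard failure."""
    if isinstance(error, OperationalError):
        return "warning" if treat_errors_as_warnings else "failure"
    return "failure"
```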
Dead-letter queue for failed records. When enabled, records that fail to write are captured by the DeadLetterQueueCallback and persisted to a file sink. This provides a durable record of failures that can be analyzed, corrected, and replayed, a standard pattern in distributed data systems.
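A file-backed dead-letter queue can be sketched as a JSON-lines appender. This illustrates the pattern only; the class name, on_failure signature, and JSON-lines format are assumptions, not the DeadLetterQueueCallback implementation.

```python
import json


class DeadLetterQueue:
    """Sketch: persist failed records as JSON lines for later replay or analysis."""

    def __init__(self, path: str):
        self._file = open(path, "a", encoding="utf-8")

    def on_failure(self, record, error: Exception) -> None:
        # One JSON object per line keeps the file appendable and easy to replay.
        self._file.write(json.dumps({"record": record, "error": str(error)}) + "\n")
        self._file.flush()

    def close(self) -> None:
        self._file.close()
```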