Principle:Datahub project Datahub Entity Upsert
| Field | Value |
|---|---|
| Principle Name | Entity Upsert |
| Category | Metadata Persistence |
| Status | Active |
| Last Updated | 2026-02-10 |
| Repository | Datahub_project_Datahub |
Overview
Entity upsert persists entity metadata to DataHub by emitting accumulated changes as metadata change proposals (MCPs). It takes a dirty entity (one with pending mutations) and emits all of its changes, handling both full aspect upserts and incremental patches, with retry logic and exponential backoff for reliability.
Description
The upsert operation is the primary mechanism for writing entity metadata to DataHub. It follows a carefully ordered emission strategy to ensure consistency:
- Bind entity to client: Sets the operation mode (SDK or INGESTION) on the entity for mode-aware operations
- Emit cached aspects: Full aspects from the entity builder (e.g., DatasetProperties set during construction) are emitted as UPSERT MCPs
- Emit pending full aspect MCPs: Aspects from `set*()` methods (e.g., `setTags()`, `setOwners()`) that replace entire aspects are emitted as UPSERT MCPs
- Wait for completion: All full aspect writes must complete before patches are sent, to prevent write-write races
- Emit patches: Accumulated incremental patches (from `addTag()`, `addOwner()`, etc.) are transformed through a VersionAwarePatchTransformer and emitted with retry logic
- Clear dirty flag: After successful emission, the entity's dirty flag is cleared
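The ordering above can be sketched with a small, self-contained simulation. This is only an illustration under stated assumptions: `OrderedEmissionSketch`, `RecordingEmitter`, and `Entity` are hypothetical stand-ins rather than DataHub SDK classes, and the emitter records calls in memory instead of talking to a server.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-ins for illustration; none of these types come from the DataHub SDK.
class OrderedEmissionSketch {

    /** Records the order in which change proposals reach the simulated server. */
    static final class RecordingEmitter {
        final List<String> log = Collections.synchronizedList(new ArrayList<>());

        CompletableFuture<Void> emit(String changeType, String aspect) {
            return CompletableFuture.runAsync(() -> log.add(changeType + ":" + aspect));
        }
    }

    /** Minimal dirty entity: cached aspects, pending full aspects, pending patches. */
    static final class Entity {
        final List<String> cachedAspects = new ArrayList<>();
        final List<String> pendingFullAspects = new ArrayList<>();
        final List<String> pendingPatches = new ArrayList<>();
        boolean dirty = true;
        String mode;
    }

    static void upsert(Entity entity, RecordingEmitter emitter) {
        // 1. Bind the entity to the client's operation mode.
        entity.mode = "SDK";

        // 2 and 3. Emit cached aspects, then pending full aspects, as UPSERTs.
        List<CompletableFuture<Void>> fullWrites = new ArrayList<>();
        for (String aspect : entity.cachedAspects) {
            fullWrites.add(emitter.emit("UPSERT", aspect));
        }
        for (String aspect : entity.pendingFullAspects) {
            fullWrites.add(emitter.emit("UPSERT", aspect));
        }

        // 4. Block until every full aspect write has completed.
        CompletableFuture.allOf(fullWrites.toArray(new CompletableFuture[0])).join();

        // 5. Only now emit patches, one at a time.
        for (String aspect : entity.pendingPatches) {
            emitter.emit("PATCH", aspect).join();
        }

        // 6. Clear the dirty flag after successful emission.
        entity.dirty = false;
    }
}
```

The `join()` on the combined future is what enforces the barrier: full aspect writes may finish in any order relative to one another, but no patch is emitted until all of them are done.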
The upsert operation handles version conflicts through retry with exponential backoff:
- Up to 3 retries with delays of 100ms, 200ms, 400ms
- Version conflict detection via HTTP 422 responses with "Expected version X, actual version Y" patterns
- Each retry re-reads the current server state before re-applying the patch
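A minimal sketch of this retry policy follows. The class, the `Response` shape, and the `Supplier`-based attempt are assumptions made for illustration, not the SDK's actual types. The regular expression also assumes the version numbers in the message are integers.

```java
import java.util.function.Supplier;
import java.util.regex.Pattern;

// Hypothetical sketch of the retry policy; names are not taken from the DataHub SDK.
class PatchRetrySketch {
    static final int MAX_RETRIES = 3;
    static final long BASE_DELAY_MS = 100;

    // Assumes the versions in the conflict message are (possibly negative) integers.
    static final Pattern VERSION_CONFLICT =
            Pattern.compile("Expected version (-?\\d+), actual version (-?\\d+)");

    /** Assumed response shape: HTTP status code plus body text. */
    static final class Response {
        final int status;
        final String body;

        Response(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    static boolean isVersionConflict(Response r) {
        return r.status == 422 && r.body != null && VERSION_CONFLICT.matcher(r.body).find();
    }

    /** Delay before the given 0-based retry: 100, 200, 400 ms. */
    static long delayMillis(int retry) {
        return BASE_DELAY_MS << retry;
    }

    /**
     * Runs the attempt once, then retries up to MAX_RETRIES times on version conflicts.
     * The attempt is expected to re-read server state and re-apply the patch on each call.
     */
    static Response emitWithRetry(Supplier<Response> attempt) {
        Response r = attempt.get();
        for (int retry = 0; retry < MAX_RETRIES && isVersionConflict(r); retry++) {
            try {
                Thread.sleep(delayMillis(retry));
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                throw new RuntimeException("interrupted while backing off", ex);
            }
            r = attempt.get();
        }
        return r;
    }
}
```

Responses that are not version conflicts are returned immediately, so only the 422 conflict case pays the backoff cost. If the conflict persists after the third retry, the last conflicting response is returned to the caller.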
The VersionAwarePatchTransformer adapts patch MCPs based on the DataHub server version, falling back to read-modify-write full aspect replacements for older servers that lack certain patch templates.
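The fallback idea can be illustrated with a toy transformer over a set of tags. This is a sketch, not the real VersionAwarePatchTransformer: the `Proposal` type, the `serverSupportsPatch` flag (assumed to be derived from the server version elsewhere), and the `readCurrentTags` supplier are all hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical illustration of the fallback; not the SDK's VersionAwarePatchTransformer.
class PatchFallbackSketch {

    /** Assumed proposal shape: a change type plus the tag set it carries. */
    static final class Proposal {
        final String changeType;
        final Set<String> tags;

        Proposal(String changeType, Set<String> tags) {
            this.changeType = changeType;
            this.tags = tags;
        }
    }

    /**
     * If the server supports the patch template, pass the patch through unchanged.
     * Otherwise read the current aspect, merge the additions locally, and emit
     * the merged result as a full UPSERT (read-modify-write).
     */
    static Proposal transform(Set<String> tagsToAdd,
                              boolean serverSupportsPatch,
                              Supplier<Set<String>> readCurrentTags) {
        if (serverSupportsPatch) {
            return new Proposal("PATCH", new LinkedHashSet<>(tagsToAdd));
        }
        Set<String> merged = new LinkedHashSet<>(readCurrentTags.get());
        merged.addAll(tagsToAdd);
        return new Proposal("UPSERT", merged);
    }
}
```

Taking the current state as a `Supplier` means the extra read happens only on the fallback path, so servers that support the patch incur no additional round trip.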
Usage
Use upsert to save new or modified entity metadata to DataHub after construction and enrichment. It is the final step in the entity lifecycle:
- Construct entity via builder
- Enrich with metadata (tags, owners, terms, domains, custom properties)
- Persist via `client.entities().upsert(entity)`
The upsert is idempotent -- upserting an entity that already exists updates it; upserting a new entity creates it.
Theoretical Basis
Upsert (Update or Insert) pattern -- Creates the entity if it does not exist, updates it if it does. This eliminates the need for separate create/update code paths and simplifies the client API.
Optimistic concurrency with retry logic -- Patches use version-aware conflict detection. When a concurrent modification is detected (HTTP 422 version conflict), the operation retries with exponential backoff, re-reading the current server state before re-applying the change.
Ordered emission strategy -- Full aspect writes must complete before patches are sent. This prevents races where a patch could target an aspect that has not yet been created, and ensures patches apply to the most recent version.
Related
- Implemented by: Datahub_project_Datahub_EntityClient_Upsert
- Depends on: Datahub_project_Datahub_Entity_Metadata_Enrichment
- Depends on: Datahub_project_Datahub_Entity_Construction
- Related Principle: Datahub_project_Datahub_Entity_Read_Modify
- Heuristic: Heuristic:Datahub_project_Datahub_Validation_Across_All_APIs