Principle:DataExpert io Data engineer handbook Graph Vertex Generation

Overview

Graph Vertex Generation is the process of converting relational data into a vertex format suitable for graph databases and graph-based analytics. Each vertex represents an entity with an identifier, a type label, and a flexible property map constructed from relational columns.

Theory

Transforming relational data into graph vertices involves several key operations:

Deduplication via ROW_NUMBER() Window Function - Relational source tables often contain duplicate rows for the same entity (e.g., a team appearing in multiple seasons). The ROW_NUMBER() window function, partitioned by the entity's natural key and ordered deterministically, assigns a sequence number to each duplicate. Filtering to ROW_NUMBER() = 1 retains exactly one representative row per entity.
Map Construction from Multiple Columns - Graph vertex properties are stored as a key-value map rather than fixed columns. The SQL MAP() function (or equivalent) constructs this map from multiple relational columns, enabling flexible property storage without schema rigidity.
Type Labeling - Each vertex receives a type label (e.g., 'team', 'player') that categorizes it within the graph schema, enabling type-specific traversals and queries.

When to Apply

This pattern is appropriate when:

Building graph databases from relational sources (e.g., converting NBA team data into graph vertices)
Source data contains duplicates that must be resolved before vertex creation
The target graph schema requires flexible property maps rather than fixed columns
Multiple entity types need to be unified under a common vertex format (identifier, type, properties)

Theoretical Basis

The transformation relies on:

ROW_NUMBER() Window Function - deterministic deduplication by assigning ordinal positions within each partition
MAP() Construction - converting named columns into a single key-value property map for schema-flexible storage
Common Vertex Schema - standardizing diverse entities into a uniform (identifier, type, properties) format for graph ingestion

-- Conceptual pattern
WITH entities_deduped AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY entity_id ORDER BY some_column) AS row_num
    FROM raw_entities
)
SELECT
    entity_id AS identifier,
    'entity_type' AS type,
    MAP('prop1', col1, 'prop2', col2) AS properties
FROM entities_deduped
WHERE row_num = 1

Related Pages

Implementation:DataExpert_io_Data_engineer_handbook_Do_team_vertex_transformation

Knowledge Sources

Metadata

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment