Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Spotify Luigi Table Schema Definition

From Leeroopedia


Knowledge Sources Spotify Luigi Repository
Domains Pipeline_Orchestration, Database, ETL
Last Updated 2026-02-10 00:00 GMT

Overview

Table Schema Definition is the practice of declaring the column names, data types, and constraints of a database table within the pipeline task itself, so that the target table can be automatically created or validated before data is loaded.

Description

In a database ingestion pipeline, the destination table must exist and have a compatible schema before data can be inserted. Table Schema Definition addresses how the pipeline task declares the expected structure of the target table -- column names, their SQL types, and optional constraints -- as a first-class part of the tasks configuration.

This principle involves several interrelated concerns:

  • Declarative column specification: The task class carries a columns attribute that lists each columns name and type. This makes the schema self-documenting and co-located with the data loading logic, rather than scattered across separate DDL scripts.
  • Automatic table creation: When the target table does not yet exist, the framework uses the declared column definitions to generate and execute a CREATE TABLE statement. This eliminates manual schema migration steps for new tables.
  • Graceful handling of existing tables: If the table already exists, the creation step is either skipped or the existing schema is reflected (loaded from the database), ensuring that the pipeline does not fail on repeated runs.
  • Format flexibility: Different backends may accept different formats for column specifications. A simple RDBMS backend might accept tuples of (name, type_string), while an ORM-based backend might accept richer constructs with keyword arguments for constraints, indexes, and defaults.
  • Schema-data contract: The column definitions serve as a contract between the data producer (the rows() method) and the data consumer (the database table). Each row must yield values in the same order and of compatible types as declared in columns.

The general pattern is: the task declares columns, the create_table() method reads those declarations and translates them into the backend-appropriate DDL, and the run() method calls create_table() if the table is missing before proceeding to load data.

Usage

Use Table Schema Definition when:

  • Your pipeline creates new database tables as part of its output, and you want the table structure to be defined alongside the task code rather than maintained separately.
  • You need automatic table creation on first run so that new environments (development, staging) can bootstrap without manual DDL execution.
  • You want the schema to serve as documentation of the expected data format for downstream consumers.
  • Your pipeline must support multiple database backends, each of which may have different DDL syntax, and you want a single declarative format that the framework translates per backend.
  • You are using a reflect-capable ORM layer and want to bind to an existing tables schema at runtime rather than redeclaring it.

Theoretical Basis

Table Schema Definition follows the Metadata Mapping pattern, where the mapping between in-memory data structures (row tuples) and persistent storage structures (database columns) is declared as metadata rather than encoded in imperative code.

The algorithm for schema-driven table creation proceeds as follows:

  1. The task class declares a columns attribute, which is a list of column specifications. The simplest format is a list of (column_name, column_type) tuples for SQL-string-based backends, or a list of ([column_name, SQLAlchemyType], {kwargs}) tuples for ORM-based backends.
  2. When run() executes, it first attempts to copy data into the table.
  3. If the copy fails because the table does not exist (detected by a database error code such as undefined_table in PostgreSQL), the framework resets the connection and calls create_table(connection).
  4. The create_table() method iterates over the columns list, constructs a CREATE TABLE DDL statement with the appropriate column definitions, and executes it.
  5. For ORM-based backends, create_table() may instead construct SQLAlchemy Column objects and use metadata.create_all(engine) to issue the DDL.
  6. After successful creation, the data copy is retried, this time succeeding because the table now exists.
  7. If the reflect flag is set (ORM-based backends), the framework skips DDL generation and instead loads the existing tables schema via metadata.reflect().

This two-attempt strategy (try copy, create table on failure, retry copy) provides a clean separation between the normal-case fast path (table exists) and the first-run setup path (table must be created).

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment