Principle:Spotify Luigi Table Schema Definition
| Knowledge Sources | Spotify Luigi Repository |
|---|---|
| Domains | Pipeline_Orchestration, Database, ETL |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Table Schema Definition is the practice of declaring the column names, data types, and constraints of a database table within the pipeline task itself, so that the target table can be automatically created or validated before data is loaded.
Description
In a database ingestion pipeline, the destination table must exist and have a compatible schema before data can be inserted. Table Schema Definition addresses how the pipeline task declares the expected structure of the target table -- column names, their SQL types, and optional constraints -- as a first-class part of the tasks configuration.
This principle involves several interrelated concerns:
- Declarative column specification: The task class carries a
columnsattribute that lists each columns name and type. This makes the schema self-documenting and co-located with the data loading logic, rather than scattered across separate DDL scripts. - Automatic table creation: When the target table does not yet exist, the framework uses the declared column definitions to generate and execute a
CREATE TABLEstatement. This eliminates manual schema migration steps for new tables. - Graceful handling of existing tables: If the table already exists, the creation step is either skipped or the existing schema is reflected (loaded from the database), ensuring that the pipeline does not fail on repeated runs.
- Format flexibility: Different backends may accept different formats for column specifications. A simple RDBMS backend might accept tuples of
(name, type_string), while an ORM-based backend might accept richer constructs with keyword arguments for constraints, indexes, and defaults. - Schema-data contract: The column definitions serve as a contract between the data producer (the
rows()method) and the data consumer (the database table). Each row must yield values in the same order and of compatible types as declared incolumns.
The general pattern is: the task declares columns, the create_table() method reads those declarations and translates them into the backend-appropriate DDL, and the run() method calls create_table() if the table is missing before proceeding to load data.
Usage
Use Table Schema Definition when:
- Your pipeline creates new database tables as part of its output, and you want the table structure to be defined alongside the task code rather than maintained separately.
- You need automatic table creation on first run so that new environments (development, staging) can bootstrap without manual DDL execution.
- You want the schema to serve as documentation of the expected data format for downstream consumers.
- Your pipeline must support multiple database backends, each of which may have different DDL syntax, and you want a single declarative format that the framework translates per backend.
- You are using a reflect-capable ORM layer and want to bind to an existing tables schema at runtime rather than redeclaring it.
Theoretical Basis
Table Schema Definition follows the Metadata Mapping pattern, where the mapping between in-memory data structures (row tuples) and persistent storage structures (database columns) is declared as metadata rather than encoded in imperative code.
The algorithm for schema-driven table creation proceeds as follows:
- The task class declares a
columnsattribute, which is a list of column specifications. The simplest format is a list of(column_name, column_type)tuples for SQL-string-based backends, or a list of([column_name, SQLAlchemyType], {kwargs})tuples for ORM-based backends. - When
run()executes, it first attempts to copy data into the table. - If the copy fails because the table does not exist (detected by a database error code such as
undefined_tablein PostgreSQL), the framework resets the connection and callscreate_table(connection). - The
create_table()method iterates over thecolumnslist, constructs aCREATE TABLEDDL statement with the appropriate column definitions, and executes it. - For ORM-based backends,
create_table()may instead construct SQLAlchemyColumnobjects and usemetadata.create_all(engine)to issue the DDL. - After successful creation, the data copy is retried, this time succeeding because the table now exists.
- If the
reflectflag is set (ORM-based backends), the framework skips DDL generation and instead loads the existing tables schema viametadata.reflect().
This two-attempt strategy (try copy, create table on failure, retry copy) provides a clean separation between the normal-case fast path (table exists) and the first-run setup path (table must be created).