Principle:Spotify Luigi SQL Query Execution
| Knowledge Sources | |
|---|---|
| Domains | Database, SQL_Analytics |
| Last Updated | 2026-02-10 08:00 GMT |
Overview
Executing SQL queries against distributed query engines as pipeline steps for large-scale analytical data processing.
Description
SQL query execution is the practice of embedding SQL queries as pipeline task steps that run against distributed query engines. Engines such as Presto (Trino), Apache Spark SQL, Google BigQuery, and Amazon Athena allow running standard SQL queries over massive datasets stored in distributed file systems, object stores, or federated data sources. In a data pipeline, a task defines a SQL query, submits it to the query engine, monitors execution progress, and treats the query results (or a result table) as the task's output. This allows data engineers to express complex transformations, aggregations, and joins using familiar SQL syntax while leveraging the distributed execution capabilities of the query engine for performance and scalability.
Usage
Use SQL query execution when pipeline transformations can be naturally expressed in SQL, when data resides in distributed storage systems accessible by SQL query engines, when the scale of data requires distributed query processing, or when the organization's data platform is centered around SQL-based analytics tools.
Theoretical Basis
SQL query execution in pipelines combines relational algebra with distributed query processing:
1. Query Definition -- The pipeline task defines a SQL query as a declarative specification of the desired transformation. The query expresses operations from relational algebra: * Selection (WHERE) -- Filter rows matching a predicate * Projection (SELECT) -- Choose specific columns * Join (JOIN) -- Combine rows from multiple tables * Aggregation (GROUP BY) -- Compute summary statistics * Set Operations (UNION, INTERSECT) -- Combine result sets 2. Query Submission -- The query is submitted to the distributed query engine via its client protocol (JDBC, REST API, or native protocol). The engine returns a query identifier for tracking. 3. Query Planning -- The query engine parses the SQL, resolves table references against its catalog, and produces a logical plan. The optimizer then transforms this into a physical plan: * Predicate pushdown -- Push filters closer to data sources to reduce data movement * Join reordering -- Choose optimal join order based on table statistics * Partition pruning -- Skip irrelevant data partitions based on query predicates 4. Distributed Execution -- The physical plan is divided into stages that execute across worker nodes: * Each stage processes a partition of the data in parallel * Intermediate results are shuffled between stages as needed (for joins and aggregations) * The coordinator node orchestrates execution and collects the final result 5. Progress Monitoring -- The pipeline polls the query engine for execution status, tracking metrics such as rows processed, stages completed, and estimated completion time. 6. Result Handling -- Query results are materialized in one of several ways: * Written to a destination table (CREATE TABLE AS SELECT) * Stored in a specified output location (INSERT INTO ... SELECT) * Returned as a result set for further processing 7. Completion Tracking -- The pipeline marks the task as complete once the query finishes successfully, typically by checking for the existence of the output table or a marker file.
The key advantage is declarative specification: the pipeline author describes what data to produce, and the query engine determines how to compute it efficiently across a distributed cluster.