Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Eventual Inc Daft SQL Query Analytics

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Analytics, SQL
Last Updated 2026-02-08 14:00 GMT

Overview

End-to-end process for running SQL-based analytical queries on data from multiple catalogs and table formats using Daft's built-in SQL engine with session and catalog management.

Description

This workflow covers using Daft as a SQL query engine for interactive analytics and data exploration. Daft provides a Rust-based SQL parser and planner that translates SQL queries into optimized Daft execution plans. Users register DataFrames or external catalog tables, then query them using standard SQL syntax including SELECT, JOIN, GROUP BY, HAVING, ORDER BY, window functions, and subqueries. The session system manages catalog state (Iceberg, Glue, Unity, Gravitino, PostgreSQL, Memory) and namespace scoping, allowing SQL queries to reference tables across multiple catalogs. SQL table functions (read_parquet, read_csv, read_json, read_deltalake) enable querying files directly without prior registration.

Usage

Execute this workflow when you prefer SQL syntax for data analysis or need to integrate Daft with SQL-based tooling. Typical triggers include:

  • Interactive data exploration using familiar SQL syntax
  • Querying across multiple catalog sources (Iceberg + PostgreSQL + in-memory) in a single session
  • Building analytical dashboards or reports from lakehouse tables
  • Running TPC-H or similar analytical benchmark queries
  • Migrating existing SQL-based pipelines to Daft's execution engine
  • Combining SQL queries with DataFrame API operations in a single pipeline

Execution Steps

Step 1: Session and Catalog Setup

Initialize a Daft session and attach the external catalogs that contain the tables you want to query. Catalogs provide namespace-scoped table discovery and metadata management. Daft supports Iceberg REST catalogs, AWS Glue, Unity Catalog, Apache Gravitino, PostgreSQL, S3 Tables, and in-memory catalogs.

Key considerations:

  • Use daft.Session() to create an isolated session or daft.current_session() for the default
  • Attach catalogs with daft.attach_catalog() or session.attach_catalog()
  • Set the active catalog and namespace with set_catalog() and set_namespace()
  • For quick ad-hoc analysis, register DataFrames directly with create_temp_table()
  • Catalogs are identified by name and can be referenced in SQL with dot notation (catalog.namespace.table)

Step 2: Table Registration

Make data sources available to the SQL engine by registering them as named tables. Tables can come from attached catalogs (auto-discovered), from DataFrames registered as temporary tables, or from SQL table functions that read files on the fly.

Key considerations:

  • Catalog tables are automatically available after catalog attachment
  • Use create_temp_table("name", dataframe) to register a DataFrame for SQL access
  • Use create_table("name", dataframe) for persistent table creation in writable catalogs
  • Use SHOW TABLES to list available tables in the current namespace
  • Use fully qualified names (catalog.namespace.table) to query across catalogs

Step 3: SQL Query Composition

Write SQL queries using Daft's supported SQL syntax. Daft supports standard SELECT statements with projections, filters (WHERE), aggregations (GROUP BY, HAVING), joins (INNER, LEFT, RIGHT, FULL, CROSS, SEMI, ANTI), set operations (UNION, INTERSECT, EXCEPT), subqueries, window functions (ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, running aggregates), and ORDER BY with LIMIT/OFFSET.

Key considerations:

  • Use daft.sql("SELECT ...") to execute queries and get a DataFrame result
  • SQL table functions like read_parquet(), read_csv(), read_json() allow querying files directly
  • Window functions support PARTITION BY, ORDER BY, and frame specifications (ROWS, RANGE)
  • NULL handling follows SQL semantics with IS NULL, IS NOT NULL, COALESCE, and null-safe equals (<=>)
  • User-defined functions registered with attach_function() are callable from SQL
  • Use sql_expr() to embed SQL expressions within DataFrame API calls

Step 4: Query Execution and Result Inspection

Execute the SQL query and inspect results. The sql() function returns a lazy DataFrame that can be further transformed with the DataFrame API, materialized with collect(), or displayed with show(). Use explain() to review the optimized query plan.

Key considerations:

  • daft.sql() returns a DataFrame (lazy until materialized)
  • Chain DataFrame operations after sql() for hybrid SQL-DataFrame workflows
  • Use show(N) to preview the first N rows without full materialization
  • Use explain() to see how the SQL query was optimized
  • Use collect() to materialize the full result into memory

Step 5: Result Export

Export query results to the desired format. Write to lakehouse tables (Iceberg, Delta Lake), file formats (Parquet, CSV, JSON), or convert to in-memory representations (Pandas, Arrow, Python lists).

Key considerations:

  • Use write_parquet() or write_csv() for file-based output
  • Use write_iceberg() or write_deltalake() for lakehouse table output
  • Use to_pandas() or to_arrow() for in-memory conversion
  • Use to_pylist() or to_pydict() for Python-native data structures
  • Use write_table() with session catalogs for catalog-managed writes

Execution Diagram

GitHub URL

Workflow Repository