
Principle:Haifengl Smile SQL Data Ingestion

From Leeroopedia


Overview

SQL Data Ingestion is the principle of loading external data sources -- such as CSV, Parquet, JSON, and Apache Iceberg tables -- directly into an in-process SQL database without intermediate ETL (Extract, Transform, Load) steps. The database engine reads files natively, creating in-memory tables that can immediately be queried with standard SQL.

In the Smile library, this principle is realized through the SQL class's file-loading methods, which delegate to DuckDB's built-in file-scanning operators to read heterogeneous data sources and register them as queryable tables.

Theoretical Basis

Traditional ETL vs. Direct Ingestion

In a traditional data pipeline, loading data into a database involves three distinct stages:

Stage     | Traditional ETL                                                  | Direct Ingestion
----------|------------------------------------------------------------------|------------------
Extract   | Read files using a separate library (e.g., Apache Commons CSV). | The database engine reads the file directly.
Transform | Parse data types, handle nulls, apply schema.                    | The engine infers or applies schema during the scan.
Load      | Insert rows into the database via INSERT statements.            | CREATE TABLE AS SELECT * FROM read_csv(...) in a single operation.

Direct ingestion collapses these three stages into a single operation handled entirely by the database engine. This provides several advantages:

  • Reduced memory copies: Data flows from disk to the database's internal columnar format without passing through intermediate Java objects.
  • Parallel I/O: The database engine can use multiple threads to read and parse files concurrently.
  • Schema inference: The engine can automatically detect column types from file headers and sample data.
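The idea behind schema inference can be shown with a toy example: given sampled text values for a column, pick the narrowest type that parses every value. This is a minimal sketch only; real engines such as DuckDB sample a configurable number of rows and also handle nulls, dates, and quoting. `inferType` is a hypothetical helper, not part of Smile or DuckDB.

```java
import java.util.List;

public class SchemaInference {
    /** Infer the narrowest SQL-like type that parses every sampled value. */
    static String inferType(List<String> samples) {
        boolean allLong = true, allDouble = true;
        for (String s : samples) {
            try { Long.parseLong(s); } catch (NumberFormatException e) { allLong = false; }
            try { Double.parseDouble(s); } catch (NumberFormatException e) { allDouble = false; }
        }
        if (allLong) return "BIGINT";    // every value is an integer
        if (allDouble) return "DOUBLE";  // some values need a floating-point type
        return "VARCHAR";                // fall back to text
    }

    public static void main(String[] args) {
        System.out.println(inferType(List.of("1", "2", "3")));  // BIGINT
        System.out.println(inferType(List.of("1", "2.5")));     // DOUBLE
        System.out.println(inferType(List.of("1", "abc")));     // VARCHAR
    }
}
```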

File Format Considerations

Different file formats have different properties that affect ingestion performance:

Format  | Storage Model                 | Schema                     | Compression            | Typical Use
--------|-------------------------------|----------------------------|------------------------|------------
CSV     | Row-oriented text             | Header-based or explicit   | None (raw text)        | Simple data exchange, spreadsheets
Parquet | Columnar binary               | Embedded in file metadata  | Snappy, Gzip, ZSTD     | Data lakes, large-scale analytics
JSON    | Semi-structured text          | Inferred from structure    | None (raw text)        | APIs, log data, nested data
Iceberg | Columnar (Parquet underneath) | Managed by Iceberg catalog | Inherited from Parquet | Versioned data lakes, time-travel queries

Parquet is the most efficient format for analytical ingestion because:

  • Its columnar layout matches the database engine's internal storage model, minimizing transformation.
  • Embedded statistics (min/max per column chunk) enable predicate pushdown -- the engine can skip irrelevant data blocks.
  • Built-in compression reduces I/O bandwidth requirements.
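Predicate pushdown with per-chunk statistics can be sketched as follows: each row group carries a min/max for a column, and a filter such as value > 100 can skip any group whose max cannot satisfy it, without reading its data. This is an illustrative model of the mechanism, not Parquet's actual reader API; the class and method names are invented.

```java
import java.util.List;

public class PredicatePushdown {
    /** Min/max statistics for one column chunk, as stored in Parquet footers. */
    record ChunkStats(long min, long max) {}

    /** Count how many chunks a filter "value > threshold" must actually read. */
    static long chunksToRead(List<ChunkStats> chunks, long threshold) {
        // A chunk is skipped entirely when its max is <= threshold:
        // no row inside it can possibly satisfy the predicate.
        return chunks.stream().filter(c -> c.max() > threshold).count();
    }

    public static void main(String[] args) {
        List<ChunkStats> chunks = List.of(
            new ChunkStats(0, 99),     // skipped: max <= 100
            new ChunkStats(50, 150),   // read: may contain matches
            new ChunkStats(200, 300)); // read: all rows match
        System.out.println(chunksToRead(chunks, 100)); // prints 2
    }
}
```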

CSV requires the most processing because every value must be parsed from text to its target type, and there is no embedded schema.

Glob Pattern Matching

Modern analytical databases support glob pattern matching for file paths, enabling multiple files to be treated as a single logical table. For example:

data/sales_*.csv       -- All CSV files matching the pattern
data/year=2024/**      -- Hive-partitioned directory structure

This is essential for partitioned datasets where data is split across many files (common in data lake architectures). The database engine unions the files automatically, provided they share the same schema.
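The same glob semantics are available in the JDK via java.nio.file.PathMatcher, which is a convenient way to check what a pattern like data/sales_*.csv or data/year=2024/** will match before handing it to the database engine:

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;

public class GlobDemo {
    public static void main(String[] args) {
        // "*" matches within one directory level; "**" crosses directories.
        PathMatcher flat = FileSystems.getDefault().getPathMatcher("glob:data/sales_*.csv");
        PathMatcher deep = FileSystems.getDefault().getPathMatcher("glob:data/year=2024/**");

        System.out.println(flat.matches(Path.of("data/sales_2024.csv")));                     // true
        System.out.println(flat.matches(Path.of("data/sales_2024.json")));                    // false
        System.out.println(deep.matches(Path.of("data/year=2024/month=01/part-0.parquet"))); // true
    }
}
```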

Fluent Builder Pattern

The data loading methods in Smile's SQL class return this (the SQL instance itself), enabling method chaining in a fluent builder style:

db.csv("customers", "customers.csv")
  .parquet("orders", "orders/*.parquet")
  .csv("products", "products.csv");

This fluent interface, a variant of the Builder design pattern, makes data-pipeline setup code readable and self-documenting: the sequence of loading steps reads like a declarative specification.
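The mechanism behind the chaining above can be reproduced with a toy loader whose registration methods return this. The names below (ToyLoader and its bookkeeping) are illustrative stand-ins, not Smile's actual classes:

```java
import java.util.ArrayList;
import java.util.List;

public class ToyLoader {
    private final List<String> tables = new ArrayList<>();

    /** Register a CSV source under a table name; returns this for chaining. */
    public ToyLoader csv(String table, String path) {
        tables.add(table + " <- csv:" + path);
        return this;
    }

    /** Register a Parquet source; glob patterns are passed through verbatim. */
    public ToyLoader parquet(String table, String path) {
        tables.add(table + " <- parquet:" + path);
        return this;
    }

    public List<String> tables() { return tables; }

    public static void main(String[] args) {
        ToyLoader db = new ToyLoader()
            .csv("customers", "customers.csv")
            .parquet("orders", "orders/*.parquet");
        System.out.println(db.tables());
        // [customers <- csv:customers.csv, orders <- parquet:orders/*.parquet]
    }
}
```

Because each call returns the same instance, the chain builds up registration state in order, which is what lets a load sequence read top-to-bottom as a specification.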

Relevance to Machine Learning

Data ingestion is the first step in any ML pipeline. The efficiency and flexibility of ingestion directly impact:

  • Iteration speed: Faster ingestion means faster experiment cycles during model development.
  • Data diversity: Supporting multiple formats means ML practitioners can work with data from various sources without preprocessing.
  • Reproducibility: Loading data via SQL statements creates a declarative, auditable record of what data was used for training.

Knowledge Sources

Domains

Data_Engineering, SQL, Analytics, ETL
