
Principle:Haifengl Smile SQL Data Ingestion

From Leeroopedia


Overview

SQL Data Ingestion is the principle of loading external data sources -- such as CSV, Parquet, JSON, and Apache Iceberg tables -- directly into an in-process SQL database without intermediate ETL (Extract, Transform, Load) steps. The database engine reads files natively, creating in-memory tables that can immediately be queried with standard SQL.

In the Smile library, this principle is realized through the SQL class's file-loading methods, which delegate to DuckDB's built-in file-scanning operators to read heterogeneous data sources and register them as queryable tables.

Theoretical Basis

Traditional ETL vs. Direct Ingestion

In a traditional data pipeline, loading data into a database involves three distinct stages:

Stage     | Traditional ETL                                                  | Direct Ingestion
----------|------------------------------------------------------------------|------------------
Extract   | Read files using a separate library (e.g., Apache Commons CSV). | The database engine reads the file directly.
Transform | Parse data types, handle nulls, apply schema.                    | The engine infers or applies schema during the scan.
Load      | Insert rows into the database via INSERT statements.            | CREATE TABLE AS SELECT * FROM read_csv(...) in a single operation.

Direct ingestion collapses these three stages into a single operation handled entirely by the database engine. This provides several advantages:

  • Reduced memory copies: Data flows from disk to the database's internal columnar format without passing through intermediate Java objects.
  • Parallel I/O: The database engine can use multiple threads to read and parse files concurrently.
  • Schema inference: The engine can automatically detect column types from file headers and sample data.
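The idea behind schema inference can be shown with a toy example: given sampled text values for a column, pick the narrowest type that parses every value. This is a minimal sketch only; real engines such as DuckDB sample a configurable number of rows and also handle nulls, dates, and quoting. `inferType` is a hypothetical helper, not part of Smile or DuckDB.

```java
import java.util.List;

public class SchemaInference {
    /** Infer the narrowest SQL-like type that parses every sampled value. */
    static String inferType(List<String> samples) {
        boolean allLong = true, allDouble = true;
        for (String s : samples) {
            try { Long.parseLong(s); } catch (NumberFormatException e) { allLong = false; }
            try { Double.parseDouble(s); } catch (NumberFormatException e) { allDouble = false; }
        }
        if (allLong) return "BIGINT";    // every value is an integer
        if (allDouble) return "DOUBLE";  // some values need a floating-point type
        return "VARCHAR";                // fall back to text
    }

    public static void main(String[] args) {
        System.out.println(inferType(List.of("1", "2", "3")));  // BIGINT
        System.out.println(inferType(List.of("1", "2.5")));     // DOUBLE
        System.out.println(inferType(List.of("1", "abc")));     // VARCHAR
    }
}
```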

File Format Considerations

Different file formats have different properties that affect ingestion performance:

Format  | Storage Model                 | Schema                     | Compression            | Typical Use
--------|-------------------------------|----------------------------|------------------------|------------
CSV     | Row-oriented text             | Header-based or explicit   | None (raw text)        | Simple data exchange, spreadsheets
Parquet | Columnar binary               | Embedded in file metadata  | Snappy, Gzip, ZSTD     | Data lakes, large-scale analytics
JSON    | Semi-structured text          | Inferred from structure    | None (raw text)        | APIs, log data, nested data
Iceberg | Columnar (Parquet underneath) | Managed by Iceberg catalog | Inherited from Parquet | Versioned data lakes, time-travel queries

Parquet is the most efficient format for analytical ingestion because:

  • Its columnar layout matches the database engine's internal storage model, minimizing transformation.
  • Embedded statistics (min/max per column chunk) enable predicate pushdown -- the engine can skip irrelevant data blocks.
  • Built-in compression reduces I/O bandwidth requirements.
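Predicate pushdown with per-chunk statistics can be sketched as follows: each row group carries a min/max for a column, and a filter such as value > 100 can skip any group whose max cannot satisfy it, without reading its data. This is an illustrative model of the mechanism, not Parquet's actual reader API; the class and method names are invented.

```java
import java.util.List;

public class PredicatePushdown {
    /** Min/max statistics for one column chunk, as stored in Parquet footers. */
    record ChunkStats(long min, long max) {}

    /** Count how many chunks a filter "value > threshold" must actually read. */
    static long chunksToRead(List<ChunkStats> chunks, long threshold) {
        // A chunk is skipped entirely when its max is <= threshold:
        // no row inside it can possibly satisfy the predicate.
        return chunks.stream().filter(c -> c.max() > threshold).count();
    }

    public static void main(String[] args) {
        List<ChunkStats> chunks = List.of(
            new ChunkStats(0, 99),     // skipped: max <= 100
            new ChunkStats(50, 150),   // read: may contain matches
            new ChunkStats(200, 300)); // read: all rows match
        System.out.println(chunksToRead(chunks, 100)); // prints 2
    }
}
```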

CSV requires the most processing because every value must be parsed from text to its target type, and there is no embedded schema.

Glob Pattern Matching

Modern analytical databases support glob pattern matching for file paths, enabling multiple files to be treated as a single logical table. For example:

data/sales_*.csv       -- All CSV files matching the pattern
data/year=2024/**      -- Hive-partitioned directory structure

This is essential for partitioned datasets where data is split across many files (common in data lake architectures). The database engine unions the files automatically, provided they share the same schema.
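The same glob semantics are available in the JDK via java.nio.file.PathMatcher, which is a convenient way to check what a pattern like data/sales_*.csv or data/year=2024/** will match before handing it to the database engine:

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;

public class GlobDemo {
    public static void main(String[] args) {
        // "*" matches within one directory level; "**" crosses directories.
        PathMatcher flat = FileSystems.getDefault().getPathMatcher("glob:data/sales_*.csv");
        PathMatcher deep = FileSystems.getDefault().getPathMatcher("glob:data/year=2024/**");

        System.out.println(flat.matches(Path.of("data/sales_2024.csv")));                     // true
        System.out.println(flat.matches(Path.of("data/sales_2024.json")));                    // false
        System.out.println(deep.matches(Path.of("data/year=2024/month=01/part-0.parquet"))); // true
    }
}
```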

Fluent Builder Pattern

The data loading methods in Smile's SQL class return this (the SQL instance itself), enabling method chaining in a fluent builder style:

db.csv("customers", "customers.csv")
  .parquet("orders", "orders/*.parquet")
  .csv("products", "products.csv");

This fluent interface, a variant of the Builder design pattern, makes data-pipeline setup code readable and self-documenting: the sequence of loading steps reads like a declarative specification.
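The mechanism behind the chaining above can be reproduced with a toy loader whose registration methods return this. The names below (ToyLoader and its bookkeeping) are illustrative stand-ins, not Smile's actual classes:

```java
import java.util.ArrayList;
import java.util.List;

public class ToyLoader {
    private final List<String> tables = new ArrayList<>();

    /** Register a CSV source under a table name; returns this for chaining. */
    public ToyLoader csv(String table, String path) {
        tables.add(table + " <- csv:" + path);
        return this;
    }

    /** Register a Parquet source; glob patterns are passed through verbatim. */
    public ToyLoader parquet(String table, String path) {
        tables.add(table + " <- parquet:" + path);
        return this;
    }

    public List<String> tables() { return tables; }

    public static void main(String[] args) {
        ToyLoader db = new ToyLoader()
            .csv("customers", "customers.csv")
            .parquet("orders", "orders/*.parquet");
        System.out.println(db.tables());
        // [customers <- csv:customers.csv, orders <- parquet:orders/*.parquet]
    }
}
```

Because each call returns the same instance, the chain builds up registration state in order, which is what lets a load sequence read top-to-bottom as a specification.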

Relevance to Machine Learning

Data ingestion is the first step in any ML pipeline. The efficiency and flexibility of ingestion directly impact:

  • Iteration speed: Faster ingestion means faster experiment cycles during model development.
  • Data diversity: Supporting multiple formats means ML practitioners can work with data from various sources without preprocessing.
  • Reproducibility: Loading data via SQL statements creates a declarative, auditable record of what data was used for training.

Knowledge Sources

Domains

Data_Engineering, SQL, Analytics, ETL
