Principle:Haifengl_Smile_SQL_Data_Ingestion
Overview
SQL Data Ingestion is the principle of loading external data sources -- such as CSV, Parquet, JSON, and Apache Iceberg tables -- directly into an in-process SQL database without intermediate ETL (Extract, Transform, Load) steps. The database engine reads files natively, creating in-memory tables that can immediately be queried with standard SQL.
In the Smile library, this principle is realized through the SQL class's file-loading methods, which delegate to DuckDB's built-in file-scanning operators to read heterogeneous data sources and register them as queryable tables.
Theoretical Basis
Traditional ETL vs. Direct Ingestion
In a traditional data pipeline, loading data into a database involves three distinct stages:
| Stage | Traditional ETL | Direct Ingestion |
|---|---|---|
| Extract | Read files using a separate library (e.g., Apache Commons CSV). | The database engine reads the file directly. |
| Transform | Parse data types, handle nulls, apply schema. | The engine infers or applies schema during scan. |
| Load | Insert rows into the database via INSERT statements. | CREATE TABLE AS SELECT * FROM read_csv(...) in a single operation. |
Direct ingestion collapses these three stages into a single operation handled entirely by the database engine. This provides several advantages:
- Reduced memory copies: Data flows from disk to the database's internal columnar format without passing through intermediate Java objects.
- Parallel I/O: The database engine can use multiple threads to read and parse files concurrently.
- Schema inference: The engine can automatically detect column types from file headers and sample data.
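As a concrete illustration of the last point, the sketch below shows the core idea behind schema inference: sample a column's text values and pick the narrowest SQL type that fits all of them. This is a deliberately simplified stand-in, not DuckDB's actual CSV sniffer, and the class and method names are invented for this example.

```java
import java.util.List;

public class SchemaInference {
    /** Infer the narrowest SQL type that fits every sample value in a column. */
    static String inferType(List<String> samples) {
        boolean allLong = true, allDouble = true, allBool = true;
        for (String s : samples) {
            if (s == null || s.isEmpty()) continue;  // nulls do not constrain the type
            if (allLong)   allLong   = s.matches("[+-]?\\d+");
            if (allDouble) allDouble = s.matches("[+-]?\\d*\\.?\\d+([eE][+-]?\\d+)?");
            if (allBool)   allBool   = s.equalsIgnoreCase("true") || s.equalsIgnoreCase("false");
        }
        if (allBool)   return "BOOLEAN";
        if (allLong)   return "BIGINT";
        if (allDouble) return "DOUBLE";
        return "VARCHAR";
    }

    public static void main(String[] args) {
        System.out.println(inferType(List.of("1", "2", "30")));      // BIGINT
        System.out.println(inferType(List.of("1.5", "2", "3e-2")));  // DOUBLE
        System.out.println(inferType(List.of("true", "FALSE")));     // BOOLEAN
        System.out.println(inferType(List.of("abc", "42")));         // VARCHAR
    }
}
```

Real engines refine this idea with configurable sample sizes, date/time detection, and fallback re-scans when a later row contradicts the inferred type.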
File Format Considerations
Different file formats have different properties that affect ingestion performance:
| Format | Storage Model | Schema | Compression | Typical Use |
|---|---|---|---|---|
| CSV | Row-oriented text | Header-based or explicit | None (raw text) | Simple data exchange, spreadsheets |
| Parquet | Columnar binary | Embedded in file metadata | Snappy, Gzip, ZSTD | Data lakes, large-scale analytics |
| JSON | Semi-structured text | Inferred from structure | None (raw text) | APIs, log data, nested data |
| Iceberg | Columnar (Parquet underneath) | Managed by Iceberg catalog | Inherited from Parquet | Versioned data lakes, time-travel queries |
Parquet is the most efficient format for analytical ingestion because:
- Its columnar layout matches the database engine's internal storage model, minimizing transformation.
- Embedded statistics (min/max per column chunk) enable predicate pushdown -- the engine can skip irrelevant data blocks.
- Built-in compression reduces I/O bandwidth requirements.
CSV requires the most processing because every value must be parsed from text to its target type, and there is no embedded schema.
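The statistics-based pruning described above can be sketched in a few lines. This is a toy model of the mechanism, not Parquet's actual reader API: each column chunk carries a min/max pair, and a scan evaluating "value > threshold" skips any chunk whose max rules out a match before decompressing it.

```java
import java.util.List;

public class PredicatePushdown {
    /** Per-chunk statistics, as stored in Parquet column-chunk metadata. */
    record ChunkStats(long min, long max) {}

    /** Count the chunks a scan of "value > threshold" must actually read. */
    static long chunksToRead(List<ChunkStats> chunks, long threshold) {
        // A chunk whose max is <= threshold cannot contain a matching row,
        // so it is skipped without being decompressed or decoded.
        return chunks.stream().filter(c -> c.max() > threshold).count();
    }

    public static void main(String[] args) {
        List<ChunkStats> chunks = List.of(
            new ChunkStats(0, 99), new ChunkStats(100, 199), new ChunkStats(200, 299));
        System.out.println(chunksToRead(chunks, 150));  // 2: the first chunk is pruned
    }
}
```

CSV offers no equivalent shortcut: with no embedded statistics, every byte must be read and parsed before a filter can be applied.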
Glob Pattern Matching
Modern analytical databases support glob pattern matching for file paths, enabling multiple files to be treated as a single logical table. For example:
data/sales_*.csv -- All CSV files matching the pattern
data/year=2024/** -- Hive-partitioned directory structure
This is essential for partitioned datasets where data is split across many files (common in data lake architectures). The database engine unions the files automatically, provided they share the same schema.
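The matching semantics can be tried directly with Java's standard glob support, which mirrors the patterns above: * matches within a single directory level, while ** also crosses directory boundaries. (The database engine performs this matching internally; the snippet only demonstrates the pattern rules.)

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;

public class GlobDemo {
    public static void main(String[] args) {
        // "*" stops at directory separators; "**" does not.
        PathMatcher flat = FileSystems.getDefault().getPathMatcher("glob:data/sales_*.csv");
        PathMatcher deep = FileSystems.getDefault().getPathMatcher("glob:data/year=2024/**");

        System.out.println(flat.matches(Path.of("data/sales_2024.csv")));           // true
        System.out.println(flat.matches(Path.of("data/2023/sales_2023.csv")));      // false
        System.out.println(deep.matches(Path.of("data/year=2024/m=01/part.parquet"))); // true
    }
}
```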
Fluent Builder Pattern
The data loading methods in Smile's SQL class return this (the SQL instance itself), enabling method chaining in a fluent builder style:
db.csv("customers", "customers.csv")
.parquet("orders", "orders/*.parquet")
.csv("products", "products.csv")
This pattern, rooted in the Builder design pattern, creates readable, self-documenting data pipeline setup code where the sequence of data loading steps reads like a declarative specification.
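The mechanism behind the chaining is simply that each loader method returns the receiver. The standalone sketch below illustrates this; it is not Smile's actual SQL class, and the Source record and sources() accessor are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of a fluent loader: each method registers a source and returns this. */
public class LoaderSketch {
    record Source(String table, String path, String format) {}

    private final List<Source> sources = new ArrayList<>();

    public LoaderSketch csv(String table, String path) {
        sources.add(new Source(table, path, "csv"));
        return this;  // returning the receiver is what enables chaining
    }

    public LoaderSketch parquet(String table, String path) {
        sources.add(new Source(table, path, "parquet"));
        return this;
    }

    public List<Source> sources() { return sources; }

    public static void main(String[] args) {
        LoaderSketch db = new LoaderSketch()
            .csv("customers", "customers.csv")
            .parquet("orders", "orders/*.parquet")
            .csv("products", "products.csv");
        System.out.println(db.sources().size());  // 3
    }
}
```

Because every call mutates and returns the same instance, the chain is equivalent to three separate statements on one object; the fluent form simply removes the repetition of the receiver variable.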
Relevance to Machine Learning
Data ingestion is the first step in any ML pipeline. The efficiency and flexibility of ingestion directly impact:
- Iteration speed: Faster ingestion means faster experiment cycles during model development.
- Data diversity: Supporting multiple formats means ML practitioners can work with data from various sources without preprocessing.
- Reproducibility: Loading data via SQL statements creates a declarative, auditable record of what data was used for training.
Knowledge Sources
Domains
Data_Engineering, SQL, Analytics, ETL
Related
- Implementation:Haifengl_Smile_SQL_Data_Loaders - The concrete API methods (csv, parquet, iceberg, json) that implement data ingestion.
- Principle:Haifengl_Smile_SQL_Database_Connection - A database connection must be established before data can be ingested.
- Principle:Haifengl_Smile_SQL_Metadata_Exploration - After ingestion, metadata exploration confirms the schema of loaded tables.