Principle:Huggingface Datasets SQL Import
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
SQL Import is the principle of loading the result of a SQL query, or the contents of a whole table, from a relational database into the HuggingFace Dataset format.
Description
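Many organizations store their data in relational databases such as PostgreSQL, MySQL, or SQLite. The SQL Import principle covers executing a SQL query or selecting a table through a database connection, reading the resulting rows, converting them to typed Arrow columns, and producing a HuggingFace Dataset. The connection can be provided as a SQLAlchemy connection URI, a SQLAlchemy engine or connection object, or a raw sqlite3.Connection. Two caveats apply: selecting a table by name (rather than by query) relies on SQLAlchemy, and the result can only be cached reliably when the connection is given as a URI string, because a live connection object cannot be hashed into a stable cache key. The underlying Sql builder handles batched fetching. Type mapping goes through pandas: the column dtypes that pandas infers are converted to Arrow types and can optionally be cast to user-supplied features.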
Usage
Use SQL Import when your training or evaluation data resides in a relational database and you want to bring it into the HuggingFace ecosystem without first exporting to an intermediate file format. This is useful for workflows where the authoritative data source is a database and you want to avoid maintaining duplicate copies in CSV or Parquet.
Theoretical Basis
SQL databases typically store data in a row-oriented format optimized for transactional workloads. Importing SQL query results into a columnar Arrow-backed dataset converts row-oriented tuples into column-oriented arrays, enabling efficient analytical operations such as scanning or filtering a single column. The import pipeline uses pandas.read_sql under the hood (via the Sql builder) to fetch rows in batches, then converts each batch into an Arrow table. The resulting tables are written to an Arrow file on disk, which is memory-mapped for subsequent zero-copy access.