Implementation:Huggingface Datasets Sql Builder
| Knowledge Sources | |
|---|---|
| Domains | Data_Loading, Database |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Packaged dataset builder for loading data from SQL databases into Arrow-backed datasets provided by the HuggingFace Datasets library.
Description
Sql is a packaged dataset builder extending ArrowBasedBuilder that reads data from SQL databases using pandas.read_sql(). It is configured via SqlConfig, a dataclass extending BuilderConfig, with fields for sql (a SQL query string or SQLAlchemy Selectable), con (a database connection URI, SQLAlchemy engine/connection, or sqlite3.Connection), and standard pandas read_sql parameters: index_col, coerce_float (default True), params, parse_dates, columns, chunksize (default 10,000), and features.
Both sql and con are required and validated in __post_init__. The create_config_id method handles deterministic hashing by stringifying SQLAlchemy Selectable objects and converting non-string connections to their object IDs. A pd_read_sql_kwargs property assembles the keyword arguments forwarded to pandas.read_sql().
The builder produces a single TRAIN split. The _generate_tables method calls pd.read_sql() with the configured chunksize, iterates over the resulting chunks, converts each DataFrame to an Arrow table, and applies optional feature casting. The _cast_table method uses a cheaper cast when no storage-level feature transformations are needed, and falls back to table_cast for more complex conversions.
Usage
Use Sql via load_dataset("sql", sql=query, con=connection) to load data from any SQL database supported by pandas and SQLAlchemy. The builder supports both raw SQL strings and SQLAlchemy expression objects.
Code Reference
Source Location
- Repository: datasets
- File:
src/datasets/packaged_modules/sql/sql.py - Lines: 1-120
Signature
@dataclass
class SqlConfig(datasets.BuilderConfig):
"""BuilderConfig for SQL."""
sql: Union[str, "sqlalchemy.sql.Selectable"] = None
con: Union[str, "sqlalchemy.engine.Connection", "sqlalchemy.engine.Engine", "sqlite3.Connection"] = None
index_col: Optional[Union[str, list[str]]] = None
coerce_float: bool = True
params: Optional[Union[list, tuple, dict]] = None
parse_dates: Optional[Union[list, dict]] = None
columns: Optional[list[str]] = None
chunksize: Optional[int] = 10_000
features: Optional[datasets.Features] = None
class Sql(datasets.ArrowBasedBuilder):
BUILDER_CONFIG_CLASS = SqlConfig
def _info(self):
def _split_generators(self, dl_manager):
def _cast_table(self, pa_table: pa.Table) -> pa.Table:
def _generate_tables(self):
Import
from datasets.packaged_modules.sql.sql import Sql, SqlConfig
I/O Contract
Inputs (SqlConfig)
| Name | Type | Required | Description |
|---|---|---|---|
| sql | Union[str, Selectable] |
Yes | SQL query string or SQLAlchemy Selectable expression to execute. Raises ValueError if not provided.
|
| con | Union[str, Connection, Engine, sqlite3.Connection] |
Yes | Database connection. Can be a URI string, SQLAlchemy engine/connection, or sqlite3.Connection. Raises ValueError if not provided.
|
| index_col | Optional[Union[str, list[str]]] |
No | Column(s) to set as the DataFrame index. |
| coerce_float | bool |
No | Whether to coerce non-string, non-numeric objects to floats. Defaults to True.
|
| params | Optional[Union[list, tuple, dict]] |
No | Parameters to pass to the SQL query for parameterized execution. |
| parse_dates | Optional[Union[list, dict]] |
No | Column names to parse as dates. |
| columns | Optional[list[str]] |
No | List of column names to select from the SQL result. |
| chunksize | Optional[int] |
No | Number of rows per chunk when reading from the database. Defaults to 10,000. |
| features | Optional[Features] |
No | Schema describing the dataset features. If provided, the Arrow table is cast to match. |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset |
An Arrow-backed Dataset containing the SQL query results, always assigned to the TRAIN split.
|
Usage Examples
Basic Usage with SQLite
import sqlite3
from datasets import load_dataset
# Connect to SQLite database
con = sqlite3.connect("my_database.db")
# Load data from SQL query
ds = load_dataset(
"sql",
sql="SELECT * FROM my_table",
con=con,
split="train",
)
print(ds[0])
Using a SQLAlchemy Connection URI
from datasets import load_dataset
# Load from PostgreSQL using URI
ds = load_dataset(
"sql",
sql="SELECT id, text, label FROM articles WHERE year >= 2020",
con="postgresql://user:password@localhost:5432/mydb",
split="train",
)
Parameterized Query
from datasets import load_dataset
ds = load_dataset(
"sql",
sql="SELECT * FROM users WHERE age > :min_age",
con="sqlite:///users.db",
params={"min_age": 18},
split="train",
)