Implementation:Huggingface Datasets Sql Builder

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Loading, Database
Last Updated	2026-02-14 18:00 GMT

Overview

Packaged dataset builder for loading data from SQL databases into Arrow-backed datasets provided by the HuggingFace Datasets library.

Description

Sql is a packaged dataset builder extending ArrowBasedBuilder that reads data from SQL databases using pandas.read_sql(). It is configured via SqlConfig, a dataclass extending BuilderConfig, with fields for sql (a SQL query string or SQLAlchemy Selectable), con (a database connection URI, SQLAlchemy engine/connection, or sqlite3.Connection), and standard pandas read_sql parameters: index_col, coerce_float (default True), params, parse_dates, columns, chunksize (default 10,000), and features.

Both sql and con are required and validated in __post_init__. The create_config_id method handles deterministic hashing by stringifying SQLAlchemy Selectable objects and converting non-string connections to their object IDs. A pd_read_sql_kwargs property assembles the keyword arguments forwarded to pandas.read_sql().

The builder produces a single TRAIN split. The _generate_tables method calls pd.read_sql() with the configured chunksize, iterates over the resulting chunks, converts each DataFrame to an Arrow table, and applies optional feature casting. The _cast_table method uses a cheaper cast when no storage-level feature transformations are needed, and falls back to table_cast for more complex conversions.

Usage

Use Sql via load_dataset("sql", sql=query, con=connection) to load data from any SQL database supported by pandas and SQLAlchemy. The builder supports both raw SQL strings and SQLAlchemy expression objects.

Code Reference

Source Location

Repository: datasets
File: src/datasets/packaged_modules/sql/sql.py
Lines: 1-120

Signature

@dataclass
class SqlConfig(datasets.BuilderConfig):
    """BuilderConfig for SQL."""
    sql: Union[str, "sqlalchemy.sql.Selectable"] = None
    con: Union[str, "sqlalchemy.engine.Connection", "sqlalchemy.engine.Engine", "sqlite3.Connection"] = None
    index_col: Optional[Union[str, list[str]]] = None
    coerce_float: bool = True
    params: Optional[Union[list, tuple, dict]] = None
    parse_dates: Optional[Union[list, dict]] = None
    columns: Optional[list[str]] = None
    chunksize: Optional[int] = 10_000
    features: Optional[datasets.Features] = None


class Sql(datasets.ArrowBasedBuilder):
    BUILDER_CONFIG_CLASS = SqlConfig

    def _info(self):
    def _split_generators(self, dl_manager):
    def _cast_table(self, pa_table: pa.Table) -> pa.Table:
    def _generate_tables(self):

Import

from datasets.packaged_modules.sql.sql import Sql, SqlConfig

I/O Contract

Inputs (SqlConfig)

Name	Type	Required	Description
sql	`Union[str, Selectable]`	Yes	SQL query string or SQLAlchemy `Selectable` expression to execute. Raises `ValueError` if not provided.
con	`Union[str, Connection, Engine, sqlite3.Connection]`	Yes	Database connection. Can be a URI string, SQLAlchemy engine/connection, or `sqlite3.Connection`. Raises `ValueError` if not provided.
index_col	`Optional[Union[str, list[str]]]`	No	Column(s) to set as the DataFrame index.
coerce_float	`bool`	No	Whether to coerce non-string, non-numeric objects to floats. Defaults to `True`.
params	`Optional[Union[list, tuple, dict]]`	No	Parameters to pass to the SQL query for parameterized execution.
parse_dates	`Optional[Union[list, dict]]`	No	Column names to parse as dates.
columns	`Optional[list[str]]`	No	List of column names to select from the SQL result.
chunksize	`Optional[int]`	No	Number of rows per chunk when reading from the database. Defaults to 10,000.
features	`Optional[Features]`	No	Schema describing the dataset features. If provided, the Arrow table is cast to match.

Outputs

Name	Type	Description
dataset	`Dataset`	An Arrow-backed Dataset containing the SQL query results, always assigned to the `TRAIN` split.

Usage Examples

Basic Usage with SQLite

import sqlite3
from datasets import load_dataset

# Connect to SQLite database
con = sqlite3.connect("my_database.db")

# Load data from SQL query
ds = load_dataset(
    "sql",
    sql="SELECT * FROM my_table",
    con=con,
    split="train",
)
print(ds[0])

Using a SQLAlchemy Connection URI

from datasets import load_dataset

# Load from PostgreSQL using URI
ds = load_dataset(
    "sql",
    sql="SELECT id, text, label FROM articles WHERE year >= 2020",
    con="postgresql://user:password@localhost:5432/mydb",
    split="train",
)

Parameterized Query

from datasets import load_dataset

ds = load_dataset(
    "sql",
    sql="SELECT * FROM users WHERE age > :min_age",
    con="sqlite:///users.db",
    params={"min_age": 18},
    split="train",
)

Related Pages

Implements Principle

Principle:Huggingface_Datasets_SQL_Dataset_Building

Requires Environment

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment