Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Huggingface Datasets Sql Builder

From Leeroopedia
Knowledge Sources
Domains Data_Loading, Database
Last Updated 2026-02-14 18:00 GMT

Overview

Packaged dataset builder for loading data from SQL databases into Arrow-backed datasets provided by the HuggingFace Datasets library.

Description

Sql is a packaged dataset builder extending ArrowBasedBuilder that reads data from SQL databases using pandas.read_sql(). It is configured via SqlConfig, a dataclass extending BuilderConfig, with fields for sql (a SQL query string or SQLAlchemy Selectable), con (a database connection URI, SQLAlchemy engine/connection, or sqlite3.Connection), and standard pandas read_sql parameters: index_col, coerce_float (default True), params, parse_dates, columns, chunksize (default 10,000), and features.

Both sql and con are required and validated in __post_init__. The create_config_id method handles deterministic hashing by stringifying SQLAlchemy Selectable objects and converting non-string connections to their object IDs. A pd_read_sql_kwargs property assembles the keyword arguments forwarded to pandas.read_sql().

The builder produces a single TRAIN split. The _generate_tables method calls pd.read_sql() with the configured chunksize, iterates over the resulting chunks, converts each DataFrame to an Arrow table, and applies optional feature casting. The _cast_table method uses a cheaper cast when no storage-level feature transformations are needed, and falls back to table_cast for more complex conversions.

Usage

Use Sql via load_dataset("sql", sql=query, con=connection) to load data from any SQL database supported by pandas and SQLAlchemy. The builder supports both raw SQL strings and SQLAlchemy expression objects.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/packaged_modules/sql/sql.py
  • Lines: 1-120

Signature

@dataclass
class SqlConfig(datasets.BuilderConfig):
    """BuilderConfig for SQL."""
    sql: Union[str, "sqlalchemy.sql.Selectable"] = None
    con: Union[str, "sqlalchemy.engine.Connection", "sqlalchemy.engine.Engine", "sqlite3.Connection"] = None
    index_col: Optional[Union[str, list[str]]] = None
    coerce_float: bool = True
    params: Optional[Union[list, tuple, dict]] = None
    parse_dates: Optional[Union[list, dict]] = None
    columns: Optional[list[str]] = None
    chunksize: Optional[int] = 10_000
    features: Optional[datasets.Features] = None


class Sql(datasets.ArrowBasedBuilder):
    BUILDER_CONFIG_CLASS = SqlConfig

    def _info(self):
    def _split_generators(self, dl_manager):
    def _cast_table(self, pa_table: pa.Table) -> pa.Table:
    def _generate_tables(self):

Import

from datasets.packaged_modules.sql.sql import Sql, SqlConfig

I/O Contract

Inputs (SqlConfig)

Name Type Required Description
sql Union[str, Selectable] Yes SQL query string or SQLAlchemy Selectable expression to execute. Raises ValueError if not provided.
con Union[str, Connection, Engine, sqlite3.Connection] Yes Database connection. Can be a URI string, SQLAlchemy engine/connection, or sqlite3.Connection. Raises ValueError if not provided.
index_col Optional[Union[str, list[str]]] No Column(s) to set as the DataFrame index.
coerce_float bool No Whether to coerce non-string, non-numeric objects to floats. Defaults to True.
params Optional[Union[list, tuple, dict]] No Parameters to pass to the SQL query for parameterized execution.
parse_dates Optional[Union[list, dict]] No Column names to parse as dates.
columns Optional[list[str]] No List of column names to select from the SQL result.
chunksize Optional[int] No Number of rows per chunk when reading from the database. Defaults to 10,000.
features Optional[Features] No Schema describing the dataset features. If provided, the Arrow table is cast to match.

Outputs

Name Type Description
dataset Dataset An Arrow-backed Dataset containing the SQL query results, always assigned to the TRAIN split.

Usage Examples

Basic Usage with SQLite

import sqlite3
from datasets import load_dataset

# Connect to SQLite database
con = sqlite3.connect("my_database.db")

# Load data from SQL query
ds = load_dataset(
    "sql",
    sql="SELECT * FROM my_table",
    con=con,
    split="train",
)
print(ds[0])

Using a SQLAlchemy Connection URI

from datasets import load_dataset

# Load from PostgreSQL using URI
ds = load_dataset(
    "sql",
    sql="SELECT id, text, label FROM articles WHERE year >= 2020",
    con="postgresql://user:password@localhost:5432/mydb",
    split="train",
)

Parameterized Query

from datasets import load_dataset

ds = load_dataset(
    "sql",
    sql="SELECT * FROM users WHERE age > :min_age",
    con="sqlite:///users.db",
    params={"min_age": 18},
    split="train",
)

Related Pages

Implements Principle

Requires Environment

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment