Implementation:Explodinggradients Ragas Text2SQL Data Utils

Field	Value
source	Explodinggradients_Ragas (https://github.com/explodinggradients/ragas)
domains	Examples, Text2SQL
last_updated	2026-02-10 00:00 GMT

Overview

A CLI utility module for downloading the BookSQL dataset from Hugging Face Hub and creating balanced, optionally validated sample datasets for Text-to-SQL evaluation workflows.

Description

The data_utils.py module provides a complete data preparation pipeline for the Text-to-SQL evaluation example. It includes functionality to download the gated BookSQL dataset from Hugging Face Hub using snapshot_download, load and deduplicate the training data from JSON, sample records across difficulty levels (easy, medium, hard), optionally validate SQL queries by executing them against the SQLite database, and save the final balanced dataset to CSV. The validation step uses execute_and_validate_query from the sibling validate_sql_dataset module to ensure only executable queries are included. The module exposes a CLI via argparse with flags for downloading data (--download-data), creating samples (--create-sample), controlling sample size (--samples), enabling validation (--validate), and filtering to queries that return data (--require-data).
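The deduplicate-then-sample step described above can be sketched with the standard library alone. This is a minimal illustration, not the module's code: the field names ("Query", "SQL", "Levels") and the helper names `deduplicate` and `balanced_sample` are assumptions made for the example, and the real module works on a DataFrame rather than a list of dicts.

```python
import random
from collections import defaultdict

# Hypothetical field names; the real train.json schema may differ.
QUESTION_KEY, SQL_KEY, LEVEL_KEY = "Query", "SQL", "Levels"


def deduplicate(records):
    """Keep the first record seen for each (question, SQL) pair."""
    seen, unique = set(), []
    for rec in records:
        key = (rec[QUESTION_KEY], rec[SQL_KEY])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def balanced_sample(records, samples_per_level=10, random_seed=42,
                    levels=("easy", "medium", "hard")):
    """Draw up to samples_per_level records from each difficulty level."""
    by_level = defaultdict(list)
    for rec in records:
        by_level[rec[LEVEL_KEY]].append(rec)

    rng = random.Random(random_seed)  # fixed seed keeps the sample reproducible
    sample = []
    for level in levels:
        pool = by_level.get(level, [])
        sample.extend(rng.sample(pool, min(samples_per_level, len(pool))))
    return sample


# Toy data: five distinct records per level plus one exact duplicate.
records = [
    {"Query": f"{level} question {i}", "SQL": f"SELECT {i}", "Levels": level}
    for level in ("easy", "medium", "hard")
    for i in range(5)
]
records.append(dict(records[0]))

unique_records = deduplicate(records)
sample = balanced_sample(unique_records, samples_per_level=2)
```

Seeding a local `random.Random` instance, rather than the global generator, keeps the draw reproducible without affecting other code that uses `random`.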

Usage

# Download the BookSQL dataset
python -m ragas_examples.text2sql.data_utils --download-data

# Create a balanced sample with the default number of queries per difficulty level
python -m ragas_examples.text2sql.data_utils --create-sample

# Create a validated sample with 5 per level
python -m ragas_examples.text2sql.data_utils --create-sample --samples 5 --validate

# Only include queries that return actual data
python -m ragas_examples.text2sql.data_utils --create-sample --validate --require-data
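For orientation, the flags used above could be declared with argparse roughly as follows. The flag names come from this page. The help strings, the `build_parser` name, and the choice to leave `--samples` without a default are assumptions for illustration, and the module's actual parser may differ.

```python
import argparse


def build_parser():
    """Build a parser with the five flags documented on this page (sketch)."""
    parser = argparse.ArgumentParser(description="BookSQL data utilities (sketch)")
    parser.add_argument("--download-data", action="store_true",
                        help="download the BookSQL dataset")
    parser.add_argument("--create-sample", action="store_true",
                        help="create a balanced sample CSV")
    # Assumption: no default here, so the called function's own default applies.
    parser.add_argument("--samples", type=int, default=None,
                        help="number of queries per difficulty level")
    parser.add_argument("--validate", action="store_true",
                        help="keep only queries that execute successfully")
    parser.add_argument("--require-data", action="store_true",
                        help="keep only queries that return at least one row")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is testable.
args = build_parser().parse_args(["--create-sample", "--samples", "5", "--validate"])
```

argparse converts dashes in flag names to underscores, so `--download-data` is read back as `args.download_data` and `--require-data` as `args.require_data`.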

Code Reference

Field	Value
Source Location	`examples/ragas_examples/text2sql/data_utils.py`
File Size	467 lines

Function Signatures

def download_booksql_dataset() -> bool

def create_sample_dataset(
    input_file: str = "BookSQL-files/BookSQL/train.json",
    output_dir: str = "datasets",
    output_filename: str = "booksql_sample.csv",
    samples_per_level: int = 10,
    random_seed: int = 42,
    validate_queries: bool = False,
    require_data: bool = False
) -> bool

def validate_samples(
    data: DataFrame, level: str, samples_per_level: int,
    random_seed: int, require_data: bool = False
) -> DataFrame

I/O Contract

Function	Input	Output
download_booksql_dataset	None (downloads from `Exploration-Lab/BookSQL` on HF Hub)	`bool` -- True if download succeeded; files written to `./BookSQL-files/`
create_sample_dataset	`input_file` (path to train.json), sampling and validation parameters	`bool` -- True if CSV saved successfully to `output_dir/output_filename`
validate_samples	`data: DataFrame`, difficulty level, sample count, seed, require_data flag	`DataFrame` -- Validated samples for the specified difficulty level
load_and_clean_data	`input_file: str` (path to train.json)	`DataFrame` -- Deduplicated training records
save_results	`data: DataFrame`, output_dir, output_filename, random_seed	`bool` -- True if CSV saved successfully
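The difference between plain validation and the require_data option can be illustrated with sqlite3 from the standard library. This is a hedged sketch only: `is_query_usable` is a hypothetical helper, not the `execute_and_validate_query` function the module imports, and the in-memory `invoices` table stands in for the BookSQL SQLite database.

```python
import sqlite3


def is_query_usable(conn, sql, require_data=False):
    """Return True if sql executes; with require_data it must also yield a row."""
    try:
        rows = conn.execute(sql).fetchmany(1)
    except sqlite3.Error:
        return False  # syntax errors, missing tables, and similar failures
    return bool(rows) or not require_data


# Toy database standing in for the real BookSQL SQLite file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER, amount REAL)")
conn.execute("INSERT INTO invoices VALUES (1, 99.5)")

candidates = [
    "SELECT amount FROM invoices WHERE id = 1",  # runs and returns a row
    "SELECT amount FROM invoices WHERE id = 2",  # runs but returns nothing
    "SELECT amount FROM missing_table",          # fails to execute
]
executable = [q for q in candidates if is_query_usable(conn, q)]
with_data = [q for q in candidates if is_query_usable(conn, q, require_data=True)]
```

In this toy run, two of the three candidates pass plain validation and only the first passes with require_data set, which is the same narrowing the --validate and --require-data flags describe.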

Usage Examples

from ragas_examples.text2sql.data_utils import (
    create_sample_dataset,
    download_booksql_dataset,
)

# Step 1: Download the dataset (returns False if the download fails)
if not download_booksql_dataset():
    raise SystemExit("BookSQL download failed")

# Step 2: Create a validated sample dataset
success = create_sample_dataset(
    samples_per_level=10,
    validate_queries=True,
    require_data=True,
)
print(f"Dataset created: {success}")

Related Pages

Explodinggradients_Ragas_Text2SQL_DB_Utils -- Database utilities used for query execution during validation
Explodinggradients_Ragas_Text2SQL_Validate_Dataset -- Validation module imported by this module
Explodinggradients_Ragas_AG_UI_Experiments_Module -- Another example evaluation pipeline
