Implementation:Explodinggradients Ragas Text2SQL Data Utils
| Field | Value |
|---|---|
| source | Explodinggradients_Ragas (https://github.com/explodinggradients/ragas) |
| domains | Examples, Text2SQL |
| last_updated | 2026-02-10 00:00 GMT |
Overview
A CLI utility module for downloading the BookSQL dataset from Hugging Face Hub and creating balanced, optionally validated sample datasets for Text-to-SQL evaluation workflows.
Description
The data_utils.py module provides the data preparation pipeline for the Text-to-SQL evaluation example. It downloads the gated BookSQL dataset from Hugging Face Hub using snapshot_download, loads and deduplicates the training data from JSON, samples records across three difficulty levels (easy, medium, hard), optionally validates each SQL query by executing it against the SQLite database, and saves the resulting balanced dataset to CSV. Validation relies on execute_and_validate_query from the sibling validate_sql_dataset module, so that only executable queries are kept.
The module exposes a CLI via argparse with the following flags:
- --download-data: download the dataset
- --create-sample: create a sample dataset
- --samples: set the number of samples per difficulty level
- --validate: enable SQL validation
- --require-data: keep only queries that return data
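The deduplicate-then-sample steps described above can be illustrated with a minimal, standard-library sketch. It is not the module's code: the real pipeline works on a DataFrame, whereas this version uses plain lists of dicts, and the helper name balanced_sample and the record keys "Query", "SQL", and "Levels" are assumptions made for illustration.

```python
import random
from collections import defaultdict


def balanced_sample(records, samples_per_level=10, random_seed=42,
                    levels=("easy", "medium", "hard")):
    """Deduplicate records, then draw up to samples_per_level per level.

    Hypothetical helper: the key names "Query", "SQL" and "Levels" are
    assumed here and may not match the actual BookSQL schema.
    """
    # Drop exact (question, SQL) duplicates while keeping first-seen order.
    seen = set()
    unique = []
    for rec in records:
        key = (rec["Query"], rec["SQL"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)

    # Group the deduplicated records by difficulty level.
    by_level = defaultdict(list)
    for rec in unique:
        by_level[rec["Levels"]].append(rec)

    # A seeded RNG keeps the sample reproducible across runs.
    rng = random.Random(random_seed)
    sample = []
    for level in levels:
        pool = by_level.get(level, [])
        # Never request more records than the level actually has.
        k = min(samples_per_level, len(pool))
        sample.extend(rng.sample(pool, k))
    return sample
```

Fixing the seed means two runs over the same input yield the same sample, which is the role random_seed plays in the signatures listed further down.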
Usage
# Download the BookSQL dataset
python -m ragas_examples.text2sql.data_utils --download-data
# Create a balanced sample with 15 queries per difficulty level
python -m ragas_examples.text2sql.data_utils --create-sample
# Create a validated sample with 5 per level
python -m ragas_examples.text2sql.data_utils --create-sample --samples 5 --validate
# Only include queries that return actual data
python -m ragas_examples.text2sql.data_utils --create-sample --validate --require-data
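For orientation, a command-line interface with these five flags could be declared with argparse roughly as follows. This is a sketch, not the module's parser: the flag names come from this page, but the help strings, the build_parser name, and the default of 10 for --samples (borrowed from the samples_per_level default in the function signature) are assumptions, and the real CLI default may differ.

```python
import argparse


def build_parser():
    """Build an illustrative parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(
        description="Prepare BookSQL sample data (illustrative sketch)"
    )
    parser.add_argument("--download-data", action="store_true",
                        help="download the BookSQL dataset")
    parser.add_argument("--create-sample", action="store_true",
                        help="create a balanced sample dataset")
    # Assumed default; the actual CLI default is not confirmed here.
    parser.add_argument("--samples", type=int, default=10,
                        help="number of samples per difficulty level")
    parser.add_argument("--validate", action="store_true",
                        help="keep only queries that execute successfully")
    parser.add_argument("--require-data", action="store_true",
                        help="keep only queries that return rows")
    return parser


# argparse turns dashes in flag names into underscores on the namespace.
args = build_parser().parse_args(["--create-sample", "--samples", "5", "--validate"])
print(args.create_sample, args.samples, args.validate, args.require_data)
```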
Code Reference
| Field | Value |
|---|---|
| Source Location | examples/ragas_examples/text2sql/data_utils.py |
| File Size | 467 lines |
Function Signatures
def download_booksql_dataset() -> bool
def create_sample_dataset(
    input_file: str = "BookSQL-files/BookSQL/train.json",
    output_dir: str = "datasets",
    output_filename: str = "booksql_sample.csv",
    samples_per_level: int = 10,
    random_seed: int = 42,
    validate_queries: bool = False,
    require_data: bool = False
) -> bool
def validate_samples(
    data: DataFrame, level: str, samples_per_level: int,
    random_seed: int, require_data: bool = False
) -> DataFrame
I/O Contract
| Function | Input | Output |
|---|---|---|
| download_booksql_dataset | None (downloads from Exploration-Lab/BookSQL on HF Hub) | bool -- True if download succeeded; files written to ./BookSQL-files/ |
| create_sample_dataset | input_file (path to train.json), sampling and validation parameters | bool -- True if CSV saved successfully to output_dir/output_filename |
| validate_samples | data: DataFrame, difficulty level, sample count, seed, require_data flag | DataFrame -- Validated samples for the specified difficulty level |
| load_and_clean_data | input_file: str (path to train.json) | DataFrame -- Deduplicated training records |
| save_results | data: DataFrame, output_dir, output_filename, random_seed | bool -- True if CSV saved successfully |
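The idea behind validating a query by executing it, and behind the require_data filter, can be sketched with the standard sqlite3 module. The helpers try_query and keep_record below are hypothetical stand-ins written for this page; they are not execute_and_validate_query, whose actual signature and return value are not shown here.

```python
import sqlite3


def try_query(conn, sql):
    """Return (executed_ok, returned_rows) for a single SQL string."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        # Syntax errors, missing tables or columns, and similar failures.
        return False, False
    return True, len(rows) > 0


def keep_record(conn, sql, require_data=False):
    """Decide whether a sampled query should stay in the dataset."""
    ok, has_rows = try_query(conn, sql)
    # With require_data set, executable queries with empty results are dropped.
    return ok and (has_rows or not require_data)


# Tiny in-memory database standing in for the BookSQL SQLite file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER, amount REAL)")
conn.execute("INSERT INTO invoices VALUES (1, 250.0)")
print(keep_record(conn, "SELECT * FROM invoices"))
print(keep_record(conn, "SELECT * FROM invoices WHERE amount > 1000", require_data=True))
```

Separating "did it execute" from "did it return rows" lets one flag control the stricter filter without changing the basic executability check.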
Usage Examples
from ragas_examples.text2sql.data_utils import (
    create_sample_dataset,
    download_booksql_dataset,
)
# Step 1: Download the dataset (returns False if the download fails)
if not download_booksql_dataset():
    raise SystemExit("BookSQL download failed")
# Step 2: Create a validated sample dataset, keeping only queries that return data
success = create_sample_dataset(
    samples_per_level=10,
    validate_queries=True,
    require_data=True,
)
print(f"Dataset created: {success}")
Related Pages
- Explodinggradients_Ragas_Text2SQL_DB_Utils -- Database utilities used for query execution during validation
- Explodinggradients_Ragas_Text2SQL_Validate_Dataset -- Validation module imported by this module
- Explodinggradients_Ragas_AG_UI_Experiments_Module -- Another example evaluation pipeline