Implementation:Explodinggradients Ragas Text2SQL Data Utils

From Leeroopedia


Field | Value
source | Explodinggradients_Ragas (https://github.com/explodinggradients/ragas)
domains | Examples, Text2SQL
last_updated | 2026-02-10 00:00 GMT

Overview

A CLI utility module for downloading the BookSQL dataset from Hugging Face Hub and creating balanced, optionally validated sample datasets for Text-to-SQL evaluation workflows.

Description

The data_utils.py module implements the data preparation pipeline for the Text-to-SQL evaluation example. It performs five steps: it downloads the gated BookSQL dataset from Hugging Face Hub with snapshot_download; loads the training data from JSON and deduplicates it; samples records across the three difficulty levels (easy, medium, hard); optionally validates each SQL query by executing it against the SQLite database; and saves the balanced dataset to CSV. Validation relies on execute_and_validate_query from the sibling validate_sql_dataset module so that, when validation is enabled, only executable queries are kept. The module exposes an argparse CLI with flags for downloading data (--download-data), creating samples (--create-sample), setting the sample size per level (--samples), enabling validation (--validate), and keeping only queries that return data (--require-data).
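
The deduplicate-then-sample step can be sketched with the standard library alone. This is a minimal illustration and not the module's code: the real module works on a DataFrame, and the record keys `Query`, `SQL`, and `Levels`, as well as the helper name `balanced_sample`, are assumptions made for this example.

```python
import random
from collections import defaultdict


def balanced_sample(records, samples_per_level=10, random_seed=42,
                    levels=("easy", "medium", "hard")):
    """Deduplicate records, then draw up to samples_per_level per level."""
    # Drop exact (question, SQL) duplicates while keeping first-seen order.
    # The key names are hypothetical; adjust them to the actual JSON schema.
    seen = set()
    unique = []
    for rec in records:
        key = (rec["Query"], rec["SQL"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)

    # Group the deduplicated records by difficulty level.
    by_level = defaultdict(list)
    for rec in unique:
        by_level[rec["Levels"]].append(rec)

    # A seeded generator makes the sample reproducible across runs.
    rng = random.Random(random_seed)
    sample = []
    for level in levels:
        pool = by_level.get(level, [])
        k = min(samples_per_level, len(pool))  # never ask for more than exists
        sample.extend(rng.sample(pool, k))
    return sample


if __name__ == "__main__":
    demo = [
        {"Query": f"{lvl} question {i}", "SQL": f"SELECT {i} -- {lvl}", "Levels": lvl}
        for lvl in ("easy", "medium", "hard")
        for i in range(5)
    ]
    demo.append(dict(demo[0]))  # exact duplicate, removed before sampling
    picked = balanced_sample(demo, samples_per_level=2)
    print([rec["Levels"] for rec in picked])
```

Sampling per level with a fixed seed keeps the output balanced and repeatable, which matches the random_seed parameter in the signatures on this page.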

Usage

# Download the BookSQL dataset
python -m ragas_examples.text2sql.data_utils --download-data

# Create a balanced sample using the default number of queries per difficulty level
python -m ragas_examples.text2sql.data_utils --create-sample

# Create a validated sample with 5 per level
python -m ragas_examples.text2sql.data_utils --create-sample --samples 5 --validate

# Only include queries that return actual data
python -m ragas_examples.text2sql.data_utils --create-sample --validate --require-data
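
For orientation, an argparse parser that accepts the flags used above might look like the following sketch. Only the flag names come from this page; the help strings, the `build_parser` helper name, and the `--samples` default of 10 (borrowed from the `samples_per_level` default in the function signature) are assumptions, and the actual CLI default may differ.

```python
import argparse


def build_parser():
    """Build a parser with the five flags documented on this page."""
    parser = argparse.ArgumentParser(
        description="Download BookSQL and create balanced sample datasets."
    )
    parser.add_argument("--download-data", action="store_true",
                        help="Download the BookSQL dataset from Hugging Face Hub")
    parser.add_argument("--create-sample", action="store_true",
                        help="Create a balanced sample dataset")
    # Assumed default: taken from create_sample_dataset(samples_per_level=10).
    parser.add_argument("--samples", type=int, default=10,
                        help="Number of queries per difficulty level")
    parser.add_argument("--validate", action="store_true",
                        help="Keep only queries that execute successfully")
    parser.add_argument("--require-data", action="store_true",
                        help="Keep only queries that return at least one row")
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list so the sketch runs without a shell.
    args = build_parser().parse_args(
        ["--create-sample", "--samples", "5", "--validate"]
    )
    print(args.create_sample, args.samples, args.validate, args.require_data)
```

argparse converts dashes in flag names to underscores, so `--require-data` is read as `args.require_data`.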

Code Reference

Field | Value
Source Location | examples/ragas_examples/text2sql/data_utils.py
File Size | 467 lines

Function Signatures

def download_booksql_dataset() -> bool
def create_sample_dataset(
    input_file: str = "BookSQL-files/BookSQL/train.json",
    output_dir: str = "datasets",
    output_filename: str = "booksql_sample.csv",
    samples_per_level: int = 10,
    random_seed: int = 42,
    validate_queries: bool = False,
    require_data: bool = False
) -> bool
def validate_samples(
    data: DataFrame, level: str, samples_per_level: int,
    random_seed: int, require_data: bool = False
) -> DataFrame
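
A download function with the same bool-returning contract as download_booksql_dataset could be sketched as follows. This is not the module's implementation: the helper name `download_dataset`, the injectable `downloader` parameter (added so the sketch can be exercised offline), and the use of `repo_type="dataset"` are assumptions. Because the dataset is gated, a real download also requires an authenticated Hugging Face account that has been granted access.

```python
def download_dataset(repo_id="Exploration-Lab/BookSQL",
                     local_dir="./BookSQL-files",
                     downloader=None):
    """Return True if the snapshot was downloaded, False on any failure."""
    if downloader is None:
        # Import lazily so a missing dependency is reported as a failure
        # instead of an import error at module load time.
        try:
            from huggingface_hub import snapshot_download
        except ImportError:
            print("huggingface_hub is not installed")
            return False
        downloader = snapshot_download
    try:
        # repo_type="dataset" is an assumption about how BookSQL is hosted.
        downloader(repo_id=repo_id, repo_type="dataset", local_dir=local_dir)
    except Exception as exc:  # e.g. no access to the gated repo, network error
        print(f"Download failed: {exc}")
        return False
    return True


if __name__ == "__main__":
    # Offline demonstration with a stand-in downloader that does nothing.
    print(download_dataset(downloader=lambda **kwargs: None))
```

Returning a bool instead of raising lets a CLI entry point decide how to report the failure and which exit code to use.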

I/O Contract

Function | Input | Output
download_booksql_dataset | None (downloads from Exploration-Lab/BookSQL on HF Hub) | bool: True if download succeeded; files written to ./BookSQL-files/
create_sample_dataset | input_file (path to train.json), sampling and validation parameters | bool: True if CSV saved successfully to output_dir/output_filename
validate_samples | data: DataFrame, difficulty level, sample count, seed, require_data flag | DataFrame: validated samples for the specified difficulty level
load_and_clean_data | input_file: str (path to train.json) | DataFrame: deduplicated training records
save_results | data: DataFrame, output_dir, output_filename, random_seed | bool: True if CSV saved successfully
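
The validation contract (keep a query only if it executes, and with require_data only if it returns rows) can be illustrated with the standard sqlite3 module. The module itself delegates this to execute_and_validate_query, whose signature is not shown on this page, so `check_query` below is a hypothetical stand-in and the `invoices` table is invented for the demonstration.

```python
import sqlite3


def check_query(conn, sql, require_data=False):
    """Return (ok, detail) for a single SQL string run against conn."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        # Syntax errors, unknown tables or columns, and similar failures.
        return False, f"error: {exc}"
    if require_data and not rows:
        # The query ran but produced an empty result set.
        return False, "no rows returned"
    return True, f"{len(rows)} row(s)"


if __name__ == "__main__":
    # In-memory database with a made-up table, used only for this demo.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoices (id INTEGER, amount REAL)")
    conn.execute("INSERT INTO invoices VALUES (1, 120.0)")
    candidates = (
        "SELECT amount FROM invoices",
        "SELECT amount FROM invoices WHERE id = 99",
        "SELECT missing_column FROM invoices",
    )
    for sql in candidates:
        print(sql, "->", check_query(conn, sql, require_data=True))
```

Separating "executes" from "returns data" mirrors the split between the --validate and --require-data flags: an empty result is still a valid query unless rows are explicitly required.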

Usage Examples

from ragas_examples.text2sql.data_utils import (
    create_sample_dataset,
    download_booksql_dataset,
)

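# Step 1: Download the dataset (returns False if the download fails)
if not download_booksql_dataset():
    raise SystemExit("BookSQL download failed; check access to the gated dataset")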

# Step 2: Create a validated sample dataset
success = create_sample_dataset(
    samples_per_level=10,
    validate_queries=True,
    require_data=True,
)
print(f"Dataset created: {success}")
