Implementation:Iamhankai Forest of Thought Mcts Load Data
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Preprocessing |
| Last Updated | 2026-02-14 03:00 GMT |
Overview
A concrete tool, provided by the Forest-of-Thought repository, for loading and preprocessing benchmark datasets.
Description
The mcts_load_data function loads benchmark datasets from disk, handles format conversion (JSONL to Parquet), applies difficulty-level filtering for MATH problems, and slices to the requested sample range. It uses the HuggingFace datasets library for efficient data handling and returns a Dataset object ready for iteration.
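The load, filter, and slice steps described above can be sketched without the repository. This is a minimal stand-in, not the repository's code: it uses plain Python lists in place of a `datasets.Dataset`, skips the Parquet conversion entirely, and assumes each MATH record carries an integer `level` field. The helper name `load_filter_slice` is hypothetical.

```python
import json
import os
import tempfile


def load_filter_slice(path, dataset, level, start_id, end_id):
    """Stand-in for the load -> filter -> slice steps (hypothetical helper)."""
    # Read one JSON object per line, skipping blank lines.
    with open(path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]

    # Assumption: only MATH-style datasets carry a `level` field to filter on.
    if "math" in dataset:
        rows = [r for r in rows if r.get("level") == level]

    # Half-open range [start_id, end_id), applied after filtering.
    return rows[start_id:end_id]


# Build a tiny JSONL file so the sketch runs on its own.
records = [{"problem": f"p{i}", "answer": str(i), "level": 1 + i % 5} for i in range(10)]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    for rec in records:
        tmp.write(json.dumps(rec) + "\n")

subset = load_filter_slice(tmp.name, "math500", level=5, start_id=0, end_id=1)
os.remove(tmp.name)
print(subset)
```

Because the slice is applied after the filter in this sketch, `start_id` and `end_id` index into the filtered rows rather than the original file.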
Usage
Call this function after parsing command-line arguments. The returned Dataset object is iterated in the main evaluation loop, with each example passed to Monte_Carlo_Forest.run() for tree-search reasoning.
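A minimal sketch of the argument-parsing step, assuming the command-line flags mirror the namespace field names listed on this page. The repository's actual parser may define other flags and defaults; `build_parser` and the default values below are placeholders.

```python
import argparse


def build_parser():
    """Hypothetical parser exposing only the fields mcts_load_data reads."""
    parser = argparse.ArgumentParser(description="FoT data-loading arguments (sketch)")
    parser.add_argument("--dataset", type=str, required=True)
    parser.add_argument("--dataset_filepath", type=str, required=True)
    # Assumed defaults; the real script may use different ones.
    parser.add_argument("--level", type=int, default=1, choices=range(1, 6))
    parser.add_argument("--start_id", type=int, default=0)
    parser.add_argument("--end_id", type=int, default=100)
    return parser


# Parse an explicit list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(
    ["--dataset", "gsm8k", "--dataset_filepath", "/data/gsm8k/test.jsonl", "--end_id", "50"]
)
print(args.dataset, args.start_id, args.end_id)
```

The resulting `args` namespace has the shape that `mcts_load_data(args)` expects.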
Code Reference
Source Location
- Repository: Forest-of-Thought
- File: utils/utils.py
- Lines: L35-54
Signature
def mcts_load_data(args):
    """
    Load and preprocess benchmark dataset for FoT evaluation.

    Args:
        args (argparse.Namespace): Configuration with fields:
            - dataset (str): Dataset name (must contain 'gsm', 'math', or 'aime')
            - dataset_filepath (str): Path to JSONL or Parquet file
            - level (int): MATH difficulty level filter (1-5)
            - start_id (int): Start index for sample range (inclusive)
            - end_id (int): End index for sample range (exclusive)

    Returns:
        datasets.Dataset: Filtered and sliced HuggingFace Dataset object.
    """
Import
from utils.utils import mcts_load_data
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| args | argparse.Namespace | Yes | Configuration namespace with dataset, dataset_filepath, level, start_id, end_id |
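The input contract can be checked before the loader is called. The following is a hedged sketch: `validate_args` is a hypothetical helper, the requirement `0 <= start_id < end_id` is an assumption not stated in the source, and whether `mcts_load_data` itself raises on a bad namespace is not documented here.

```python
import argparse

KNOWN_KEYS = ("gsm", "math", "aime")


def validate_args(args):
    """Raise ValueError if the namespace breaks the documented contract (hypothetical helper)."""
    if not any(key in args.dataset for key in KNOWN_KEYS):
        raise ValueError(f"dataset must contain one of {KNOWN_KEYS}: {args.dataset!r}")
    if not 1 <= args.level <= 5:
        raise ValueError(f"level must be in 1-5: {args.level}")
    # Assumption: a non-empty, non-negative half-open range is required.
    if not 0 <= args.start_id < args.end_id:
        raise ValueError(f"need 0 <= start_id < end_id: {args.start_id}, {args.end_id}")
    return args


good = argparse.Namespace(dataset="math500", dataset_filepath="/data/math/test.jsonl",
                          level=5, start_id=0, end_id=500)
validate_args(good)

bad = argparse.Namespace(dataset="mmlu", dataset_filepath="x.jsonl",
                         level=1, start_id=0, end_id=10)
try:
    validate_args(bad)
    rejected = False
except ValueError:
    rejected = True
```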
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | datasets.Dataset | HuggingFace Dataset sliced to [start_id, end_id), optionally filtered by level |
Usage Examples
Loading GSM8K Dataset
from utils.utils import mcts_load_data
import argparse
args = argparse.Namespace(
    dataset="gsm8k",
    dataset_filepath="/data/gsm8k/test.jsonl",
    level=1,  # The level filter applies to MATH problems
    start_id=0,
    end_id=100,
)
dataset = mcts_load_data(args)
print(f"Loaded {len(dataset)} examples")
# Access: dataset[0]['question'], dataset[0]['answer']
Loading MATH with Level Filter
from utils.utils import mcts_load_data
import argparse
args = argparse.Namespace(
    dataset="math500",
    dataset_filepath="/data/math/test.jsonl",
    level=5,  # Only hardest problems
    start_id=0,
    end_id=500,
)
dataset = mcts_load_data(args)
# Access: dataset[0]['problem'], dataset[0]['answer']
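The two examples expose the problem text under different keys (`question` for GSM8K, `problem` for MATH), so a loop that handles both needs to pick the right one. A minimal sketch, assuming only those two key names; `get_problem_text` is a hypothetical helper, and plain dicts stand in for rows of the returned Dataset.

```python
def get_problem_text(example):
    """Return the problem statement from a GSM8K- or MATH-style record (hypothetical helper)."""
    for key in ("question", "problem"):
        if key in example:
            return example[key]
    raise KeyError("expected a 'question' or 'problem' field")


# Plain dicts stand in for rows of the returned Dataset.
gsm_row = {"question": "What is 2 + 3?", "answer": "5"}
math_row = {"problem": "Solve x + 1 = 2.", "answer": "1"}
texts = [get_problem_text(row) for row in (gsm_row, math_row)]
print(texts)
```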