Implementation:Volcengine Verl Datasets Load Dataset
| Field | Value |
|---|---|
| Knowledge Sources | Wrapper Doc (wraps HuggingFace datasets library) |
| Domains | Dataset Acquisition, Data Loading, HuggingFace Hub Integration |
| Last Updated | 2026-02-07 |
Overview
Description
This page documents how verl uses the HuggingFace datasets library to load raw datasets from the HuggingFace Hub or from local paths. The verl data preprocessing scripts call datasets.load_dataset() to acquire source datasets before transforming them into the verl-standard Parquet format.
The function supports multiple loading modes:
- Hub loading: datasets.load_dataset("openai/gsm8k", "main") downloads from HuggingFace Hub
- Local loading: datasets.load_dataset("/local/path/to/dataset") loads from a local directory
- Split selection: datasets.load_dataset("Dahoas/full-hh-rlhf", split="train[:75%]") loads a specific split or subset
A loaded Dataset provides .map() for transformation, .to_parquet() for serialization, and standard indexing for data access. A DatasetDict is indexed by split name (e.g., dataset["train"]) to obtain the Dataset for each split.
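As a rough illustration of the load, transform, and serialize flow, the sketch below applies a row-level function with .map() and writes the result with .to_parquet(). The helper names (add_bookkeeping, export_split), the added fields (split, index), and the output path argument are hypothetical and do not reflect verl's actual record schema. It assumes `ds` is a single-split datasets.Dataset such as dataset["train"].

```python
def add_bookkeeping(example: dict, idx: int, split: str) -> dict:
    """Return a copy of one raw record with hypothetical bookkeeping fields added."""
    row = dict(example)
    row["split"] = split  # hypothetical field, not verl's schema
    row["index"] = idx    # position of the record within its split
    return row


def export_split(ds, split: str, out_path: str) -> None:
    """Map every record of a datasets.Dataset and write the result to Parquet.

    Assumes `ds` is one split (a datasets.Dataset), e.g. dataset["train"].
    """
    mapped = ds.map(
        lambda example, idx: add_bookkeeping(example, idx, split),
        with_indices=True,  # makes .map() pass the row index as a second argument
    )
    mapped.to_parquet(out_path)


# Pure-Python check of the row transform; no download is needed for this part.
sample = add_bookkeeping({"question": "1 + 1?", "answer": "2"}, 0, "train")
print(sample)
```

Keeping the row transform separate from the export step means the transform can be checked on a plain dict before any dataset is downloaded.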
Usage
import datasets
# Load from HuggingFace Hub
dataset = datasets.load_dataset("openai/gsm8k", "main")
# Load from local path
dataset = datasets.load_dataset("/local/path/to/gsm8k", "main")
External reference: HuggingFace Datasets Documentation
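The Hub and local calls differ only in the path argument, so a preprocessing script can choose between them at run time. The sketch below shows one way to do that. The names resolve_dataset_path, load_raw_dataset, and local_path are hypothetical and introduced only for illustration. The datasets import is deferred so the path logic can be exercised without the library or network access.

```python
from typing import Optional


def resolve_dataset_path(hub_id: str, local_path: Optional[str] = None) -> str:
    """Prefer an explicitly supplied local directory, else fall back to the Hub ID."""
    return local_path if local_path else hub_id


def load_raw_dataset(hub_id: str, name: Optional[str] = None,
                     local_path: Optional[str] = None):
    """Load a dataset from a local directory if given, otherwise from the Hub."""
    import datasets  # deferred so resolve_dataset_path works without the library

    return datasets.load_dataset(resolve_dataset_path(hub_id, local_path), name)


print(resolve_dataset_path("openai/gsm8k"))
print(resolve_dataset_path("openai/gsm8k", "/local/path/to/gsm8k"))
```

Calling load_raw_dataset("openai/gsm8k", "main") would then trigger a Hub download, while passing local_path reads from disk instead.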
Code Reference
| Attribute | Detail |
|---|---|
| Source Location | Used across multiple preprocessing scripts |
| Signature | datasets.load_dataset(path, name=None, split=None, ...) |
| Import | import datasets or from datasets import load_dataset |
Usage locations in verl:
| File | Usage |
|---|---|
| examples/data_preprocess/gsm8k.py | datasets.load_dataset("openai/gsm8k", "main") |
| examples/data_preprocess/full_hh_rlhf.py | load_dataset("Dahoas/full-hh-rlhf") and load_dataset(..., split="train[:75%]") |
| examples/data_preprocess/gsm8k_multiturn_w_tool.py | datasets.load_dataset("openai/gsm8k", "main") |
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| path | str | HuggingFace dataset name (e.g., "openai/gsm8k") or local directory path |
| name | str (optional) | Configuration name within the dataset (e.g., "main" for GSM8K) |
| split | str (optional) | Specific split to load (e.g., "train", "train[:75%]"); returns DatasetDict if omitted |
| data_dir | str (optional) | Subdirectory within the dataset to load |
| cache_dir | str (optional) | Local cache directory for downloaded data |
Outputs
| Output | Type | Description |
|---|---|---|
| Return value (no split) | DatasetDict | Dictionary-like object mapping split names to Dataset objects |
| Return value (with split) | Dataset | Single Dataset object for the requested split |
| Cached files | Local files | Downloaded dataset files, cached by default in ~/.cache/huggingface/datasets/ |
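Because the return type depends on whether split is passed, downstream code sometimes needs to handle both shapes. The sketch below normalizes either result into (split name, dataset) pairs. It relies on DatasetDict being a dict subclass. The names iter_splits and default_name are hypothetical, and plain Python objects stand in for real datasets so the snippet runs without a download.

```python
from collections.abc import Mapping
from typing import Any, Iterator, Tuple


def iter_splits(loaded: Any, default_name: str = "train") -> Iterator[Tuple[str, Any]]:
    """Yield (split_name, dataset) pairs for either return shape of load_dataset."""
    if isinstance(loaded, Mapping):
        # DatasetDict case: one entry per split.
        yield from loaded.items()
    else:
        # Single Dataset case: the caller supplies the name to report.
        yield default_name, loaded


# Stand-ins: a dict plays the role of a DatasetDict, a list the role of a Dataset.
fake_dataset_dict = {"train": [1, 2, 3], "test": [4]}
fake_dataset = [1, 2, 3]

pairs_from_dict = list(iter_splits(fake_dataset_dict))
pairs_from_single = list(iter_splits(fake_dataset, default_name="train[:75%]"))
print(pairs_from_dict)
print(pairs_from_single)
```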
Usage Examples
Example 1: Load GSM8K dataset from HuggingFace Hub
import datasets
dataset = datasets.load_dataset("openai/gsm8k", "main")
train_dataset = dataset["train"]
test_dataset = dataset["test"]
print(f"Train size: {len(train_dataset)}") # ~7473
print(f"Test size: {len(test_dataset)}") # ~1319
print(f"Columns: {train_dataset.column_names}") # ['question', 'answer']
Example 2: Load HH-RLHF with split selection
from datasets import load_dataset
# Load 75% of training data for RM training
train_dataset = load_dataset("Dahoas/full-hh-rlhf", split="train[:75%]")
# Load remaining 25% for RM validation
test_dataset = load_dataset("Dahoas/full-hh-rlhf", split="train[-25%:]")
print(f"RM train size: {len(train_dataset)}")
print(f"RM test size: {len(test_dataset)}")
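The two slice strings in this example can also be generated from a single percentage, which keeps the pair of calls in sync. This is a minimal sketch, assuming a hypothetical helper named percent_split_specs. It only builds the strings passed as split and does not load any data.

```python
from typing import Tuple


def percent_split_specs(split: str, head_percent: int) -> Tuple[str, str]:
    """Build a leading slice and a trailing slice over one named split."""
    if not 0 < head_percent < 100:
        raise ValueError("head_percent must be between 1 and 99")
    tail_percent = 100 - head_percent
    head = f"{split}[:{head_percent}%]"
    tail = f"{split}[-{tail_percent}%:]"
    return head, tail


train_spec, test_spec = percent_split_specs("train", 75)
print(train_spec)  # train[:75%]
print(test_spec)   # train[-25%:]
```

The resulting strings would be passed as load_dataset("Dahoas/full-hh-rlhf", split=train_spec) and load_dataset("Dahoas/full-hh-rlhf", split=test_spec).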
Example 3: Load from a local path
import datasets
local_path = "/data/my_local_gsm8k"
dataset = datasets.load_dataset(local_path, "main")
train_dataset = dataset["train"]
Related Pages
- Principle:Volcengine_Verl_Dataset_Acquisition
- Implementation:Volcengine_Verl_GSM8K_Data_Preprocessing -- Uses load_dataset for GSM8K
- Implementation:Volcengine_Verl_HH_RLHF_Data_Preprocessing -- Uses load_dataset for HH-RLHF
- Implementation:Volcengine_Verl_Multi_Turn_Data_Preprocessing -- Uses load_dataset for multi-turn GSM8K
- Implementation:Volcengine_Verl_Dataset_To_Parquet -- Downstream Parquet export
- HuggingFace Datasets Documentation -- External library reference