Implementation:Volcengine Verl Datasets Load Dataset

Field	Value
Knowledge Sources	Wrapper Doc (wraps HuggingFace datasets library)
Domains	Dataset Acquisition, Data Loading, HuggingFace Hub Integration
Last Updated	2026-02-07

Overview

Description

This page documents how verl uses the HuggingFace datasets library to load raw datasets from the HuggingFace Hub or from local paths. verl's data preprocessing scripts call datasets.load_dataset() to acquire source datasets before transforming them into the verl-standard Parquet format.

The function supports multiple loading modes:

Hub loading: datasets.load_dataset("openai/gsm8k", "main") downloads from HuggingFace Hub
Local loading: datasets.load_dataset("/local/path/to/dataset") loads from a local directory
Split selection: datasets.load_dataset("Dahoas/full-hh-rlhf", split="train[:75%]") loads a single split or a slice of one (here, the first 75% of train)

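With no split argument, the call returns a DatasetDict keyed by split name; with a split argument, it returns a single Dataset. Both provide .map() for transformation. Serialization with .to_parquet() and row or column indexing are available on each Dataset, so a DatasetDict is first indexed by split name (for example, dataset["train"]).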
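The load, transform, and export flow can be sketched as follows. This is a minimal illustration, not verl's actual preprocessing code: add_row_metadata, export_split, and the extra_info field layout are hypothetical names chosen for the example. Dataset.map(..., with_indices=True) and Dataset.to_parquet() are existing methods of the datasets library; to_parquet() is called on a single split here.

```python
from functools import partial


def add_row_metadata(example, idx, split):
    """Return a copy of one row with a hypothetical `extra_info` field added."""
    row = dict(example)
    row["extra_info"] = {"split": split, "index": idx}
    return row


def export_split(path, name, split, out_file):
    """Load one split, tag each row, and write the result to a Parquet file."""
    # Imported lazily so the pure transform above works without the library.
    import datasets

    # Passing `split` makes load_dataset return a single Dataset.
    dataset = datasets.load_dataset(path, name, split=split)
    # with_indices=True makes map() call the function as fn(example, idx).
    dataset = dataset.map(partial(add_row_metadata, split=split), with_indices=True)
    dataset.to_parquet(out_file)
    return dataset
```

Calling export_split("openai/gsm8k", "main", "train", "train.parquet") would download the train split and write it locally; the call needs network access and the datasets package installed.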

Usage

import datasets

# Load from HuggingFace Hub
dataset = datasets.load_dataset("openai/gsm8k", "main")

# Load from local path
dataset = datasets.load_dataset("/local/path/to/gsm8k", "main")

External reference: HuggingFace Datasets Documentation

Code Reference

Attribute	Detail
Source Location	Used across multiple preprocessing scripts
Signature	`datasets.load_dataset(path, name=None, split=None, ...)`
Import	`import datasets` or `from datasets import load_dataset`

Usage locations in verl:

File	Usage
`examples/data_preprocess/gsm8k.py`	`datasets.load_dataset("openai/gsm8k", "main")`
`examples/data_preprocess/full_hh_rlhf.py`	`load_dataset("Dahoas/full-hh-rlhf")` and `load_dataset(..., split="train[:75%]")`
`examples/data_preprocess/gsm8k_multiturn_w_tool.py`	`datasets.load_dataset("openai/gsm8k", "main")`

I/O Contract

Inputs

Parameter	Type	Description
`path`	`str`	HuggingFace dataset name (e.g., `"openai/gsm8k"`) or local directory path
`name`	`str` (optional)	Configuration name within the dataset (e.g., `"main"` for GSM8K)
`split`	`str` (optional)	Specific split to load (e.g., `"train"`, `"train[:75%]"`); returns `DatasetDict` if omitted
`data_dir`	`str` (optional)	Subdirectory within the dataset to load
`cache_dir`	`str` (optional)	Local cache directory for downloaded data

Outputs

Output	Type	Description
Return value (no split)	`DatasetDict`	Dictionary-like object mapping split names to `Dataset` objects
Return value (with split)	`Dataset`	Single `Dataset` object for the requested split
Cached files	Local files	Downloaded dataset files, cached by default in `~/.cache/huggingface/datasets/` (override with `cache_dir`)
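Because the return type depends on whether split is passed, downstream code sometimes has to handle both cases. The sketch below shows one way to normalize them. as_split_mapping is a hypothetical helper, not part of datasets or verl; it relies on DatasetDict being a subclass of dict.

```python
def as_split_mapping(loaded, default_split="train"):
    """Return a plain {split_name: dataset} dict for either return type.

    DatasetDict subclasses dict, so an isinstance check is enough to
    tell it apart from a single Dataset.
    """
    if isinstance(loaded, dict):
        return dict(loaded)
    # A single Dataset was returned (a `split` argument was passed),
    # so file it under a caller-chosen label.
    return {default_split: loaded}
```

With this helper, code that iterates over splits can treat load_dataset(path) and load_dataset(path, split="train") uniformly.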

Usage Examples

Example 1: Load GSM8K dataset from HuggingFace Hub

import datasets

dataset = datasets.load_dataset("openai/gsm8k", "main")
train_dataset = dataset["train"]
test_dataset = dataset["test"]

print(f"Train size: {len(train_dataset)}")  # 7473
print(f"Test size: {len(test_dataset)}")    # 1319
print(f"Columns: {train_dataset.column_names}")  # ['question', 'answer']

Example 2: Load HH-RLHF with split selection

from datasets import load_dataset

# Load the first 75% of the train split for RM training
train_dataset = load_dataset("Dahoas/full-hh-rlhf", split="train[:75%]")
# Load the last 25% of the train split for RM validation
test_dataset = load_dataset("Dahoas/full-hh-rlhf", split="train[-25%:]")

print(f"RM train size: {len(train_dataset)}")
print(f"RM test size: {len(test_dataset)}")
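To see why the two slices in Example 2 do not overlap, it helps to work out the row ranges they select. The sketch below approximates the arithmetic, assuming percentages are converted to row indices by rounding to the nearest row and that negative bounds count from the end of the split. percent_slice_bounds is a hypothetical helper written for illustration, not the library's code, and the 1000-row split is a made-up size.

```python
def percent_slice_bounds(num_rows, start_pct=None, stop_pct=None):
    """Approximate the [start, stop) row range selected by a percent slice.

    Assumes round-to-nearest conversion of percentages to row indices,
    with negative bounds counted from the end of the split.
    """
    def to_abs(pct, default):
        if pct is None:
            return default
        idx = int(round(pct * num_rows / 100.0))
        if idx < 0:
            idx = max(num_rows + idx, 0)
        return min(idx, num_rows)

    return to_abs(start_pct, 0), to_abs(stop_pct, num_rows)


# Hypothetical 1000-row split.
train_range = percent_slice_bounds(1000, stop_pct=75)   # "train[:75%]"
val_range = percent_slice_bounds(1000, start_pct=-25)   # "train[-25%:]"
print(train_range, val_range)  # (0, 750) (750, 1000)
```

Under these assumptions the first slice ends where the second begins, so every row lands in exactly one of the two subsets.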

Example 3: Load from a local path

import datasets

local_path = "/data/my_local_gsm8k"
dataset = datasets.load_dataset(local_path, "main")
train_dataset = dataset["train"]
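Since the same load_dataset() call accepts either a Hub ID or a local directory, a script can prefer a local copy and fall back to the Hub. The following is a minimal sketch of that pattern; resolve_dataset_source and load_gsm8k are hypothetical helper names, and the fallback policy is an assumption rather than verl's documented behavior.

```python
import os


def resolve_dataset_source(hub_id, local_path=None):
    """Prefer an existing local directory; otherwise fall back to the Hub ID."""
    if local_path is not None:
        expanded = os.path.expanduser(local_path)
        if os.path.isdir(expanded):
            return expanded
    return hub_id


def load_gsm8k(local_path=None):
    """Load GSM8K from a local copy when available, else from the Hub."""
    import datasets  # imported lazily; requires the `datasets` package

    source = resolve_dataset_source("openai/gsm8k", local_path)
    return datasets.load_dataset(source, "main")
```

load_gsm8k("/data/my_local_gsm8k") would read from that directory if it exists and download from the Hub otherwise.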

Related Pages

Principle:Volcengine_Verl_Dataset_Acquisition
Implementation:Volcengine_Verl_GSM8K_Data_Preprocessing -- Uses load_dataset for GSM8K
Implementation:Volcengine_Verl_HH_RLHF_Data_Preprocessing -- Uses load_dataset for HH-RLHF
Implementation:Volcengine_Verl_Multi_Turn_Data_Preprocessing -- Uses load_dataset for multi-turn GSM8K
Implementation:Volcengine_Verl_Dataset_To_Parquet -- Downstream Parquet export
HuggingFace Datasets Documentation -- External library reference
