
Implementation:Volcengine Verl Datasets Load Dataset

From Leeroopedia


Knowledge Sources: Wrapper Doc (wraps HuggingFace datasets library)
Domains: Dataset Acquisition, Data Loading, HuggingFace Hub Integration
Last Updated: 2026-02-07

Overview

Description

This implementation documents how verl uses the HuggingFace datasets library to load raw datasets from the HuggingFace Hub or from local paths. The datasets.load_dataset() function is a core dependency used across all verl data preprocessing scripts to acquire source datasets before transforming them into the verl-standard Parquet format.

The function supports multiple loading modes:

  • Hub loading: datasets.load_dataset("openai/gsm8k", "main") downloads from HuggingFace Hub
  • Local loading: datasets.load_dataset("/local/path/to/dataset") loads from a local directory
  • Split selection: datasets.load_dataset("Dahoas/full-hh-rlhf", split="train[:75%]") loads a specific split or subset

The loaded Dataset or DatasetDict object provides .map() for transformation, .to_parquet() for serialization, and standard indexing for data access.

Usage

import datasets

# Load from HuggingFace Hub
dataset = datasets.load_dataset("openai/gsm8k", "main")

# Load from local path
dataset = datasets.load_dataset("/local/path/to/gsm8k", "main")

External reference: HuggingFace Datasets Documentation

Code Reference

Source Location: Used across multiple preprocessing scripts
Signature: datasets.load_dataset(path, name=None, split=None, ...)
Import: import datasets or from datasets import load_dataset

Usage locations in verl:

examples/data_preprocess/gsm8k.py: datasets.load_dataset("openai/gsm8k", "main")
examples/data_preprocess/full_hh_rlhf.py: load_dataset("Dahoas/full-hh-rlhf") and load_dataset(..., split="train[:75%]")
examples/data_preprocess/gsm8k_multiturn_w_tool.py: datasets.load_dataset("openai/gsm8k", "main")

I/O Contract

Inputs

path (str): HuggingFace dataset name (e.g., "openai/gsm8k") or local directory path
name (str, optional): Configuration name within the dataset (e.g., "main" for GSM8K)
split (str, optional): Specific split to load (e.g., "train", "train[:75%]"); returns a DatasetDict if omitted
data_dir (str, optional): Subdirectory within the dataset to load
cache_dir (str, optional): Local cache directory for downloaded data

Outputs

Return value without split (DatasetDict): Dictionary-like object mapping split names to Dataset objects
Return value with split (Dataset): Single Dataset object for the requested split
Cached files: Downloaded dataset files cached in ~/.cache/huggingface/datasets/

Usage Examples

Example 1: Load GSM8K dataset from HuggingFace Hub

import datasets

dataset = datasets.load_dataset("openai/gsm8k", "main")
train_dataset = dataset["train"]
test_dataset = dataset["test"]

print(f"Train size: {len(train_dataset)}")  # ~7473
print(f"Test size: {len(test_dataset)}")    # ~1319
print(f"Columns: {train_dataset.column_names}")  # ['question', 'answer']

Example 2: Load HH-RLHF with split selection

from datasets import load_dataset

# Load 75% of training data for RM training
train_dataset = load_dataset("Dahoas/full-hh-rlhf", split="train[:75%]")
# Load remaining 25% for RM validation
test_dataset = load_dataset("Dahoas/full-hh-rlhf", split="train[-25%:]")

print(f"RM train size: {len(train_dataset)}")
print(f"RM test size: {len(test_dataset)}")

Example 3: Load from a local path

import datasets

local_path = "/data/my_local_gsm8k"
dataset = datasets.load_dataset(local_path, "main")
train_dataset = dataset["train"]
