Principle:Volcengine Verl Dataset Acquisition
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, Multimodal |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
The process of loading raw datasets from HuggingFace Hub or local storage as the first step in any data preprocessing pipeline.
Description
Dataset Acquisition is the entry point for all data preprocessing workflows in verl. It uses the HuggingFace datasets library to load datasets from:
- HuggingFace Hub: Via dataset identifiers (e.g., openai/gsm8k)
- Local paths: Via file system paths for custom or pre-downloaded datasets
The loaded dataset provides standardized access to train/test splits and column data, regardless of the underlying storage format.
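The two source types above can be routed through a single call site. The sketch below is illustrative only and is not taken from verl: `build_load_kwargs` and `acquire` are hypothetical helper names, and the extension-to-builder mapping is an assumption about which local file formats a script might want to support. It relies on the `path`, `name`, and `data_files` parameters of `datasets.load_dataset` and on the library's packaged `parquet`, `json`, and `csv` builders.

```python
import os

# Hypothetical mapping from file extension to a packaged `datasets` builder name.
_EXT_TO_BUILDER = {
    ".parquet": "parquet",
    ".json": "json",
    ".jsonl": "json",
    ".csv": "csv",
}


def build_load_kwargs(data_source, config_name=None):
    """Translate a Hub identifier or local path into load_dataset keyword arguments.

    Hypothetical helper: a single local file is routed to a packaged builder via
    `data_files`; anything else is passed through as `path` (Hub identifier or
    local dataset directory), with the optional configuration as `name`.
    """
    ext = os.path.splitext(data_source)[1].lower()
    if ext in _EXT_TO_BUILDER:
        return {"path": _EXT_TO_BUILDER[ext], "data_files": data_source}

    kwargs = {"path": data_source}
    if config_name is not None:
        kwargs["name"] = config_name
    return kwargs


def acquire(data_source, config_name=None):
    """Load a dataset from either source type (needs the `datasets` package)."""
    import datasets  # imported lazily so the helper above has no hard dependency

    return datasets.load_dataset(**build_load_kwargs(data_source, config_name))
```

For example, `acquire("openai/gsm8k", "main")` would request the Hub dataset with the configuration name `main`, while `acquire("/data/custom/train.parquet")` would read a local Parquet file (the path here is made up). When a single file is passed through `data_files`, the library places its rows in a `train` split by default.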
Usage
Use dataset acquisition as the first step in any data preprocessing script. All verl preprocessing scripts start with datasets.load_dataset().
Theoretical Basis
Dataset acquisition is a simple data loading step:
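# Abstract dataset acquisition
import datasets

# data_source: Hub identifier or local path; config_name: optional dataset configuration
dataset = datasets.load_dataset(data_source, config_name)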
# Access splits
train_data = dataset["train"]
test_data = dataset["test"]
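Not every dataset ships both a train and a test split, so it can help to inspect what was actually loaded before indexing into it. The following is a minimal sketch under that assumption; `summarize_splits` is a hypothetical helper name, not part of verl or the datasets library. It only uses the mapping interface of the loaded object plus `len()` and the `column_names` attribute of each split.

```python
def summarize_splits(dataset):
    """Return {split_name: {"rows": int, "columns": [str, ...]}} for a loaded dataset.

    Hypothetical helper: works on any mapping of split names to objects that
    support len() and expose a `column_names` attribute, which is the shape of
    the object returned by datasets.load_dataset when no split is requested.
    """
    return {
        name: {"rows": len(split), "columns": list(split.column_names)}
        for name, split in dataset.items()
    }


def require_splits(dataset, names=("train", "test")):
    """Raise a clear error if any expected split is missing from the dataset."""
    missing = [name for name in names if name not in dataset]
    if missing:
        raise KeyError(f"missing splits: {missing}; available: {list(dataset.keys())}")
    return tuple(dataset[name] for name in names)
```

A preprocessing script could call `train_data, test_data = require_splits(dataset)` in place of direct indexing, which turns a missing split into an explicit error message rather than a bare `KeyError`.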