Principle: Volcengine verl Parquet Export
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Storage, NLP |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Parquet export is the final step of data preprocessing: it serializes the transformed datasets to Apache Parquet format for efficient loading during training.
Description
Parquet Export converts the processed HuggingFace Dataset objects into Apache Parquet files. Parquet is chosen for:
- Columnar storage: Efficient for reading specific columns during training
- Compression: Reduces storage requirements
- Schema preservation: Maintains complex nested types (lists of dicts, images)
- Broad compatibility: Loadable by pandas, Arrow, and HuggingFace datasets
The export produces separate files for each data split (train, test) and optionally copies them to HDFS for distributed access.
Usage
Use parquet export as the final step in any data preprocessing pipeline. Output files are consumed by verl's dataset classes (RLHFDataset, SFTDataset).
Theoretical Basis
Parquet export uses HuggingFace's built-in serialization:
# Abstract parquet export
import os

processed_dataset = raw_dataset.map(transform_fn)
for split in ["train", "test"]:
    output_path = os.path.join(save_dir, f"{split}.parquet")
    processed_dataset[split].to_parquet(output_path)
    # Optional: copy to HDFS for distributed access
    if hdfs_dir:
        copy_to_hdfs(output_path, hdfs_dir)
Related Pages
Implemented By