Implementation:Hpcaitech ColossalAI Prepare Dataset Preference
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Engineering |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A command-line tool, provided by ColossalChat, that tokenizes preference-pair datasets for DPO (Direct Preference Optimization) alignment training.
Description
This mode reuses the prepare_dataset.py script that also prepares SFT data, selected with --type preference. It reads JSONL files containing chosen/rejected response pairs and tokenizes each pair into two parallel sequences, one for the chosen response and one for the rejected response, each with its own loss mask.
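To make "parallel sequences with independent loss masks" concrete, the following is a minimal sketch rather than the script's actual code. The record keys (`context`, `chosen`, `rejected`), the `toy_tokenize` stand-in, and the `build_sequence` helper are all assumptions made for illustration; the real script uses the tokenizer and conversation template passed on the command line.

```python
import json

# Hypothetical preference record; the key names are an assumption for illustration.
record = json.loads(
    '{"context": "What is DPO?", '
    '"chosen": "A preference-based alignment method.", '
    '"rejected": "No idea."}'
)


def toy_tokenize(text):
    """Stand-in tokenizer (assumption): one integer id per whitespace-separated word."""
    return [sum(ord(c) for c in word) % 1000 for word in text.split()]


def build_sequence(prompt, response):
    """Concatenate prompt and response ids; the mask is 1 only on response tokens."""
    prompt_ids = toy_tokenize(prompt)
    response_ids = toy_tokenize(response)
    input_ids = prompt_ids + response_ids
    loss_mask = [0] * len(prompt_ids) + [1] * len(response_ids)
    return input_ids, loss_mask


# The same prompt is paired with each response, giving two parallel sequences.
chosen_ids, chosen_mask = build_sequence(record["context"], record["chosen"])
rejected_ids, rejected_mask = build_sequence(record["context"], record["rejected"])

example = {
    "chosen_input_ids": chosen_ids,
    "chosen_loss_mask": chosen_mask,
    "rejected_input_ids": rejected_ids,
    "rejected_loss_mask": rejected_mask,
}
```

Because each mask is built from its own response, the chosen and rejected sequences may differ in length while still sharing the same prompt prefix.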
Usage
Run before DPO training to create tokenized preference datasets.
Code Reference
Source Location
- Repository: ColossalAI
- File: applications/ColossalChat/examples/data_preparation_scripts/prepare_dataset.py
- Lines: 1-272
Signature
```bash
python prepare_dataset.py \
    --type preference \
    --data_input_dirs /path/to/preference/data \
    --tokenizer_dir /path/to/tokenizer \
    --data_output_dirs /path/to/output \
    --conversation_template_config /path/to/template.json \
    --max_length 8192
```
Import
# CLI tool - no Python import needed
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --type | str | Yes | Must be "preference" for DPO |
| --data_input_dirs | str | Yes | Directory (or directories) containing JSONL files with chosen/rejected pairs |
| --tokenizer_dir | str | Yes | Tokenizer directory |
| --data_output_dirs | str | Yes | Output directory |
| --conversation_template_config | str | Yes | Conversation template JSON |
| --max_length | int | No | Maximum sequence length (default: 8192) |
Outputs
| Name | Type | Description |
|---|---|---|
| Arrow datasets | Directory | Tokenized datasets with chosen_input_ids, chosen_loss_mask, rejected_input_ids, rejected_loss_mask fields |
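As a hedged illustration of the output contract, the sketch below checks one tokenized row for the four fields listed above. The helper name `check_preference_row` is hypothetical, and the checks assume that each loss mask aligns one-to-one with its input ids and that sequences are bounded by `--max_length`; neither assumption is confirmed by this page.

```python
# Field pairs taken from the Outputs table above.
FIELD_PAIRS = (
    ("chosen_input_ids", "chosen_loss_mask"),
    ("rejected_input_ids", "rejected_loss_mask"),
)


def check_preference_row(row, max_length=8192):
    """Return a list of problems found in one tokenized row (empty if none)."""
    problems = []
    for ids_key, mask_key in FIELD_PAIRS:
        if ids_key not in row or mask_key not in row:
            problems.append(f"missing {ids_key} or {mask_key}")
            continue
        ids, mask = row[ids_key], row[mask_key]
        # Assumption: one mask entry per token id.
        if len(ids) != len(mask):
            problems.append(f"{ids_key} and {mask_key} differ in length")
        # Assumption: sequences are bounded by the --max_length setting.
        if len(ids) > max_length:
            problems.append(f"{ids_key} exceeds max_length={max_length}")
        # A mask with no selected tokens would contribute nothing to the loss.
        if not any(mask):
            problems.append(f"{mask_key} selects no tokens")
    return problems
```

A check like this could be applied to a handful of rows after loading the output directory, for example before starting a long DPO run.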
Usage Examples
```bash
python prepare_dataset.py \
    --type preference \
    --data_input_dirs /data/raw/hh-rlhf \
    --tokenizer_dir meta-llama/Llama-2-7b-hf \
    --data_output_dirs /data/tokenized/preference \
    --conversation_template_config conversation_template/llama2.json \
    --max_length 4096
```
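When the preparation step is driven from a Python pipeline rather than a shell, the same invocation can be assembled as an argument list. This is a sketch only: the helper `build_prepare_command` is hypothetical, and it uses just the flags documented on this page.

```python
import shlex


def build_prepare_command(input_dirs, tokenizer_dir, output_dir,
                          template_config, max_length=8192):
    """Assemble the argv list for preference-mode tokenization (hypothetical helper)."""
    return [
        "python", "prepare_dataset.py",
        "--type", "preference",
        "--data_input_dirs", input_dirs,
        "--tokenizer_dir", tokenizer_dir,
        "--data_output_dirs", output_dir,
        "--conversation_template_config", template_config,
        "--max_length", str(max_length),
    ]


cmd = build_prepare_command(
    "/data/raw/hh-rlhf",
    "meta-llama/Llama-2-7b-hf",
    "/data/tokenized/preference",
    "conversation_template/llama2.json",
    max_length=4096,
)
# Print a shell-quoted form of the command for logging or manual inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains prepare_dataset.py, which avoids shell quoting issues with paths.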
Related Pages
Implements Principle
Environment and Heuristic Links