Implementation:Hpcaitech ColossalAI Prepare Dataset Preference

Knowledge Sources	ColossalAI
Domains	NLP, Data_Engineering
Last Updated	2026-02-09 00:00 GMT

Overview

CLI tool for tokenizing preference pair datasets for DPO alignment training, provided by ColossalChat.

Description

This implementation uses the same prepare_dataset.py script as SFT data preparation, invoked with --type preference. In this mode the script reads JSONL files containing chosen/rejected response pairs and tokenizes each pair into two parallel sequences, each with its own loss mask.
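As an illustration of the kind of input this mode reads, the sketch below writes and re-reads a small JSONL file of preference pairs and checks each record for the two response fields. The `chosen` and `rejected` keys follow the description above. The `context` key, the `from`/`content` message layout, and the helper names are assumptions made for illustration and are not confirmed by this page, so verify them against the ColossalChat data documentation before relying on them.

```python
import json
import os
import tempfile

# Assumed record layout (hypothetical): a shared prompt plus two candidate replies.
SAMPLE_RECORDS = [
    {
        "context": [{"from": "user", "content": "What is 2 + 2?"}],
        "chosen": [{"from": "assistant", "content": "2 + 2 equals 4."}],
        "rejected": [{"from": "assistant", "content": "2 + 2 equals 5."}],
    }
]

REQUIRED_KEYS = ("chosen", "rejected")


def validate_preference_record(record):
    """Return a list of problems found in one record (empty if it looks usable)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in record:
            problems.append(f"missing key: {key}")
        elif not record[key]:
            problems.append(f"empty value: {key}")
    return problems


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def roundtrip_sample():
    """Write the sample records to a temporary JSONL file and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "preference.jsonl")
        write_jsonl(path, SAMPLE_RECORDS)
        return read_jsonl(path)
```

Running a check like `validate_preference_record` over every line before tokenization can surface malformed records early, instead of partway through a long preprocessing run.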

Usage

Run before DPO training to create tokenized preference datasets.

Code Reference

Source Location

Repository: ColossalAI
File: applications/ColossalChat/examples/data_preparation_scripts/prepare_dataset.py
Lines: 1-272

Signature

python prepare_dataset.py \
    --type preference \
    --data_input_dirs /path/to/preference/data \
    --tokenizer_dir /path/to/tokenizer \
    --data_output_dirs /path/to/output \
    --conversation_template_config /path/to/template.json \
    --max_length 8192

Import

# CLI tool - no Python import needed

I/O Contract

Inputs

Name	Type	Required	Description
--type	str	Yes	Must be "preference" for DPO
--data_input_dirs	str	Yes	Input directory (or directories) containing JSONL files with chosen/rejected pairs
--tokenizer_dir	str	Yes	Tokenizer directory
--data_output_dirs	str	Yes	Output directory
--conversation_template_config	str	Yes	Conversation template JSON
--max_length	int	No	Maximum sequence length (default: 8192)

Outputs

Name	Type	Description
Arrow datasets	Directory	Tokenized datasets with chosen_input_ids, chosen_loss_mask, rejected_input_ids, rejected_loss_mask fields
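To make the output fields concrete, here is a minimal sketch that checks one tokenized record for internal consistency. The four field names come from the table above. The toy token IDs, the assumption that each loss mask has one entry per token, the convention that a nonzero mask value marks a token contributing to the loss, and the helper name `check_preference_record` are all illustrative assumptions. Loading the Arrow files themselves is left out.

```python
def check_preference_record(record):
    """Summarize one record; raise ValueError if a mask and its IDs differ in length."""
    summary = {}
    for side in ("chosen", "rejected"):
        ids = record[f"{side}_input_ids"]
        mask = record[f"{side}_loss_mask"]
        if len(ids) != len(mask):
            raise ValueError(
                f"{side}: {len(ids)} token IDs but {len(mask)} mask entries"
            )
        summary[side] = {
            "tokens": len(ids),
            # Count positions whose mask value is nonzero (assumed to be supervised).
            "supervised": sum(1 for m in mask if m),
        }
    return summary


# Toy record with made-up token IDs; the prompt tokens are masked out with 0.
toy_record = {
    "chosen_input_ids": [1, 15, 27, 42, 2],
    "chosen_loss_mask": [0, 0, 0, 1, 1],
    "rejected_input_ids": [1, 15, 27, 99, 77, 2],
    "rejected_loss_mask": [0, 0, 0, 1, 1, 1],
}

summary = check_preference_record(toy_record)
```

The chosen and rejected sequences are allowed to differ in length here, which is why each side carries its own mask rather than sharing one.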

Usage Examples

python prepare_dataset.py \
    --type preference \
    --data_input_dirs /data/raw/hh-rlhf \
    --tokenizer_dir meta-llama/Llama-2-7b-hf \
    --data_output_dirs /data/tokenized/preference \
    --conversation_template_config conversation_template/llama2.json \
    --max_length 4096
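When this step is driven from a Python pipeline rather than a shell, the same invocation can be assembled as an argument list. The sketch below uses only the flags documented on this page. The helper name `build_prepare_command` is hypothetical, and the script path is assumed to be resolvable from the working directory.

```python
import sys


def build_prepare_command(input_dir, tokenizer_dir, output_dir, template_config,
                          max_length=8192, script="prepare_dataset.py"):
    """Build the argv list for a preference-mode run of prepare_dataset.py."""
    return [
        sys.executable, script,
        "--type", "preference",
        "--data_input_dirs", input_dir,
        "--tokenizer_dir", tokenizer_dir,
        "--data_output_dirs", output_dir,
        "--conversation_template_config", template_config,
        "--max_length", str(max_length),
    ]


cmd = build_prepare_command(
    input_dir="/data/raw/hh-rlhf",
    tokenizer_dir="meta-llama/Llama-2-7b-hf",
    output_dir="/data/tokenized/preference",
    template_config="conversation_template/llama2.json",
    max_length=4096,
)
# To execute the command: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list instead of a single shell string avoids quoting problems when paths contain spaces.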

Related Pages

Implements Principle

Principle:Hpcaitech_ColossalAI_Preference_Data_Preparation

Environment and Heuristic Links

Environment:Hpcaitech_ColossalAI_ColossalChat_Training_Environment
