Implementation:Hpcaitech ColossalAI Prepare Dataset Preference

Knowledge Sources
Domains: NLP, Data_Engineering
Last Updated: 2026-02-09 00:00 GMT

Overview

A command-line tool, provided by ColossalChat, that tokenizes preference-pair datasets for Direct Preference Optimization (DPO) alignment training.

Description

Preference preparation uses the same prepare_dataset.py script as supervised fine-tuning (SFT), invoked with --type preference. In this mode the script reads JSONL files containing chosen/rejected response pairs and tokenizes each pair into two parallel sequences, each with its own independent loss mask.
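The idea of parallel sequences with independent loss masks can be sketched in a few lines of Python. This is a minimal illustration, not the script's actual code: the whitespace tokenizer, the function names, and the choice to truncate at max_length are all assumptions made for the example. The real script uses the tokenizer in --tokenizer_dir and the conversation template, and may handle over-length samples differently.

```python
from typing import Dict, List, Tuple


def toy_tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Hypothetical whitespace tokenizer: each unseen word gets the next free id."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def encode_with_mask(
    prompt: str, response: str, vocab: Dict[str, int], max_length: int
) -> Tuple[List[int], List[int]]:
    """Concatenate prompt and response; mark only response tokens for the loss."""
    prompt_ids = toy_tokenize(prompt, vocab)
    response_ids = toy_tokenize(response, vocab)
    input_ids = (prompt_ids + response_ids)[:max_length]
    # 0 = prompt token (ignored by the loss), 1 = response token (contributes)
    loss_mask = ([0] * len(prompt_ids) + [1] * len(response_ids))[:max_length]
    return input_ids, loss_mask


def encode_preference_pair(
    prompt: str, chosen: str, rejected: str, vocab: Dict[str, int], max_length: int = 8192
) -> Dict[str, List[int]]:
    """Encode one pair into the four fields named in this page's output contract."""
    chosen_ids, chosen_mask = encode_with_mask(prompt, chosen, vocab, max_length)
    rejected_ids, rejected_mask = encode_with_mask(prompt, rejected, vocab, max_length)
    return {
        "chosen_input_ids": chosen_ids,
        "chosen_loss_mask": chosen_mask,
        "rejected_input_ids": rejected_ids,
        "rejected_loss_mask": rejected_mask,
    }


record = encode_preference_pair(
    prompt="What is 2 + 2 ?",
    chosen="It is 4 .",
    rejected="I do not know .",
    vocab={},
    max_length=16,
)
```

Both sequences share the same prompt prefix, but each mask is sized to its own response, which is why the two masks are described as independent.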

Usage

Run this script before DPO training to produce the tokenized preference datasets that training consumes.

Code Reference

Source Location

  • Repository: ColossalAI
  • File: applications/ColossalChat/examples/data_preparation_scripts/prepare_dataset.py
  • Lines: 1-272

Signature

python prepare_dataset.py \
    --type preference \
    --data_input_dirs /path/to/preference/data \
    --tokenizer_dir /path/to/tokenizer \
    --data_output_dirs /path/to/output \
    --conversation_template_config /path/to/template.json \
    --max_length 8192

Import

# CLI tool - no Python import needed

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| --type | str | Yes | Must be "preference" for DPO |
| --data_input_dirs | str | Yes | Paths to JSONL with chosen/rejected pairs |
| --tokenizer_dir | str | Yes | Tokenizer directory |
| --data_output_dirs | str | Yes | Output directory |
| --conversation_template_config | str | Yes | Conversation template JSON |
| --max_length | int | No | Maximum sequence length (default: 8192) |
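For readers who want to wrap or validate these arguments from Python, the input contract above could be declared with argparse roughly as follows. This is a sketch that mirrors the table on this page, not the script's actual parser; the function names build_parser and parse_preference_args are hypothetical, and the real script may define additional or differently named options.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented inputs (not the real source)."""
    parser = argparse.ArgumentParser(description="Sketch of the documented CLI contract")
    parser.add_argument("--type", type=str, required=True)
    parser.add_argument("--data_input_dirs", type=str, required=True)
    parser.add_argument("--tokenizer_dir", type=str, required=True)
    parser.add_argument("--data_output_dirs", type=str, required=True)
    parser.add_argument("--conversation_template_config", type=str, required=True)
    # Optional, with the default stated in the table above
    parser.add_argument("--max_length", type=int, default=8192)
    return parser


def parse_preference_args(argv: List[str]) -> argparse.Namespace:
    """Parse argv and reject any mode other than the preference mode described here."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.type != "preference":
        parser.error('--type must be "preference" for DPO data preparation')
    return args


args = parse_preference_args(
    [
        "--type", "preference",
        "--data_input_dirs", "/path/to/preference/data",
        "--tokenizer_dir", "/path/to/tokenizer",
        "--data_output_dirs", "/path/to/output",
        "--conversation_template_config", "/path/to/template.json",
    ]
)
```

Omitting --max_length leaves it at the documented default, while omitting any required flag makes argparse exit with a usage error.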

Outputs

| Name | Type | Description |
|---|---|---|
| Arrow datasets | Directory | Tokenized datasets with chosen_input_ids, chosen_loss_mask, rejected_input_ids, rejected_loss_mask fields |
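A quick sanity check on individual output records can catch preparation problems before training starts. The sketch below assumes each record is available as a plain dict with the four fields listed above, and that each loss mask has the same length as its input_ids; the latter is an assumption of this example, not something this page states. The helper name validate_preference_record is hypothetical.

```python
from typing import Dict, List, Sequence

REQUIRED_FIELDS = (
    "chosen_input_ids",
    "chosen_loss_mask",
    "rejected_input_ids",
    "rejected_loss_mask",
)


def validate_preference_record(record: Dict[str, Sequence[int]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record looks sane."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if problems:
        return problems
    for side in ("chosen", "rejected"):
        ids = record[f"{side}_input_ids"]
        mask = record[f"{side}_loss_mask"]
        # Assumption: the mask is aligned token-for-token with the ids
        if len(ids) != len(mask):
            problems.append(f"{side}: {len(ids)} ids but {len(mask)} mask entries")
        # A mask of all zeros would contribute nothing to the loss
        if not any(mask):
            problems.append(f"{side}: loss mask selects no tokens")
    return problems


good = {
    "chosen_input_ids": [1, 2, 3, 4],
    "chosen_loss_mask": [0, 0, 1, 1],
    "rejected_input_ids": [1, 2, 5],
    "rejected_loss_mask": [0, 0, 1],
}
bad = dict(good, rejected_loss_mask=[0, 0])
```

Calling validate_preference_record(good) returns an empty list, while the bad record reports both a length mismatch and an empty mask on the rejected side.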

Usage Examples

python prepare_dataset.py \
    --type preference \
    --data_input_dirs /data/raw/hh-rlhf \
    --tokenizer_dir meta-llama/Llama-2-7b-hf \
    --data_output_dirs /data/tokenized/preference \
    --conversation_template_config conversation_template/llama2.json \
    --max_length 4096
