Implementation: Hpcaitech ColossalAI Prepare Dataset SFT
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Engineering |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A ColossalChat CLI tool that tokenizes raw conversational datasets and formats them for SFT training.
Description
The prepare_dataset.py script is the primary data preparation entry point for ColossalChat training workflows. It reads raw JSONL datasets containing conversations, applies model-specific conversation templates, tokenizes using a HuggingFace tokenizer, and saves the result as Arrow-format datasets. It supports four dataset types: sft, prompt, preference, and kto.
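Raw input is JSONL, one conversation per line. The sketch below writes and re-reads such a file with only the standard library; the record schema shown (the "messages", "from", and "content" keys) is an illustrative assumption, not taken from the script, so check ColossalChat's data documentation for the exact field names your version expects.

```python
import json
from pathlib import Path

# Hypothetical raw-record schema: the key names below are illustrative
# assumptions, not guaranteed to match what prepare_dataset.py expects.
records = [
    {
        "messages": [
            {"from": "human", "content": "What is SFT?"},
            {"from": "assistant", "content": "Supervised fine-tuning on labeled conversations."},
        ]
    },
]

path = Path("sample_sft.jsonl")
with path.open("w", encoding="utf-8") as f:
    for rec in records:
        # JSONL: each line is one self-contained JSON object.
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

loaded = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]
print(len(loaded))  # one conversation per line
```

Keeping each conversation on its own line lets the script stream large corpora without loading the whole file into memory.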
Usage
Run this script as a CLI command before starting SFT training. It produces tokenized datasets that are loaded by load_tokenized_dataset() during training.
Code Reference
Source Location
- Repository: ColossalAI
- File: applications/ColossalChat/examples/data_preparation_scripts/prepare_dataset.py
- Lines: 1-272
Signature
```shell
python prepare_dataset.py \
    --type sft \
    --data_input_dirs /path/to/raw/data \
    --tokenizer_dir /path/to/tokenizer \
    --data_output_dirs /path/to/output \
    --conversation_template_config /path/to/template.json \
    --max_length 8192 \
    --max_samples 1000
```
Import
```shell
# CLI tool - no Python import needed
# Run via: python prepare_dataset.py --type sft ...
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --type | str | Yes | Dataset type: "sft", "prompt", "preference", or "kto" |
| --data_input_dirs | str | Yes | Comma-separated paths to raw JSONL datasets |
| --tokenizer_dir | str | Yes | Path to HuggingFace tokenizer directory |
| --data_output_dirs | str | Yes | Output directory for tokenized Arrow datasets |
| --conversation_template_config | str | Yes | Path to JSON conversation template config |
| --max_length | int | No | Maximum sequence length (default: 8192) |
| --max_samples | int | No | Maximum samples per dataset (default: unlimited) |
Outputs
| Name | Type | Description |
|---|---|---|
| Arrow datasets | Directory | Tokenized datasets saved via save_to_disk(), containing input_ids, attention_mask, and loss_mask fields |
Usage Examples
Basic SFT Data Preparation
```shell
# Prepare SFT dataset with LLaMA conversation template
python applications/ColossalChat/examples/data_preparation_scripts/prepare_dataset.py \
    --type sft \
    --data_input_dirs /data/raw/alpaca,/data/raw/sharegpt \
    --tokenizer_dir meta-llama/Llama-2-7b-hf \
    --data_output_dirs /data/tokenized/sft \
    --conversation_template_config conversation_template/llama2.json \
    --max_length 4096
```