Implementation: Hpcaitech ColossalAI Prepare Dataset SFT
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Engineering |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A ColossalChat CLI tool that tokenizes raw conversational datasets and formats them for SFT training.
Description
The prepare_dataset.py script is the primary data preparation entry point for ColossalChat training workflows. It reads raw JSONL datasets containing conversations, applies model-specific conversation templates, tokenizes using a HuggingFace tokenizer, and saves the result as Arrow-format datasets. It supports four dataset types: sft, prompt, preference, and kto.
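Raw input is JSONL, one conversation per line. The sketch below writes and re-reads such a file with only the standard library; the record schema shown (the "messages", "from", and "content" keys) is an illustrative assumption, not taken from the script, so check ColossalChat's data documentation for the exact field names your version expects.

```python
import json
from pathlib import Path

# Hypothetical raw-record schema: the key names below are illustrative
# assumptions, not guaranteed to match what prepare_dataset.py expects.
records = [
    {
        "messages": [
            {"from": "human", "content": "What is SFT?"},
            {"from": "assistant", "content": "Supervised fine-tuning on labeled conversations."},
        ]
    },
]

path = Path("sample_sft.jsonl")
with path.open("w", encoding="utf-8") as f:
    for rec in records:
        # JSONL: each line is one self-contained JSON object.
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

loaded = [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()]
print(len(loaded))  # one conversation per line
```

Keeping each conversation on its own line lets the script stream large corpora without loading the whole file into memory.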
Usage
Run this script as a CLI command before starting SFT training. It produces tokenized datasets that are loaded by load_tokenized_dataset() during training.
Code Reference
Source Location
- Repository: ColossalAI
- File: applications/ColossalChat/examples/data_preparation_scripts/prepare_dataset.py
- Lines: 1-272
Signature
```shell
python prepare_dataset.py \
    --type sft \
    --data_input_dirs /path/to/raw/data \
    --tokenizer_dir /path/to/tokenizer \
    --data_output_dirs /path/to/output \
    --conversation_template_config /path/to/template.json \
    --max_length 8192 \
    --max_samples 1000
```
Import
```shell
# CLI tool - no Python import needed
# Run via: python prepare_dataset.py --type sft ...
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --type | str | Yes | Dataset type: "sft", "prompt", "preference", or "kto" |
| --data_input_dirs | str | Yes | Comma-separated paths to raw JSONL datasets |
| --tokenizer_dir | str | Yes | Path to HuggingFace tokenizer directory |
| --data_output_dirs | str | Yes | Output directory for tokenized Arrow datasets |
| --conversation_template_config | str | Yes | Path to JSON conversation template config |
| --max_length | int | No | Maximum sequence length (default: 8192) |
| --max_samples | int | No | Maximum samples per dataset (default: unlimited) |
Outputs
| Name | Type | Description |
|---|---|---|
| Arrow datasets | Directory | Tokenized datasets saved via save_to_disk(), containing input_ids, attention_mask, and loss_mask fields |
Usage Examples
Basic SFT Data Preparation
```shell
# Prepare SFT dataset with LLaMA conversation template
python applications/ColossalChat/examples/data_preparation_scripts/prepare_dataset.py \
    --type sft \
    --data_input_dirs /data/raw/alpaca,/data/raw/sharegpt \
    --tokenizer_dir meta-llama/Llama-2-7b-hf \
    --data_output_dirs /data/tokenized/sft \
    --conversation_template_config conversation_template/llama2.json \
    --max_length 4096
```