
Implementation:Hpcaitech ColossalAI Prepare Dataset SFT

From Leeroopedia


Knowledge Sources
Domains NLP, Data_Engineering
Last Updated 2026-02-09 00:00 GMT

Overview

A CLI tool, provided by ColossalChat, that tokenizes and formats raw conversational datasets for SFT training.

Description

The prepare_dataset.py script is the primary data preparation entry point for ColossalChat training workflows. It reads raw JSONL datasets containing conversations, applies model-specific conversation templates, tokenizes using a HuggingFace tokenizer, and saves the result as Arrow-format datasets. It supports four dataset types: sft, prompt, preference, and kto.
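To make the input side concrete, the sketch below writes one raw conversational record as JSONL. The field names ("messages", "from", "content") are illustrative assumptions; the exact schema the script expects is governed by the conversation template config, so treat this as a shape example rather than the authoritative format.

```python
import json
import os
import tempfile

# Hypothetical raw SFT record; field names are illustrative, the real
# schema depends on the conversation template config in use.
record = {
    "messages": [
        {"from": "human", "content": "What is supervised fine-tuning?"},
        {"from": "assistant",
         "content": "Training a base model on labeled prompt-response pairs."},
    ]
}

# Raw datasets are JSONL: one JSON object per line.
path = os.path.join(tempfile.mkdtemp(), "sft_sample.jsonl")
with open(path, "w") as f:
    f.write(json.dumps(record) + "\n")
```

A directory containing files like this is what `--data_input_dirs` points at.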

Usage

Run this script as a CLI command before starting SFT training. It produces tokenized datasets that are loaded by load_tokenized_dataset() during training.
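Conceptually, SFT tokenization concatenates prompt and response tokens and builds a loss mask that excludes the prompt, so the training loss is computed only on assistant tokens. The following is a pure-Python sketch of that idea with made-up token ids, not the script's actual implementation.

```python
# Conceptual sketch (not the script's actual code): the loss_mask is 0
# over prompt tokens and 1 over response tokens, so only the assistant
# response contributes to the SFT loss.
prompt_ids = [101, 7592, 2088]    # illustrative token ids for the prompt
response_ids = [2054, 2003, 102]  # illustrative token ids for the response

input_ids = prompt_ids + response_ids
attention_mask = [1] * len(input_ids)
loss_mask = [0] * len(prompt_ids) + [1] * len(response_ids)
```

These three parallel lists correspond to the `input_ids`, `attention_mask`, and `loss_mask` fields listed in the Outputs section.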

Code Reference

Source Location

  • Repository: ColossalAI
  • File: applications/ColossalChat/examples/data_preparation_scripts/prepare_dataset.py
  • Lines: 1-272

Signature

python prepare_dataset.py \
    --type sft \
    --data_input_dirs /path/to/raw/data \
    --tokenizer_dir /path/to/tokenizer \
    --data_output_dirs /path/to/output \
    --conversation_template_config /path/to/template.json \
    --max_length 8192 \
    --max_samples 1000

Import

# CLI tool - no Python import needed
# Run via: python prepare_dataset.py --type sft ...

I/O Contract

Inputs

  • --type (str, required): dataset type; one of "sft", "prompt", "preference", or "kto"
  • --data_input_dirs (str, required): comma-separated paths to raw JSONL datasets
  • --tokenizer_dir (str, required): path to a HuggingFace tokenizer directory
  • --data_output_dirs (str, required): output directory for the tokenized Arrow datasets
  • --conversation_template_config (str, required): path to a JSON conversation template config
  • --max_length (int, optional): maximum sequence length (default: 8192)
  • --max_samples (int, optional): maximum samples per dataset (default: unlimited)

Outputs

  • Arrow datasets (directory): tokenized datasets saved via save_to_disk(), containing input_ids, attention_mask, and loss_mask fields

Usage Examples

Basic SFT Data Preparation

# Prepare SFT dataset with LLaMA conversation template
python applications/ColossalChat/examples/data_preparation_scripts/prepare_dataset.py \
    --type sft \
    --data_input_dirs /data/raw/alpaca,/data/raw/sharegpt \
    --tokenizer_dir meta-llama/Llama-2-7b-hf \
    --data_output_dirs /data/tokenized/sft \
    --conversation_template_config conversation_template/llama2.json \
    --max_length 4096
