Implementation:Hpcaitech ColossalAI Dataset Utils

Knowledge Sources	Hpcaitech_ColossalAI
Domains	Natural Language Processing, Data Processing, RLHF
Last Updated	2026-02-09 00:00 GMT

Overview

Dataset utility functions for ColossalChat that handle tokenization, padding, truncation, and prompt splitting.

Description

This module collects the utility functions used by the ColossalChat dataset pipeline. The key functions are pad_to_max_len, which pads a list of tensor sequences on the left or right up to max_length; chuncate_sequence (spelled as in the signature), which truncates each sequence to a maximum length; tokenize_and_concatenate, which tokenizes the chunks of a multi-turn conversation and records which token spans require loss; and split_templated_prompt_into_chunks, which breaks a formatted prompt into human/assistant chunks annotated with whether each chunk requires loss. The module also provides helpers for loading JSON (jload), checking the distributed rank (is_rank_0), extracting fields from dataset entries by schema (read_string_by_schema), and locating a subsequence within a token tensor (find_first_occurrence_subsequence).
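The padding and truncation behaviour described above can be illustrated with a minimal pure-Python sketch. It works on plain lists rather than torch tensors, and the helper names pad_sequences and truncate_sequences are hypothetical stand-ins, not functions from the module; the real pad_to_max_len and chuncate_sequence may differ in detail.

```python
from typing import List


def pad_sequences(
    sequences: List[List[int]], max_length: int, padding_value: int, padding_side: str = "left"
) -> List[List[int]]:
    """Pad every sequence to max_length on the chosen side (illustrative only)."""
    if padding_side not in ("left", "right"):
        raise ValueError(f"padding_side must be 'left' or 'right', got {padding_side!r}")
    padded = []
    for seq in sequences:
        # A sequence longer than max_length gets no padding and is left as-is.
        pad = [padding_value] * (max_length - len(seq))
        padded.append(pad + list(seq) if padding_side == "left" else list(seq) + pad)
    return padded


def truncate_sequences(sequences: List[List[int]], max_length: int) -> List[List[int]]:
    """Cut every sequence down to at most max_length items (illustrative only)."""
    return [list(seq[:max_length]) for seq in sequences]


left = pad_sequences([[1, 2, 3], [4, 5]], max_length=5, padding_value=0, padding_side="left")
right = pad_sequences([[1, 2, 3], [4, 5]], max_length=5, padding_value=0, padding_side="right")
short = truncate_sequences([[1, 2, 3], [4]], max_length=2)
```

Left padding keeps the real tokens adjacent to the end of the sequence, which is the usual choice when a decoder-only model generates a continuation from a batch of prompts.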

Usage

Use these utilities when building data preprocessing pipelines for ColossalChat SFT or RLHF training. In supervised fine-tuning they determine which token spans contribute to the loss, so that only assistant tokens are trained on.

Code Reference

Source Location

Repository: Hpcaitech_ColossalAI
File: applications/ColossalChat/coati/dataset/utils.py
Lines: 1-170

Signature

def is_rank_0() -> bool:

def jload(f, mode="r"):

def read_string_by_schema(data: Dict[str, Any], schema: str) -> str:

def pad_to_max_len(
    sequence: List[torch.Tensor], max_length: int, padding_value: int,
    batch_first: bool = True, padding_side="left"
):

def chuncate_sequence(sequence: List[torch.Tensor], max_length: int, dtype: Any):

def find_first_occurrence_subsequence(seq: torch.Tensor, subseq: torch.Tensor, start_index: int = 0) -> int:

def tokenize_and_concatenate(
    tokenizer: PreTrainedTokenizer,
    text: List[str],
    require_loss: List[bool],
    max_length: int,
    discard_non_loss_tokens_at_tail: bool = True,
):

def split_templated_prompt_into_chunks(
    messages: List[Dict[str, str]], prompt: str, end_of_assistant: str
):
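As an illustration of what the find_first_occurrence_subsequence signature suggests, here is a minimal sketch of a first-occurrence subsequence search over plain Python lists. The name find_first_occurrence is hypothetical, and returning -1 when nothing matches is an assumption of this sketch, not a documented property of the library function.

```python
from typing import List


def find_first_occurrence(seq: List[int], subseq: List[int], start_index: int = 0) -> int:
    """Return the first index >= start_index where subseq occurs in seq.

    Assumption of this sketch: -1 signals that no match was found.
    """
    n, m = len(seq), len(subseq)
    for i in range(start_index, n - m + 1):
        # Compare the window of length m that starts at position i.
        if seq[i : i + m] == subseq:
            return i
    return -1


tokens = [5, 1, 2, 3, 1, 2]
first = find_first_occurrence(tokens, [1, 2])
second = find_first_occurrence(tokens, [1, 2], start_index=2)
missing = find_first_occurrence(tokens, [9])
```

A search of this kind is useful for locating a marker such as the end-of-assistant token sequence inside an already tokenized prompt.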

Import

from coati.dataset.utils import (
    is_rank_0, jload, pad_to_max_len, chuncate_sequence,
    tokenize_and_concatenate, split_templated_prompt_into_chunks,
    read_string_by_schema, find_first_occurrence_subsequence,
)

I/O Contract

Inputs (tokenize_and_concatenate)

Name	Type	Required	Description
tokenizer	PreTrainedTokenizer	Yes	The tokenizer to use for tokenization
text	List[str]	Yes	List of text chunks to tokenize
require_loss	List[bool]	Yes	Boolean list indicating which chunks need loss calculation
max_length	int	Yes	Maximum length for truncation
discard_non_loss_tokens_at_tail	bool	No	Whether to discard non-loss tokens at the tail, defaults to True
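The way these inputs fit together can be sketched with a simplified stand-in. The function concat_with_loss_spans and the character-level toy tokenizer are hypothetical; the overflow rule (stop at the first chunk that does not fit) and the unconditional trimming of trailing non-loss tokens are assumptions of this sketch, so the real tokenize_and_concatenate may truncate differently.

```python
from typing import Callable, List, Optional, Tuple

Spans = Tuple[Optional[List[int]], Optional[List[int]], Optional[List[int]]]


def concat_with_loss_spans(
    tokenize: Callable[[str], List[int]],
    chunks: List[str],
    require_loss: List[bool],
    max_length: int,
) -> Spans:
    """Concatenate tokenized chunks and record half-open [start, end) loss spans."""
    input_ids: List[int] = []
    loss_starts: List[int] = []
    loss_ends: List[int] = []
    for chunk, needs_loss in zip(chunks, require_loss):
        ids = tokenize(chunk)
        if len(input_ids) + len(ids) > max_length:
            break  # simplification: stop at the first chunk that would overflow
        if needs_loss:
            loss_starts.append(len(input_ids))
            loss_ends.append(len(input_ids) + len(ids))
        input_ids.extend(ids)
    if not loss_starts:
        # No chunk that requires loss fits within max_length.
        return None, None, None
    # Drop trailing tokens that follow the last loss span.
    return input_ids[: loss_ends[-1]], loss_starts, loss_ends


def toy_tokenize(text: str) -> List[int]:
    """Toy tokenizer: one token per character (not a real tokenizer)."""
    return [ord(c) for c in text]


ids, starts, ends = concat_with_loss_spans(toy_tokenize, ["ab", "cde", "f"], [False, True, False], 10)
too_short = concat_with_loss_spans(toy_tokenize, ["ab", "cde"], [False, True], 1)
```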

Outputs (tokenize_and_concatenate)

Name	Type	Description
input_ids	List[int] or None	Concatenated token IDs, or None if first user query exceeds max_length
loss_starts	List[int] or None	Start positions of loss spans
loss_ends	List[int] or None	End positions of loss spans
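One common way to consume these outputs is to build a label sequence in which every position outside a loss span is set to an ignore index. The sketch below assumes the spans are half-open, i.e. loss_ends is exclusive, which this page does not state; build_labels is a hypothetical helper, and -100 is the default ignore_index of PyTorch's cross-entropy loss.

```python
from typing import List

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(
    input_ids: List[int],
    loss_starts: List[int],
    loss_ends: List[int],
    ignore_index: int = IGNORE_INDEX,
) -> List[int]:
    """Copy token IDs inside [start, end) spans and mask everything else."""
    labels = [ignore_index] * len(input_ids)
    for start, end in zip(loss_starts, loss_ends):
        # Assumption: end is exclusive, so the span covers positions start..end-1.
        labels[start:end] = input_ids[start:end]
    return labels


labels = build_labels([10, 11, 12, 13, 14], loss_starts=[2], loss_ends=[4])
```

With labels built this way, only the tokens in assistant spans contribute to the training loss, while prompt and template tokens are ignored.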

Usage Examples

import torch

from coati.dataset.utils import pad_to_max_len, tokenize_and_concatenate, split_templated_prompt_into_chunks

# Pad a batch of sequences to a maximum length (left padding)
sequences = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]
padded = pad_to_max_len(sequences, max_length=5, padding_value=0, padding_side="left")

# Split a templated prompt into chunks for loss masking.
# Assumed to exist already (not defined in this snippet):
#   tokenizer  - a loaded PreTrainedTokenizer
#   prompt_str - the messages below rendered into one string by the chat template
messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
]
chunks, require_loss = split_templated_prompt_into_chunks(messages, prompt_str, end_of_assistant="</s>")

# Tokenize the chunks and record the loss spans.
# All three values are None if the first user query exceeds max_length.
input_ids, loss_starts, loss_ends = tokenize_and_concatenate(tokenizer, chunks, require_loss, max_length=2048)

Related Pages

Environment:Hpcaitech_ColossalAI_CUDA_GPU_Environment
