
Implementation:Hpcaitech ColossalAI Dataset Utils

From Leeroopedia


Knowledge Sources
Domains: Natural Language Processing, Data Processing, RLHF
Last Updated: 2026-02-09 00:00 GMT

Overview

Dataset utility functions for ColossalChat that handle tokenization, padding, truncation, and prompt splitting.

Description

This module collects the utility functions used by the ColossalChat dataset pipeline. The key functions are pad_to_max_len, which pads a list of tensor sequences on the left or right; chuncate_sequence (spelled as in the source), which truncates sequences to a maximum length; tokenize_and_concatenate, which tokenizes the chunks of a multi-turn conversation and records which token spans contribute to the loss; and split_templated_prompt_into_chunks, which breaks a formatted prompt into human and assistant chunks, each annotated with whether it requires loss. The module also provides helpers for JSON loading (jload), distributed rank checking (is_rank_0), and schema-based field extraction from dataset entries (read_string_by_schema).
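
The padding and truncation behavior described above can be illustrated with plain Python lists. This is only a minimal sketch of the idea, not the module's code: truncate_sketch and pad_sketch are hypothetical names, and the real pad_to_max_len and chuncate_sequence operate on lists of torch.Tensor.

```python
from typing import List


def truncate_sketch(sequences: List[List[int]], max_length: int) -> List[List[int]]:
    """Keep at most `max_length` leading tokens of every sequence."""
    return [seq[:max_length] for seq in sequences]


def pad_sketch(
    sequences: List[List[int]], max_length: int, padding_value: int, padding_side: str = "left"
) -> List[List[int]]:
    """Pad every sequence to exactly `max_length` on the chosen side."""
    if padding_side not in ("left", "right"):
        raise ValueError(f"unsupported padding_side: {padding_side!r}")
    padded = []
    for seq in sequences:
        if len(seq) > max_length:
            raise ValueError("sequence longer than max_length; truncate first")
        fill = [padding_value] * (max_length - len(seq))
        padded.append(fill + seq if padding_side == "left" else seq + fill)
    return padded


# Hypothetical token IDs, chosen only for illustration.
batch = truncate_sketch([[1, 2, 3, 4, 5, 6], [7, 8]], max_length=4)
left = pad_sketch(batch, max_length=4, padding_value=0, padding_side="left")
right = pad_sketch(batch, max_length=4, padding_value=0, padding_side="right")
```

Left padding keeps the real tokens flush with the end of each row, which is typically what decoder-only generation needs; right padding is the more common layout for training batches.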

Usage

Use these utilities when building data preprocessing pipelines for ColossalChat SFT or RLHF training. They are particularly important for correctly handling loss masking on assistant-only tokens during supervised fine-tuning.
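
To make assistant-only loss masking concrete, the sketch below splits a templated prompt into chunks and flags only the assistant content for loss. It is an illustration under stated assumptions, not the library's implementation: split_chunks_sketch is a hypothetical name, and the `<user>`/`<assistant>` template with `</s>` as the end-of-turn marker is invented for the example.

```python
from typing import Dict, List, Tuple


def split_chunks_sketch(
    messages: List[Dict[str, str]], prompt: str, end_of_assistant: str
) -> Tuple[List[str], List[bool]]:
    """Split `prompt` into alternating template/content chunks with loss flags."""
    chunks: List[str] = []
    require_loss: List[bool] = []
    cursor = 0
    for message in messages:
        start = prompt.find(message["content"], cursor)
        if start == -1:
            raise ValueError("message content not found in prompt")
        end = start + len(message["content"])
        is_assistant = message["role"].lower() == "assistant"
        # Extend assistant chunks through the end-of-turn marker so the model learns to stop.
        if is_assistant:
            marker = prompt.find(end_of_assistant, end)
            if marker != -1:
                end = marker + len(end_of_assistant)
        # Template text before the content never contributes to the loss.
        chunks.append(prompt[cursor:start])
        require_loss.append(False)
        chunks.append(prompt[start:end])
        require_loss.append(is_assistant)
        cursor = end
    # Whatever trails the last message is also excluded from the loss.
    chunks.append(prompt[cursor:])
    require_loss.append(False)
    return chunks, require_loss


messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
]
prompt = "<user>Hello</s><assistant>Hi there!</s>"  # hypothetical template
chunks, require_loss = split_chunks_sketch(messages, prompt, end_of_assistant="</s>")
```

Joining the chunks reproduces the original prompt, so tokenizing them one by one covers every character exactly once while only the flagged chunks are supervised.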

Code Reference

Source Location

Signature

def is_rank_0() -> bool:

def jload(f, mode="r"):

def read_string_by_schema(data: Dict[str, Any], schema: str) -> str:

def pad_to_max_len(
    sequence: List[torch.Tensor], max_length: int, padding_value: int,
    batch_first: bool = True, padding_side="left"
):

def chuncate_sequence(sequence: List[torch.Tensor], max_length: int, dtype: Any):

def find_first_occurrence_subsequence(seq: torch.Tensor, subseq: torch.Tensor, start_index: int = 0) -> int:

def tokenize_and_concatenate(
    tokenizer: PreTrainedTokenizer,
    text: List[str],
    require_loss: List[bool],
    max_length: int,
    discard_non_loss_tokens_at_tail: bool = True,
):

def split_templated_prompt_into_chunks(
    messages: List[Dict[str, str]], prompt: str, end_of_assistant: str
):
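
The signature of find_first_occurrence_subsequence suggests a forward scan for a token subsequence. A minimal sketch of that idea on plain Python sequences follows; find_first_occurrence_sketch is a hypothetical name, and returning -1 when nothing matches is an assumption rather than documented behavior.

```python
from typing import Sequence


def find_first_occurrence_sketch(seq: Sequence[int], subseq: Sequence[int], start_index: int = 0) -> int:
    """Return the first index >= start_index where `subseq` occurs in `seq`, or -1."""
    n, m = len(seq), len(subseq)
    for i in range(start_index, n - m + 1):
        # Compare the window of length m that starts at position i.
        if list(seq[i : i + m]) == list(subseq):
            return i
    # Assumption: -1 signals "not found".
    return -1


token_ids = [5, 9, 2, 7, 9, 2, 7]  # hypothetical token IDs
first = find_first_occurrence_sketch(token_ids, [9, 2])
second = find_first_occurrence_sketch(token_ids, [9, 2], start_index=2)
missing = find_first_occurrence_sketch(token_ids, [3, 3])
```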

Import

from coati.dataset.utils import (
    is_rank_0, jload, pad_to_max_len, chuncate_sequence,
    tokenize_and_concatenate, split_templated_prompt_into_chunks,
    read_string_by_schema, find_first_occurrence_subsequence,
)

I/O Contract

Inputs (tokenize_and_concatenate)

Name | Type | Required | Description
tokenizer | PreTrainedTokenizer | Yes | The tokenizer to use for tokenization
text | List[str] | Yes | List of text chunks to tokenize
require_loss | List[bool] | Yes | Boolean list indicating which chunks need loss calculation
max_length | int | Yes | Maximum length for truncation
discard_non_loss_tokens_at_tail | bool | No | Whether to discard non-loss tokens at the tail; defaults to True

Outputs (tokenize_and_concatenate)

Name | Type | Description
input_ids | List[int] or None | Concatenated token IDs, or None if the first user query exceeds max_length
loss_starts | List[int] or None | Start positions of loss spans
loss_ends | List[int] or None | End positions of loss spans
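
One common way to consume these outputs is to build a label sequence that is ignored everywhere except inside the loss spans. The sketch below assumes the spans are half-open (each end position is exclusive), which the table above does not state; build_labels_sketch and the token IDs are hypothetical.

```python
from typing import List

IGNORE_INDEX = -100  # label value skipped by torch.nn.CrossEntropyLoss by default


def build_labels_sketch(input_ids: List[int], loss_starts: List[int], loss_ends: List[int]) -> List[int]:
    """Copy token IDs into labels only inside the loss spans; ignore everything else."""
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in zip(loss_starts, loss_ends):
        # Assumption: spans are half-open, i.e. tokens start..end-1 are supervised.
        labels[start:end] = input_ids[start:end]
    return labels


input_ids = [1, 11, 12, 13, 21, 22, 23, 31, 41, 42]  # hypothetical token IDs
labels = build_labels_sketch(input_ids, loss_starts=[4, 8], loss_ends=[7, 10])
```

When tokenize_and_concatenate returns None for all three values, the sample has no usable labels and would normally be skipped before this step.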

Usage Examples

import torch
from coati.dataset.utils import pad_to_max_len, tokenize_and_concatenate, split_templated_prompt_into_chunks

# Pad a batch of sequences to a maximum length (left padding)
sequences = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]
padded = pad_to_max_len(sequences, max_length=5, padding_value=0, padding_side="left")

# Split a templated prompt into chunks for loss masking.
# `tokenizer` (a loaded PreTrainedTokenizer) and `prompt_str` (the messages below
# rendered through the chat template) are not defined here; the caller must supply them.
messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
]
chunks, require_loss = split_templated_prompt_into_chunks(messages, prompt_str, end_of_assistant="</s>")

# Tokenize the chunks; all three results are None if the first user query exceeds max_length.
input_ids, loss_starts, loss_ends = tokenize_and_concatenate(tokenizer, chunks, require_loss, max_length=2048)
