Implementation:Hpcaitech ColossalAI Dataset Utils
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Data Processing, RLHF |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Dataset utility functions for ColossalChat that handle tokenization, padding, truncation, and prompt splitting.
Description
This module provides a collection of utility functions used by the ColossalChat dataset pipeline. Key functions include pad_to_max_len for left or right padding of tensor sequences, chuncate_sequence (the spelling used in the source) for truncating sequences to a maximum length, tokenize_and_concatenate for tokenizing multi-turn conversations with selective loss masking, and split_templated_prompt_into_chunks for breaking formatted prompts into human/assistant chunks, each annotated with whether it requires loss. It also provides helpers for JSON loading, distributed rank checking, and schema-based field extraction from dataset entries.
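To make the padding and truncation behaviour concrete, here is a minimal pure-Python sketch that works on plain lists of token IDs instead of tensors. The helpers `pad_ids` and `truncate_ids` are hypothetical names introduced for illustration; they are not part of `coati.dataset.utils`, and the real functions operate on `torch.Tensor` batches.

```python
from typing import List


def pad_ids(seq: List[int], max_length: int, padding_value: int, padding_side: str = "left") -> List[int]:
    """Pad one token-ID list to max_length on the chosen side (hypothetical helper)."""
    if padding_side not in ("left", "right"):
        raise ValueError(f"padding_side must be 'left' or 'right', got {padding_side!r}")
    fill = [padding_value] * max(0, max_length - len(seq))
    return fill + seq if padding_side == "left" else seq + fill


def truncate_ids(seq: List[int], max_length: int) -> List[int]:
    """Keep at most the first max_length tokens (hypothetical helper)."""
    return seq[:max_length]


batch = [[1, 2, 3], [4, 5]]
left = [pad_ids(s, 5, 0, "left") for s in batch]
right = [pad_ids(s, 5, 0, "right") for s in batch]
print(left)   # [[0, 0, 1, 2, 3], [0, 0, 0, 4, 5]]
print(right)  # [[1, 2, 3, 0, 0], [4, 5, 0, 0, 0]]
print(truncate_ids([7, 8, 9, 10], 2))  # [7, 8]
```

Left padding keeps the real tokens at the end of each row, which is the usual choice when a decoder-only model has to continue generating from the last position.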
Usage
Use these utilities when building data preprocessing pipelines for ColossalChat SFT or RLHF training. They are particularly important for correctly handling loss masking on assistant-only tokens during supervised fine-tuning.
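As a rough illustration of assistant-only loss masking, the sketch below converts span boundaries into a label list where every token outside a span is replaced by an ignore value. The helper name `build_labels`, the half-open `[start, end)` span convention, and the value `-100` (the default `ignore_index` of PyTorch's cross-entropy loss) are assumptions made for this example, not something this module is documented to provide.

```python
from typing import List

IGNORE_INDEX = -100  # assumption: matches the default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(input_ids: List[int], loss_starts: List[int], loss_ends: List[int]) -> List[int]:
    """Copy input_ids into labels, keeping only tokens inside [start, end) spans."""
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in zip(loss_starts, loss_ends):
        labels[start:end] = input_ids[start:end]
    return labels


# Toy sequence: IDs 21-25 stand in for assistant tokens, 11-14 for user/template tokens.
input_ids = [11, 12, 13, 21, 22, 14, 23, 24, 25]
labels = build_labels(input_ids, loss_starts=[3, 6], loss_ends=[5, 9])
print(labels)  # [-100, -100, -100, 21, 22, -100, 23, 24, 25]
```

With labels built this way, only the assistant tokens contribute to the training loss, while the user and template tokens still serve as context.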
Code Reference
Source Location
- Repository: Hpcaitech_ColossalAI
- File: applications/ColossalChat/coati/dataset/utils.py
- Lines: 1-170
Signature
def is_rank_0() -> bool:
def jload(f, mode="r"):
def read_string_by_schema(data: Dict[str, Any], schema: str) -> str:
def pad_to_max_len(
    sequence: List[torch.Tensor], max_length: int, padding_value: int,
    batch_first: bool = True, padding_side="left"
):
def chuncate_sequence(sequence: List[torch.Tensor], max_length: int, dtype: Any):
def find_first_occurrence_subsequence(seq: torch.Tensor, subseq: torch.Tensor, start_index: int = 0) -> int:
def tokenize_and_concatenate(
    tokenizer: PreTrainedTokenizer,
    text: List[str],
    require_loss: List[bool],
    max_length: int,
    discard_non_loss_tokens_at_tail: bool = True,
):
def split_templated_prompt_into_chunks(
    messages: List[Dict[str, str]], prompt: str, end_of_assistant: str
):
Import
from coati.dataset.utils import (
    is_rank_0, jload, pad_to_max_len, chuncate_sequence,
    tokenize_and_concatenate, split_templated_prompt_into_chunks,
    read_string_by_schema, find_first_occurrence_subsequence,
)
I/O Contract
Inputs (tokenize_and_concatenate)
| Name | Type | Required | Description |
|---|---|---|---|
| tokenizer | PreTrainedTokenizer | Yes | The tokenizer to use for tokenization |
| text | List[str] | Yes | List of text chunks to tokenize |
| require_loss | List[bool] | Yes | Boolean list indicating which chunks need loss calculation |
| max_length | int | Yes | Maximum length for truncation |
| discard_non_loss_tokens_at_tail | bool | No | Whether to discard non-loss tokens at the tail, defaults to True |
Outputs (tokenize_and_concatenate)
| Name | Type | Description |
|---|---|---|
| input_ids | List[int] or None | Concatenated token IDs, or None if first user query exceeds max_length |
| loss_starts | List[int] or None | Start positions of loss spans |
| loss_ends | List[int] or None | End positions of loss spans |
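The I/O contract above can be pictured with a toy re-creation. This is a simplified sketch, not the library's implementation: `concat_with_spans` and `toy_tokenize` are hypothetical names, the "tokenizer" is a whitespace splitter that emits fake IDs, spans are assumed to be half-open `[start, end)`, and the truncation rule (stop at the first chunk that would overflow `max_length`) is an assumption that may differ from what `tokenize_and_concatenate` actually does.

```python
from typing import Callable, List, Optional, Tuple


def concat_with_spans(
    tokenize: Callable[[str], List[int]],
    text: List[str],
    require_loss: List[bool],
    max_length: int,
) -> Tuple[Optional[List[int]], Optional[List[int]], Optional[List[int]]]:
    """Toy version: concatenate chunk tokens and record [start, end) loss spans."""
    input_ids: List[int] = []
    loss_starts: List[int] = []
    loss_ends: List[int] = []
    for chunk, needs_loss in zip(text, require_loss):
        tokens = tokenize(chunk)
        if len(input_ids) + len(tokens) > max_length:
            break  # simplification: drop this chunk and everything after it
        if needs_loss:
            loss_starts.append(len(input_ids))
            loss_ends.append(len(input_ids) + len(tokens))
        input_ids.extend(tokens)
    if not loss_starts:
        # Nothing trainable fits within max_length, so signal "skip this sample".
        return None, None, None
    # Drop trailing tokens that come after the last loss span.
    return input_ids[: loss_ends[-1]], loss_starts, loss_ends


def toy_tokenize(s: str) -> List[int]:
    """Stand-in tokenizer: one fake ID (the word length) per whitespace-separated word."""
    return [len(w) for w in s.split()]


chunks = ["User: how are you", "fine thanks", "User: bye", "see you soon"]
flags = [False, True, False, True]

ids, starts, ends = concat_with_spans(toy_tokenize, chunks, flags, max_length=16)
print(ids, starts, ends)  # [5, 3, 3, 3, 4, 6, 5, 3, 3, 3, 4] [4, 8] [6, 11]

short = concat_with_spans(toy_tokenize, chunks, flags, max_length=8)
print(short)  # ([5, 3, 3, 3, 4, 6], [4], [6])

too_small = concat_with_spans(toy_tokenize, chunks, flags, max_length=3)
print(too_small)  # (None, None, None)
```

The second call shows the effect of discarding non-loss tokens at the tail: the final user turn fits within the limit but is dropped because no assistant span follows it.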
Usage Examples
import torch

from coati.dataset.utils import pad_to_max_len, tokenize_and_concatenate, split_templated_prompt_into_chunks

# Pad a batch of sequences to a maximum length (left padding)
sequences = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]
padded = pad_to_max_len(sequences, max_length=5, padding_value=0, padding_side="left")

# Split a templated prompt into chunks for loss masking.
# `prompt_str` (the conversation already rendered with the chat template) and
# `tokenizer` (a loaded PreTrainedTokenizer) are assumed to exist; neither is defined here.
messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
]
chunks, require_loss = split_templated_prompt_into_chunks(messages, prompt_str, end_of_assistant="</s>")

# All three return values are None if the first user query exceeds max_length.
input_ids, loss_starts, loss_ends = tokenize_and_concatenate(tokenizer, chunks, require_loss, max_length=2048)
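To show what splitting a templated prompt into chunks with loss flags can look like, here is a simplified standalone sketch. The function `split_by_messages` is a hypothetical name, the `Human:` / `Assistant:` template is an assumed format chosen for the example, and the sketch omits the `end_of_assistant` handling that the real `split_templated_prompt_into_chunks` accepts, so its output is not guaranteed to match the library's.

```python
from typing import Dict, List, Tuple


def split_by_messages(messages: List[Dict[str, str]], prompt: str) -> Tuple[List[str], List[bool]]:
    """Cut prompt into alternating template/content chunks; flag assistant content for loss."""
    chunks: List[str] = []
    require_loss: List[bool] = []
    cursor = 0
    for msg in messages:
        start = prompt.find(msg["content"], cursor)
        if start == -1:
            raise ValueError(f"message content not found in prompt: {msg['content']!r}")
        end = start + len(msg["content"])
        chunks.append(prompt[cursor:start])  # template text before the content
        require_loss.append(False)
        chunks.append(prompt[start:end])     # the message content itself
        require_loss.append(msg["role"].lower() == "assistant")
        cursor = end
    chunks.append(prompt[cursor:])           # any trailing template text
    require_loss.append(False)
    return chunks, require_loss


messages = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there!"},
]
prompt = "Human: Hello\nAssistant: Hi there!</s>"  # assumed template, for illustration only
chunks, require_loss = split_by_messages(messages, prompt)
print(chunks)        # ['Human: ', 'Hello', '\nAssistant: ', 'Hi there!', '</s>']
print(require_loss)  # [False, False, False, True, False]
```

Joining the chunks reproduces the original prompt exactly, so the split is lossless and the flags line up one-to-one with the text passed on for tokenization.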