Principle:Unslothai Unsloth Raw Text Data Loading
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Preprocessing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A data ingestion technique that converts raw text documents into tokenized, overlapping chunks suitable for causal language model continued pretraining.
Description
Raw text data loading addresses the need to train language models on unstructured text documents (plain text, markdown, JSON, CSV, etc.) rather than pre-formatted conversational datasets. The principle involves reading raw text files, tokenizing their contents, splitting the resulting token sequence into fixed-size chunks with a configurable overlap (here called the stride), and producing a HuggingFace Dataset with input_ids and attention_mask columns ready for causal LM training.
Because consecutive chunks share `stride` tokens, the text at the end of one chunk reappears at the start of the next. This preserves continuity across chunk boundaries, so content near a boundary is not seen only as a truncated fragment during training.
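As an illustration of the output format described above, the following is a minimal sketch that turns already-chunked token IDs into the two named columns. The helper name `chunks_to_columns` and the toy token IDs are hypothetical, and the all-ones attention mask rests on the assumption that chunks are not padded.

```python
from typing import Dict, List


def chunks_to_columns(chunks: List[List[int]]) -> Dict[str, List[List[int]]]:
    """Build input_ids and attention_mask columns from unpadded token-ID chunks."""
    return {
        "input_ids": [list(chunk) for chunk in chunks],
        # Assumption: no padding is applied, so every token is attended to.
        "attention_mask": [[1] * len(chunk) for chunk in chunks],
    }


# Toy token IDs standing in for real tokenizer output (illustrative only).
example_chunks = [[101, 7, 8, 9], [8, 9, 10, 11], [10, 11, 12]]
columns = chunks_to_columns(example_chunks)
```

A mapping of this shape can be passed to `datasets.Dataset.from_dict` to obtain a HuggingFace Dataset.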
Usage
Use this principle when performing continued pretraining (domain adaptation) on raw text corpora rather than instruction-tuning on conversational data. Typical scenarios include pretraining on domain-specific documents (legal, medical, scientific text) or extending a model's knowledge with custom text collections.
Theoretical Basis
The chunking algorithm applies a sliding window of `chunk_size` tokens over the tokenized document, advancing by `chunk_size - stride` tokens at each step:
    # Abstract sliding window chunking
    # Assumes stride < chunk_size, so the step below is positive
    tokens = tokenizer.encode(document_text)
    chunks = []
    for i in range(0, len(tokens), chunk_size - stride):
        chunk = tokens[i : i + chunk_size]
        # Drop chunks (typically trailing ones) shorter than min_chunk_size
        if len(chunk) >= min_chunk_size:
            chunks.append(chunk)
    # Consecutive full-length chunks overlap by `stride` tokens
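The loop above can be exercised on plain integers without a tokenizer. The following is a sketch only: the function name `sliding_window_chunks` and the parameter values are hypothetical, and the check that `stride` is smaller than `chunk_size` is an added assumption, since a non-positive step would otherwise make `range` fail or yield nothing.

```python
from typing import List


def sliding_window_chunks(
    tokens: List[int], chunk_size: int, stride: int, min_chunk_size: int
) -> List[List[int]]:
    """Split tokens into windows of chunk_size that overlap by stride tokens."""
    step = chunk_size - stride
    if step <= 0:
        # Added guard (assumption): the overlap must be smaller than the window.
        raise ValueError("stride must be smaller than chunk_size")
    chunks = []
    for i in range(0, len(tokens), step):
        chunk = tokens[i : i + chunk_size]
        if len(chunk) >= min_chunk_size:
            chunks.append(chunk)
    return chunks


# Integers 0..9 stand in for token IDs.
demo_chunks = sliding_window_chunks(
    list(range(10)), chunk_size=4, stride=2, min_chunk_size=3
)
print(demo_chunks)
# → [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

The window starting at index 8 holds only two tokens, so the `min_chunk_size` filter drops it. Its tokens are still covered by the preceding chunk.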
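The overlap ensures that every token beyond the first chunk appears in at least one chunk with at least `stride` tokens of preceding context, provided that chunk is not discarded by the `min_chunk_size` filter. Tokens near a chunk boundary therefore contribute next-token predictions that are conditioned on real preceding text rather than on a bare chunk start.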
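The context claim can be checked numerically. The sketch below assumes no chunk is filtered out (equivalent to `min_chunk_size = 1`) and uses hypothetical helper names and sizes. For each position it finds the largest amount of preceding context available in any chunk that contains it, then confirms the value is never below `min(position, stride)`.

```python
def max_preceding_context(
    position: int, n_tokens: int, chunk_size: int, stride: int
) -> int:
    """Largest number of tokens preceding `position` within any chunk containing it."""
    step = chunk_size - stride
    if step <= 0:
        raise ValueError("stride must be smaller than chunk_size")
    best = -1  # stays -1 only if no chunk contains the position
    for start in range(0, n_tokens, step):
        end = min(start + chunk_size, n_tokens)
        if start <= position < end:
            best = max(best, position - start)
    return best


# Illustrative sizes, not taken from any real configuration.
n_tokens, chunk_size, stride = 50, 8, 3
contexts = [
    max_preceding_context(p, n_tokens, chunk_size, stride) for p in range(n_tokens)
]

# Every position is covered, and positions past the first `stride` tokens
# always have at least `stride` tokens of context in some chunk.
assert all(c >= min(p, stride) for p, c in enumerate(contexts))
```

Positions inside the first chunk are bounded only by how far they sit from the start of the document, which is why the bound is `min(position, stride)` rather than `stride` alone.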