

Principle:Unslothai Unsloth Raw Text Data Loading

From Leeroopedia
Revision as of 17:26, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Unslothai_Unsloth_Raw_Text_Data_Loading.md)


Knowledge Sources
Domains: NLP, Data_Preprocessing
Last Updated: 2026-02-07 00:00 GMT

Overview

A data ingestion technique that converts raw text documents into tokenized, overlapping chunks suitable for causal language model continued pretraining.

Description

Raw text data loading addresses the need to train language models on raw text documents (plain text, Markdown, JSON, CSV, etc.) rather than pre-formatted conversational datasets. The principle involves reading raw text files, tokenizing them, splitting the token sequence into fixed-size chunks with a configurable overlap (called the stride here), and producing a Hugging Face Dataset with input_ids and attention_mask columns ready for causal LM training.
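As a minimal sketch of the final step, the snippet below turns a list of token-id chunks into the two columns named above. The helper name `build_columns` is hypothetical, the token ids are dummy values, and the all-ones attention mask assumes the chunks are left unpadded; this is an illustration, not Unsloth's implementation.

```python
from typing import Dict, List


def build_columns(chunks: List[List[int]]) -> Dict[str, List[List[int]]]:
    """Turn token-id chunks into `input_ids` / `attention_mask` columns.

    Assumption: chunks are not padded, so every token is attended to (mask = 1).
    """
    return {
        "input_ids": [list(chunk) for chunk in chunks],
        "attention_mask": [[1] * len(chunk) for chunk in chunks],
    }


# Toy token ids standing in for real tokenizer output.
columns = build_columns([[101, 7, 8, 9], [8, 9, 10, 11], [10, 11, 12]])
print(columns["attention_mask"][-1])
```

With the `datasets` library installed, `datasets.Dataset.from_dict(columns)` would wrap this dictionary as a Hugging Face Dataset.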

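The stride overlap between consecutive chunks carries context across chunk boundaries: tokens at the start of a chunk also appear at the end of the preceding chunk, where they are seen with their preceding context. This reduces the effect of abrupt context breaks during training.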

Usage

Use this principle when performing continued pretraining (domain adaptation) on raw text corpora rather than instruction-tuning on conversational data. Typical scenarios include pretraining on domain-specific documents (legal, medical, scientific text) or extending a model's knowledge with custom text collections.

Theoretical Basis

The chunking algorithm applies a sliding window over the tokenized document:

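# Abstract sliding-window chunking; `stride` is the overlap in tokens
def chunk_tokens(tokens, chunk_size, stride, min_chunk_size):
    step = chunk_size - stride  # distance between consecutive window starts
    if step <= 0:
        raise ValueError("stride must be smaller than chunk_size")
    chunks = []
    for i in range(0, len(tokens), step):
        chunk = tokens[i : i + chunk_size]
        if len(chunk) >= min_chunk_size:  # skip very short tail chunks
            chunks.append(chunk)
    return chunks

# `tokenizer`, `document_text`, and the size parameters are supplied by the caller
tokens = tokenizer.encode(document_text)
chunks = chunk_tokens(tokens, chunk_size, stride, min_chunk_size)
# Each full-size chunk overlaps with the previous one by `stride` tokens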
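To make the window arithmetic concrete, here is a small worked example on ten dummy token ids, assuming `chunk_size=4`, `stride=2`, and `min_chunk_size=3` (values chosen purely for illustration). The function name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(
    tokens: List[int], chunk_size: int, stride: int, min_chunk_size: int
) -> List[List[int]]:
    """Split `tokens` into windows of `chunk_size` that overlap by `stride`."""
    step = chunk_size - stride  # how far each window start advances
    if step <= 0:
        raise ValueError("stride must be smaller than chunk_size")
    chunks = []
    for start in range(0, len(tokens), step):
        chunk = tokens[start : start + chunk_size]
        if len(chunk) >= min_chunk_size:  # drop very short tail windows
            chunks.append(chunk)
    return chunks


tokens = list(range(10))  # dummy token ids 0..9
chunks = sliding_window_chunks(tokens, chunk_size=4, stride=2, min_chunk_size=3)
for chunk in chunks:
    print(chunk)
# [0, 1, 2, 3]
# [2, 3, 4, 5]
# [4, 5, 6, 7]
# [6, 7, 8, 9]
```

The window starting at index 8 would contain only [8, 9]. It falls below `min_chunk_size` and is dropped, and both of its tokens already appear in the preceding chunk.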

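The overlap ensures that every token position p in the original document appears in at least one chunk with at least min(p, stride) preceding tokens of context, provided no chunk containing it is dropped by the min_chunk_size filter. Without overlap, tokens at the start of every chunk after the first would be trained on with little or no preceding context.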
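The context claim above can be checked empirically. The sketch below, using hypothetical helper names and dummy sizes, computes for each position the largest amount of preceding context it receives in any window and compares it with `min(position, stride)`. It assumes no window is filtered out by a minimum-length threshold.

```python
from typing import List, Tuple


def window_spans(n_tokens: int, chunk_size: int, stride: int) -> List[Tuple[int, int]]:
    """Return (start, end) spans of windows that overlap by `stride` tokens."""
    step = chunk_size - stride
    if step <= 0:
        raise ValueError("stride must be smaller than chunk_size")
    return [(s, min(s + chunk_size, n_tokens)) for s in range(0, n_tokens, step)]


def best_context(n_tokens: int, chunk_size: int, stride: int) -> List[int]:
    """For each position, the most preceding tokens it has in any window."""
    best = [0] * n_tokens
    for start, end in window_spans(n_tokens, chunk_size, stride):
        for pos in range(start, end):
            best[pos] = max(best[pos], pos - start)
    return best


context = best_context(n_tokens=50, chunk_size=8, stride=3)
# Every position sees at least min(position, stride) preceding tokens.
assert all(c >= min(pos, 3) for pos, c in enumerate(context))
print(min(context[8:]))
```

Positions inside the first window can only have as much context as the document offers before them, which is why the bound is `min(position, stride)` rather than `stride` alone.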

Related Pages

Implemented By
