Implementation:Unslothai Unsloth RawTextDataLoader

From Leeroopedia
Revision as of 17:02, 16 February 2026 by Admin (auto-imported from implementations/Unslothai_Unsloth_RawTextDataLoader.md)


Knowledge Sources
Domains: NLP, Data_Preprocessing
Last Updated: 2026-02-07 00:00 GMT

Overview

A concrete tool, provided by the Unsloth library, for loading raw text documents and chunking them into causal LM training datasets.

Description

The RawTextDataLoader class reads raw text files in the formats registered in SUPPORTED_FORMATS (txt, md, json, jsonl, csv), tokenizes them with a HuggingFace tokenizer, and splits them into overlapping chunks of configurable size. With return_tokenized=True (the default), the output is a HuggingFace Dataset with input_ids and attention_mask columns, ready for causal language model training.
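
The overlapping-chunk behaviour can be illustrated with a minimal sketch over a plain list of token ids. It assumes that consecutive windows share `stride` tokens, so each window starts `chunk_size - stride` tokens after the previous one. The function name `chunk_token_ids` is hypothetical and not part of Unsloth, and the library's own `smart_chunk_text` may additionally adjust boundaries to sentences or paragraphs.

```python
from typing import List


def chunk_token_ids(token_ids: List[int], chunk_size: int, stride: int) -> List[List[int]]:
    """Split token_ids into windows of at most chunk_size tokens.

    Assumption: consecutive windows share `stride` tokens, so each window
    starts `chunk_size - stride` tokens after the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= stride < chunk_size:
        raise ValueError("stride must satisfy 0 <= stride < chunk_size")

    step = chunk_size - stride
    chunks: List[List[int]] = []
    start = 0
    while start < len(token_ids):
        end = min(start + chunk_size, len(token_ids))
        chunks.append(token_ids[start:end])
        if end == len(token_ids):
            break  # the last window reached the end of the document
        start += step
    return chunks


# Toy example: 10 "tokens", windows of 4 with an overlap of 2.
example_chunks = chunk_token_ids(list(range(10)), chunk_size=4, stride=2)
print(example_chunks)
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

In the toy run each chunk repeats the last two tokens of the previous one, which is the context-preserving overlap that the `stride` parameter controls.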

Usage

Import this class when performing continued pretraining on raw text corpora. It is not needed for conversational or instruction fine-tuning, which uses get_chat_template and standardize_data_formats instead.

Code Reference

Source Location

  • Repository: unsloth
  • File: unsloth/dataprep/raw_text.py
  • Lines: L37-242

Signature

class RawTextDataLoader:
    SUPPORTED_FORMATS = {
        ".txt": "plain_text",
        ".md": "markdown",
        ".json": "json_lines",
        ".jsonl": "json_lines",
        ".csv": "csv_text_column",
    }

    def __init__(
        self,
        tokenizer,
        chunk_size: int = 2048,
        stride: int = 512,
        return_tokenized: bool = True,
    ):
        """
        Args:
            tokenizer: HuggingFace tokenizer for text chunking.
            chunk_size (int): Maximum tokens per chunk. Default 2048.
            stride (int): Overlap tokens between consecutive chunks. Default 512.
            return_tokenized (bool): Return tokenized dicts or raw text. Default True.
        """

    def load_from_file(self, file_path, return_tokenized=None) -> Dataset:
        """Load a single file and return a chunked Dataset."""

    def load_from_files(self, file_paths, return_tokenized=None) -> Dataset:
        """Load multiple files and concatenate into a single Dataset."""

    def smart_chunk_text(self, text, chunk_size, stride, return_tokenized=True):
        """
        Intelligent chunking that:
        1. Respects sentence/paragraph boundaries
        2. Handles various text formats
        3. Maintains context with stride overlap
        4. Returns tokenized chunks directly or text chunks
        """

Import

from unsloth.dataprep.raw_text import RawTextDataLoader
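
The SUPPORTED_FORMATS mapping in the signature suggests that a file's extension selects how it is parsed. The sketch below shows one way such a lookup could work. The helper `detect_format` is hypothetical, and both the lowercasing of the extension and the `ValueError` for unknown extensions are assumptions; the page does not show how the class itself handles these cases.

```python
from pathlib import Path

# Copy of the mapping shown in the class signature.
SUPPORTED_FORMATS = {
    ".txt": "plain_text",
    ".md": "markdown",
    ".json": "json_lines",
    ".jsonl": "json_lines",
    ".csv": "csv_text_column",
}


def detect_format(file_path: str) -> str:
    """Return the format label for file_path (hypothetical helper)."""
    suffix = Path(file_path).suffix.lower()  # assumption: case-insensitive match
    try:
        return SUPPORTED_FORMATS[suffix]
    except KeyError:
        # Assumption: unsupported extensions are rejected rather than guessed.
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None


print(detect_format("chapter1.md"))    # markdown
print(detect_format("appendix.json"))  # json_lines
```

Note that, per the mapping, both .json and .jsonl resolve to the same "json_lines" label.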

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| tokenizer | PreTrainedTokenizer | Yes | HuggingFace tokenizer for chunking |
| chunk_size | int | No | Maximum tokens per chunk (default: 2048) |
| stride | int | No | Overlap between consecutive chunks (default: 512) |
| return_tokenized | bool | No | Return tokenized dict or raw text (default: True) |
| file_path | str | Yes (for load) | Path to raw text file (txt, md, json, jsonl, csv) |
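
To get a feel for how chunk_size and stride interact, the following sketch estimates how many chunks a document of a given token length would produce. It assumes the window advances by `chunk_size - stride` tokens per chunk and that a final, shorter chunk is kept; `estimate_num_chunks` is a hypothetical helper, and the real count may differ if the loader adjusts boundaries or drops short tails.

```python
def estimate_num_chunks(num_tokens: int, chunk_size: int = 2048, stride: int = 512) -> int:
    """Estimate the chunk count for a document of num_tokens tokens.

    Assumption: each chunk starts `chunk_size - stride` tokens after the
    previous one, and a shorter final chunk is kept.
    """
    if not 0 <= stride < chunk_size:
        raise ValueError("stride must satisfy 0 <= stride < chunk_size")
    if num_tokens <= 0:
        return 0
    if num_tokens <= chunk_size:
        return 1
    step = chunk_size - stride
    remaining = num_tokens - chunk_size
    # Integer ceiling division: number of extra windows after the first.
    return 1 + (remaining + step - 1) // step


# With the defaults, windows start at 0, 1536, 3072, ... tokens.
print(estimate_num_chunks(10_000))  # 7
```

A larger stride increases overlap and therefore the number of chunks (and training tokens) produced from the same corpus.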

Outputs

| Name | Type | Description |
|------|------|-------------|
| dataset | datasets.Dataset | HuggingFace Dataset with input_ids and attention_mask columns |
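
Before training, it can be useful to sanity-check the tokenized rows. The sketch below validates rows given as plain dicts. The helper `check_chunk_rows` is hypothetical, and the rule that no row exceeds chunk_size is an assumption based on the "maximum tokens per chunk" description; it may need relaxing if the loader appends special tokens.

```python
from typing import Dict, Iterable, List


def check_chunk_rows(rows: Iterable[Dict[str, List[int]]], chunk_size: int) -> int:
    """Validate tokenized rows and return how many were checked."""
    count = 0
    for i, row in enumerate(rows):
        ids, mask = row["input_ids"], row["attention_mask"]
        if len(ids) != len(mask):
            raise ValueError(f"row {i}: input_ids and attention_mask differ in length")
        if len(ids) > chunk_size:
            # Assumption: chunk_size is a hard upper bound on row length.
            raise ValueError(f"row {i}: {len(ids)} tokens exceeds chunk_size={chunk_size}")
        count += 1
    return count


# Stand-in rows with the two documented columns.
sample_rows = [
    {"input_ids": [1, 2, 3, 4], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [3, 4, 5], "attention_mask": [1, 1, 1]},
]
print(check_chunk_rows(sample_rows, chunk_size=4))  # 2
```

Iterating over a datasets.Dataset yields one dict per row, so the loader's output could be passed to the same function, assuming the columns hold plain lists of integers.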

Usage Examples

Load a Single Text File

from unsloth.dataprep.raw_text import RawTextDataLoader
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B",
    max_seq_length=2048,
    load_in_4bit=True,
)

loader = RawTextDataLoader(
    tokenizer=tokenizer,
    chunk_size=2048,
    stride=512,
)

dataset = loader.load_from_file("domain_corpus.txt")
# dataset has columns: input_ids, attention_mask

Load Multiple Files

# Reuses the tokenizer loaded in the previous example.
loader = RawTextDataLoader(tokenizer=tokenizer, chunk_size=4096, stride=1024)
dataset = loader.load_from_files([
    "chapter1.md",
    "chapter2.md",
    "appendix.json",
])

Related Pages

Implements Principle

Requires Environment
