Implementation:Unslothai Unsloth RawTextDataLoader

Knowledge Sources	Unsloth
Domains	NLP, Data_Preprocessing
Last Updated	2026-02-07 00:00 GMT

Overview

Concrete tool, provided by the Unsloth library, for loading raw text documents and chunking them into causal LM training datasets.

Description

The RawTextDataLoader class reads raw text files in the formats listed in SUPPORTED_FORMATS (txt, md, json, jsonl, csv), tokenizes them using a HuggingFace tokenizer, and splits them into overlapping chunks of configurable size. With return_tokenized=True (the default), the output is a HuggingFace Dataset with input_ids and attention_mask columns, ready for causal language model training.
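
The overlapping split described above can be pictured as a sliding window over token ids. The following is a minimal sketch, not the library's implementation: chunk_token_ids is a hypothetical helper, it works on a plain list of integers instead of real tokenizer output, and it assumes that stride counts the tokens shared by consecutive chunks (as the Args documentation states), so each window starts chunk_size - stride tokens after the previous one. It ignores the sentence and paragraph boundaries that smart_chunk_text is documented to respect.

```python
from typing import List


def chunk_token_ids(token_ids: List[int], chunk_size: int, stride: int) -> List[List[int]]:
    """Split token ids into windows of at most chunk_size tokens.

    Assumption: `stride` is the overlap between consecutive windows,
    so the window start advances by chunk_size - stride each time.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= stride < chunk_size:
        raise ValueError("stride must satisfy 0 <= stride < chunk_size")

    step = chunk_size - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # this window already reached the end of the document
    return chunks


# Toy example: 10 "tokens", windows of 4 with an overlap of 2.
chunks = chunk_token_ids(list(range(10)), chunk_size=4, stride=2)
print(chunks)
```

Each window repeats the last two ids of the previous one, which is how overlap carries context across chunk boundaries.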

Usage

Import this class when performing continued pretraining on raw text corpora. It is not needed for conversational or instruction fine-tuning, which uses get_chat_template and standardize_data_formats instead.

Code Reference

Source Location

Repository: unsloth
File: unsloth/dataprep/raw_text.py
Lines: L37-242

Signature

class RawTextDataLoader:
    SUPPORTED_FORMATS = {
        ".txt": "plain_text",
        ".md": "markdown",
        ".json": "json_lines",
        ".jsonl": "json_lines",
        ".csv": "csv_text_column",
    }

    def __init__(
        self,
        tokenizer,
        chunk_size: int = 2048,
        stride: int = 512,
        return_tokenized: bool = True,
    ):
        """
        Args:
            tokenizer: HuggingFace tokenizer for text chunking.
            chunk_size (int): Maximum tokens per chunk. Default 2048.
            stride (int): Overlap tokens between consecutive chunks. Default 512.
            return_tokenized (bool): Return tokenized dicts or raw text. Default True.
        """

    def load_from_file(self, file_path, return_tokenized=None) -> Dataset:
        """Load a single file and return a chunked Dataset."""

    def load_from_files(self, file_paths, return_tokenized=None) -> Dataset:
        """Load multiple files and concatenate into a single Dataset."""

    def smart_chunk_text(self, text, chunk_size, stride, return_tokenized=True):
        """
        Intelligent chunking that:
        1. Respects sentence/paragraph boundaries
        2. Handles various text formats
        3. Maintains context with stride overlap
        4. Returns tokenized chunks directly or text chunks
        """
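
One way to read the SUPPORTED_FORMATS table is as a lookup from file extension to a format label that selects the reader. The sketch below illustrates that idea only. detect_format is a hypothetical helper written for this page, and raising ValueError on an unknown extension is an assumption; the library may handle unknown extensions differently.

```python
from pathlib import Path

# Copied from the class signature above.
SUPPORTED_FORMATS = {
    ".txt": "plain_text",
    ".md": "markdown",
    ".json": "json_lines",
    ".jsonl": "json_lines",
    ".csv": "csv_text_column",
}


def detect_format(file_path: str) -> str:
    """Hypothetical helper: map a file extension to its format label."""
    suffix = Path(file_path).suffix.lower()  # case-insensitive match
    try:
        return SUPPORTED_FORMATS[suffix]
    except KeyError:
        # Assumption: reject unknown extensions instead of guessing a reader.
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None


print(detect_format("chapter1.md"))
print(detect_format("appendix.json"))
```

Note that .json and .jsonl share the json_lines label in the table, so both extensions resolve to the same format.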

Import

from unsloth.dataprep.raw_text import RawTextDataLoader

I/O Contract

Inputs

Name	Type	Required	Description
tokenizer	PreTrainedTokenizer	Yes	HuggingFace tokenizer for chunking
chunk_size	int	No	Maximum tokens per chunk (default: 2048)
stride	int	No	Overlap between consecutive chunks (default: 512)
return_tokenized	bool	No	Return tokenized dict or raw text (default: True)
file_path	str	Yes (for load_from_file)	Path to a raw text file (txt, md, json, jsonl, csv)
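
When choosing chunk_size and stride, it can help to estimate how many chunks a document will produce. The arithmetic below is a rough sketch under two assumptions: stride is the overlap between consecutive chunks, and chunking is a plain sliding window. estimate_num_chunks is a hypothetical helper, and boundary-aware chunking in the library may yield a different count.

```python
import math


def estimate_num_chunks(num_tokens: int, chunk_size: int = 2048, stride: int = 512) -> int:
    """Estimate the chunk count for a plain sliding window.

    Assumption: consecutive chunks share `stride` tokens, so each new
    chunk contributes chunk_size - stride previously unseen tokens.
    """
    if not 0 <= stride < chunk_size:
        raise ValueError("stride must satisfy 0 <= stride < chunk_size")
    if num_tokens <= 0:
        return 0
    if num_tokens <= chunk_size:
        return 1
    step = chunk_size - stride
    # One full first chunk, then one chunk per `step` remaining tokens.
    return 1 + math.ceil((num_tokens - chunk_size) / step)


# A 10,000-token document with the default settings.
print(estimate_num_chunks(10_000))
```

A larger stride increases the overlap and therefore the number of chunks, and with it the number of training tokens, for the same corpus.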

Outputs

Name	Type	Description
dataset	datasets.Dataset	HuggingFace Dataset with input_ids and attention_mask columns when return_tokenized is True; raw text chunks otherwise

Usage Examples

Load a Single Text File

from unsloth.dataprep.raw_text import RawTextDataLoader
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B",
    max_seq_length=2048,
    load_in_4bit=True,
)

loader = RawTextDataLoader(
    tokenizer=tokenizer,
    chunk_size=2048,
    stride=512,
)

dataset = loader.load_from_file("domain_corpus.txt")
# dataset has columns: input_ids, attention_mask

Load Multiple Files

loader = RawTextDataLoader(tokenizer=tokenizer, chunk_size=4096, stride=1024)
dataset = loader.load_from_files([
    "chapter1.md",
    "chapter2.md",
    "appendix.json",
])

Related Pages

Implements Principle

Principle:Unslothai_Unsloth_Raw_Text_Data_Loading

Requires Environment

Environment:Unslothai_Unsloth_Python_Transformers
