Implementation:Unslothai Unsloth RawTextDataLoader
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Preprocessing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete tool, provided by the Unsloth library, for loading raw text documents and chunking them into causal LM training datasets.
Description
The RawTextDataLoader class reads raw text files in the formats listed in SUPPORTED_FORMATS (txt, md, json, jsonl, csv), tokenizes them using a HuggingFace tokenizer, and splits them into overlapping chunks of configurable size. With the default return_tokenized=True, the output is a HuggingFace Dataset with input_ids and attention_mask columns, ready for causal language model training.
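The overlapping-chunk idea can be illustrated with a minimal sketch. This is not the library's implementation: `chunk_token_ids` is a hypothetical helper, and it assumes `stride` means the number of tokens shared by consecutive chunks (as the argument description states), so the window start advances by `chunk_size - stride`.

```python
from typing import Dict, List


def chunk_token_ids(
    token_ids: List[int], chunk_size: int = 2048, stride: int = 512
) -> List[Dict[str, List[int]]]:
    """Split token IDs into windows that overlap by `stride` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= stride < chunk_size:
        raise ValueError("stride must satisfy 0 <= stride < chunk_size")

    step = chunk_size - stride  # how far the window start advances each time
    chunks = []
    start = 0
    while start < len(token_ids):
        window = token_ids[start : start + chunk_size]
        # Every token in a window is real (no padding), so the mask is all ones.
        chunks.append({"input_ids": window, "attention_mask": [1] * len(window)})
        if start + chunk_size >= len(token_ids):
            break  # this window already reached the end of the document
        start += step
    return chunks


chunks = chunk_token_ids(list(range(10)), chunk_size=4, stride=1)
print([c["input_ids"] for c in chunks])
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The last token of each window reappears as the first token of the next, which is what lets the model see some preceding context at every chunk boundary.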
Usage
Import this class when performing continued pretraining on raw text corpora. It is not needed for conversational or instruction fine-tuning, which uses get_chat_template and standardize_data_formats instead.
Code Reference
Source Location
- Repository: unsloth
- File: unsloth/dataprep/raw_text.py
- Lines: L37-242
Signature
class RawTextDataLoader:
    SUPPORTED_FORMATS = {
        ".txt": "plain_text",
        ".md": "markdown",
        ".json": "json_lines",
        ".jsonl": "json_lines",
        ".csv": "csv_text_column",
    }

    def __init__(
        self,
        tokenizer,
        chunk_size: int = 2048,
        stride: int = 512,
        return_tokenized: bool = True,
    ):
        """
        Args:
            tokenizer: HuggingFace tokenizer for text chunking.
            chunk_size (int): Maximum tokens per chunk. Default 2048.
            stride (int): Overlap tokens between consecutive chunks. Default 512.
            return_tokenized (bool): Return tokenized dicts or raw text. Default True.
        """

    def load_from_file(self, file_path, return_tokenized=None) -> Dataset:
        """Load a single file and return a chunked Dataset."""

    def load_from_files(self, file_paths, return_tokenized=None) -> Dataset:
        """Load multiple files and concatenate into a single Dataset."""

    def smart_chunk_text(self, text, chunk_size, stride, return_tokenized=True):
        """
        Intelligent chunking that:
        1. Respects sentence/paragraph boundaries
        2. Handles various text formats
        3. Maintains context with stride overlap
        4. Returns tokenized chunks directly or text chunks
        """
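The signature does not show how SUPPORTED_FORMATS is consulted. One plausible reading is a lookup keyed on the lowercase file extension, sketched below. `detect_format` is a hypothetical helper written for illustration, not a method of the class, and the mapping is copied from the signature above.

```python
from pathlib import Path

# Copied from the signature above; the loader's own mapping may differ by version.
SUPPORTED_FORMATS = {
    ".txt": "plain_text",
    ".md": "markdown",
    ".json": "json_lines",
    ".jsonl": "json_lines",
    ".csv": "csv_text_column",
}


def detect_format(file_path: str) -> str:
    """Hypothetical helper: return the format label for a path, or raise."""
    suffix = Path(file_path).suffix.lower()  # ".MD" and ".md" are treated alike
    try:
        return SUPPORTED_FORMATS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None


print(detect_format("notes.MD"))           # markdown
print(detect_format("data/corpus.jsonl"))  # json_lines
```

Note that in this mapping both .json and .jsonl resolve to the same "json_lines" label.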
Import
from unsloth.dataprep.raw_text import RawTextDataLoader
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tokenizer | PreTrainedTokenizer | Yes | HuggingFace tokenizer for chunking |
| chunk_size | int | No | Maximum tokens per chunk (default: 2048) |
| stride | int | No | Overlap between consecutive chunks (default: 512) |
| return_tokenized | bool | No | Return tokenized dict or raw text (default: True) |
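| file_path | str | Yes (for load_from_file) | Path to a raw text file (txt, md, json, jsonl, csv) |
| file_paths | list of str | Yes (for load_from_files) | Paths to the raw text files to load and concatenate |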
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | datasets.Dataset | HuggingFace Dataset with input_ids and attention_mask columns |
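A quick sanity check on the output contract can catch configuration mistakes before training. The sketch below uses a hypothetical helper, `check_rows`, that accepts any iterable of row dicts. Iterating a datasets.Dataset yields one dict per row, so the returned dataset could be passed in directly; plain dicts are used here to keep the example self-contained.

```python
from typing import Dict, Iterable, List


def check_rows(rows: Iterable[Dict[str, List[int]]], chunk_size: int) -> int:
    """Validate each row against the I/O contract; return the number checked."""
    count = 0
    for i, row in enumerate(rows):
        ids, mask = row["input_ids"], row["attention_mask"]
        if len(ids) != len(mask):
            raise ValueError(f"row {i}: input_ids and attention_mask differ in length")
        if len(ids) > chunk_size:
            raise ValueError(f"row {i}: {len(ids)} tokens exceeds chunk_size={chunk_size}")
        count += 1
    return count


# Stand-in rows shaped like the documented output columns.
rows = [
    {"input_ids": [101, 7, 8, 9], "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [9, 10], "attention_mask": [1, 1]},
]
print(check_rows(rows, chunk_size=4))  # 2
```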
Usage Examples
Load a Single Text File
from unsloth.dataprep.raw_text import RawTextDataLoader
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B",
    max_seq_length=2048,
    load_in_4bit=True,
)

loader = RawTextDataLoader(
    tokenizer=tokenizer,
    chunk_size=2048,
    stride=512,
)

dataset = loader.load_from_file("domain_corpus.txt")
# dataset has columns: input_ids, attention_mask
Load Multiple Files
# Reuses `tokenizer` from the single-file example above.
loader = RawTextDataLoader(tokenizer=tokenizer, chunk_size=4096, stride=1024)

dataset = loader.load_from_files([
    "chapter1.md",
    "chapter2.md",
    "appendix.json",
])
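When choosing chunk_size and stride, it can help to estimate how many chunks a document will produce. The sketch below assumes a plain sliding window in which stride is the overlap between consecutive chunks. The loader's boundary-aware chunking may yield a somewhat different count, so treat the result as a rough estimate. `estimate_chunks` is a hypothetical helper.

```python
def estimate_chunks(num_tokens: int, chunk_size: int, stride: int) -> int:
    """Estimate the chunk count for a sliding window with `stride` tokens of overlap."""
    if not 0 <= stride < chunk_size:
        raise ValueError("stride must satisfy 0 <= stride < chunk_size")
    if num_tokens <= 0:
        return 0
    if num_tokens <= chunk_size:
        return 1
    step = chunk_size - stride  # new tokens contributed by each extra chunk
    remaining = num_tokens - chunk_size
    return 1 + -(-remaining // step)  # ceiling division without floats


# A 10,000-token document with the settings from the example above.
print(estimate_chunks(10_000, chunk_size=4096, stride=1024))  # 3
```

A larger stride repeats more context across chunks but also increases the total number of training tokens, since the overlapping tokens are seen more than once.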