Implementation:Huggingface Datasets Text Builder

Knowledge Sources	Huggingface Datasets, HF Datasets Docs
Domains	Data_Loading, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

Packaged dataset builder for loading plain text files with line, paragraph, or document sampling.

Description

Text is an ArrowBasedBuilder subclass that loads plain text files into HuggingFace Datasets. It is one of the built-in packaged modules, meaning users can invoke it directly via load_dataset("text", ...) without writing a custom builder script. The builder is configured through TextConfig, a BuilderConfig dataclass that controls encoding, chunking behavior, linebreak handling, and the sampling unit.

The builder supports three sampling modes via the sample_by parameter:

"line" (default): Each line of the text file becomes a separate example. Lines are read in configurable chunks (default 10 MB) for memory efficiency.
"paragraph": Text is split on double newlines (\n\n), producing one example per paragraph.
"document": The entire file content is loaded as a single example.

When keep_linebreaks is False (the default), trailing newline characters are stripped in line mode. The builder supports custom features via the features parameter; if not provided, a single "text" string column is used. Feature casting supports both cheap Arrow schema casts and more expensive storage casts (e.g., string to Audio).
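The sampling units and the linebreak option described above can be pictured with a small standard-library sketch. This is a simplified illustration that works on one in-memory string, not the builder's actual code: `sample_text` is a hypothetical helper, and dropping empty paragraph pieces is a choice made for this sketch.

```python
def sample_text(text, sample_by="line", keep_linebreaks=False):
    """Split `text` into examples the way each sampling unit is described.

    Hypothetical helper for illustration only; the real builder reads
    files in chunks instead of holding a whole string in memory.
    """
    if sample_by == "line":
        # keepends=True leaves the trailing "\n" on each line.
        return text.splitlines(keepends=keep_linebreaks)
    if sample_by == "paragraph":
        # Split on blank lines and skip empty pieces (sketch assumption).
        return [piece for piece in text.split("\n\n") if piece]
    if sample_by == "document":
        # The whole text is a single example.
        return [text]
    raise ValueError(f"unsupported sample_by: {sample_by!r}")


text = "Title\nFirst line.\n\nSecond paragraph."

lines = sample_text(text, "line")
lines_with_breaks = sample_text(text, "line", keep_linebreaks=True)
paragraphs = sample_text(text, "paragraph")
documents = sample_text(text, "document")

print(len(lines), len(paragraphs), len(documents))
```

Note that in line mode the blank separator line becomes its own (empty) example, while paragraph mode groups the first two lines together.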

Usage

Use Text when you need to load plain text files (e.g., corpora, log files, raw text data) into a HuggingFace Dataset. It is typically invoked indirectly via load_dataset("text", data_files=...). Choose sample_by="line" for line-delimited data, "paragraph" for paragraph-delimited documents, or "document" to treat each file as a single example.

Code Reference

Source Location

Repository: datasets
File: src/datasets/packaged_modules/text/text.py
Lines: 1-120

Signature

from dataclasses import dataclass
from typing import Optional

import pyarrow as pa

import datasets


@dataclass
class TextConfig(datasets.BuilderConfig):
    """BuilderConfig for text files."""
    features: Optional[datasets.Features] = None
    encoding: str = "utf-8"
    encoding_errors: Optional[str] = None
    chunksize: int = 10 << 20  # 10MB
    keep_linebreaks: bool = False
    sample_by: str = "line"


class Text(datasets.ArrowBasedBuilder):
    BUILDER_CONFIG_CLASS = TextConfig

    # Method bodies are elided here; see the source file for the implementations.
    def _info(self): ...
    def _split_generators(self, dl_manager): ...
    def _cast_table(self, pa_table: pa.Table) -> pa.Table: ...
    def _generate_shards(self, base_files, files_iterables): ...
    def _generate_tables(self, base_files, files_iterables): ...

Import

# Typically used via load_dataset, not imported directly
from datasets import load_dataset

ds = load_dataset("text", data_files="path/to/file.txt")

I/O Contract

TextConfig Fields

Name	Type	Default	Description
features	`Optional[datasets.Features]`	`None`	Explicit schema for the output dataset. If None, a single `"text"` string column is used.
encoding	`str`	`"utf-8"`	Character encoding used to read the text files.
encoding_errors	`Optional[str]`	`None`	How to handle encoding errors (e.g., `"strict"`, `"ignore"`, `"replace"`). Passed to Python's `open()`.
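chunksize	`int`	`10485760` (10 MB)	Maximum amount of text read per chunk when sampling by line or paragraph. Because the files are opened in text mode with the configured encoding, the value most likely counts decoded characters rather than raw bytes.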
keep_linebreaks	`bool`	`False`	Whether to preserve trailing newline characters in line mode.
sample_by	`str`	`"line"`	Sampling unit: `"line"`, `"paragraph"`, or `"document"`.
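To make the role of `chunksize` concrete, here is a rough sketch of chunked line reading: read a fixed amount of text, then finish the line that the chunk boundary cut through, so no example is split in two. It is an illustration under assumptions, not the builder's code; `iter_line_batches` is a hypothetical name, and an in-memory `io.StringIO` stands in for an opened text file.

```python
import io


def iter_line_batches(stream, chunksize):
    """Yield lists of complete lines, reading about `chunksize` characters at a time."""
    while True:
        batch = stream.read(chunksize)
        if not batch:
            break  # end of stream
        # The chunk may stop mid-line; read up to the next newline to complete it.
        batch += stream.readline()
        yield batch.splitlines()


stream = io.StringIO("alpha\nbeta\ngamma\ndelta\n")
batches = list(iter_line_batches(stream, chunksize=8))
print(batches)
```

A larger chunk size means fewer, bigger batches; a smaller one lowers peak memory at the cost of more read calls.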

Inputs

Name	Type	Required	Description
data_files	`str`, `List[str]`, or `Dict[str, Union[str, List[str]]]`	Yes	Path(s) to the text file(s) to load. A dict maps split names to their files.

Outputs

Name	Type	Description
dataset	`Dataset` (or `DatasetDict` when `split` is not passed to `load_dataset`)	Arrow-backed dataset with a `"text"` column (or custom features if specified).

Usage Examples

Basic Line-by-Line Loading

from datasets import load_dataset

# Load a text file line-by-line
ds = load_dataset("text", data_files="corpus.txt", split="train")
print(ds[0])  # {"text": "First line of the corpus"}

Paragraph Sampling

from datasets import load_dataset

# Load text split by paragraphs (double newlines)
ds = load_dataset("text", data_files="book.txt", sample_by="paragraph", split="train")
print(ds[0])  # {"text": "First paragraph content..."}

Document Mode with Custom Encoding

from datasets import load_dataset

# Load each file as a single document, with latin-1 encoding.
# No split argument is passed, so the result is a DatasetDict keyed by split name.
ds = load_dataset(
    "text",
    data_files={"train": ["doc1.txt", "doc2.txt"]},
    sample_by="document",
    encoding="latin-1",
)
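Since `encoding` and `encoding_errors` are passed to Python's `open()`, their effect can be previewed with the standard library alone. The sketch below is only an illustration: the temporary file and sample text are made up, and it calls `open()` directly rather than going through the builder.

```python
import os
import tempfile

# "café" encoded as latin-1 contains the byte 0xE9, which is not valid UTF-8.
raw = "café\n".encode("latin-1")

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Matching encoding: the text decodes cleanly.
    with open(path, encoding="latin-1") as f:
        decoded_ok = f.read()

    # Wrong encoding with errors="replace": the bad byte becomes U+FFFD.
    with open(path, encoding="utf-8", errors="replace") as f:
        decoded_replaced = f.read()

    # Wrong encoding with the default error handling (errors=None): decoding raises.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        strict_failed = False
    except UnicodeDecodeError:
        strict_failed = True
finally:
    os.remove(path)

print(repr(decoded_ok), repr(decoded_replaced), strict_failed)
```

If a corpus has an unknown or mixed encoding, choosing an explicit error handler trades silent character loss against a hard failure at load time.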

Keeping Linebreaks

from datasets import load_dataset

# Preserve newline characters in the text
ds = load_dataset("text", data_files="code.py", keep_linebreaks=True, split="train")

Related Pages

Implements Principle

Principle:Huggingface_Datasets_Text_Dataset_Building
