Implementation:Huggingface Datasets Text Builder

From Leeroopedia
Knowledge Sources
Domains: Data_Loading, NLP
Last Updated: 2026-02-14 18:00 GMT

Overview

Packaged dataset builder for loading plain text files with line, paragraph, or document sampling.

Description

Text is an ArrowBasedBuilder subclass that loads plain text files into HuggingFace Datasets. It is one of the built-in packaged modules, meaning users can invoke it directly via load_dataset("text", ...) without writing a custom builder script. The builder is configured through TextConfig, a BuilderConfig dataclass that controls encoding, chunking behavior, linebreak handling, and the sampling unit.

The builder supports three sampling modes via the sample_by parameter:

  • "line" (default): Each line of the text file becomes a separate example. Lines are read in configurable chunks (default 10 MB) for memory efficiency.
  • "paragraph": Text is split on double newlines (\n\n), producing one example per paragraph.
  • "document": The entire file content is loaded as a single example.
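
The three modes can be pictured with a small pure-Python sketch. It only illustrates the splitting rules listed above on an in-memory string and is not the builder's actual code; the helper name split_examples is hypothetical, and skipping empty paragraphs is an assumption of this sketch.

```python
from typing import List


def split_examples(text: str, sample_by: str = "line") -> List[str]:
    """Hypothetical helper mimicking the three sampling units on a string."""
    if sample_by == "line":
        # One example per line, without the newline terminator.
        return text.splitlines()
    if sample_by == "paragraph":
        # One example per block separated by a blank line; empty blocks are skipped.
        return [p for p in text.split("\n\n") if p]
    if sample_by == "document":
        # The whole input is a single example.
        return [text]
    raise ValueError(f"unknown sample_by: {sample_by!r}")


sample = "alpha\nbeta\n\ngamma\n"
print(split_examples(sample, "line"))       # ['alpha', 'beta', '', 'gamma']
print(split_examples(sample, "paragraph"))  # ['alpha\nbeta', 'gamma\n']
print(split_examples(sample, "document"))   # ['alpha\nbeta\n\ngamma\n']
```

Note that in this sketch the blank separator line shows up as an empty example in line mode, while paragraph mode keeps the inner newline of a multi-line paragraph.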

When keep_linebreaks is False (the default), trailing newline characters are stripped in line mode. The builder supports custom features via the features parameter; if not provided, a single "text" string column is used. Feature casting supports both cheap Arrow schema casts and more expensive storage casts (e.g., string to Audio).
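
The effect of keep_linebreaks in line mode can be reproduced with Python's own str.splitlines, used here as a stand-in for whatever the builder does internally. The function name lines_as_examples is hypothetical.

```python
from typing import List


def lines_as_examples(text: str, keep_linebreaks: bool = False) -> List[str]:
    """Hypothetical stand-in for line mode: split text into one example per line."""
    # keepends=True leaves the "\n" terminator on each line; the default drops it.
    return text.splitlines(keepends=keep_linebreaks)


text = "first line\nsecond line\n"
print(lines_as_examples(text))                        # ['first line', 'second line']
print(lines_as_examples(text, keep_linebreaks=True))  # ['first line\n', 'second line\n']
```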

Usage

Use Text when you need to load plain text files (e.g., corpora, log files, raw text data) into a HuggingFace Dataset. It is typically invoked indirectly via load_dataset("text", data_files=...). Choose sample_by="line" for line-delimited data, "paragraph" for paragraph-delimited documents, or "document" to treat each file as a single example.

Code Reference

Source Location

  • Repository: datasets
  • File: src/datasets/packaged_modules/text/text.py
  • Lines: 1-120

Signature

from dataclasses import dataclass
from typing import Optional

import pyarrow as pa

import datasets


@dataclass
class TextConfig(datasets.BuilderConfig):
    """BuilderConfig for text files."""
    features: Optional[datasets.Features] = None
    encoding: str = "utf-8"
    encoding_errors: Optional[str] = None
    chunksize: int = 10 << 20  # 10MB
    keep_linebreaks: bool = False
    sample_by: str = "line"


class Text(datasets.ArrowBasedBuilder):
    BUILDER_CONFIG_CLASS = TextConfig

    # Method bodies are omitted; only the signatures are shown.
    def _info(self): ...
    def _split_generators(self, dl_manager): ...
    def _cast_table(self, pa_table: pa.Table) -> pa.Table: ...
    def _generate_shards(self, base_files, files_iterables): ...
    def _generate_tables(self, base_files, files_iterables): ...

Import

# Typically used via load_dataset, not imported directly
from datasets import load_dataset

ds = load_dataset("text", data_files="path/to/file.txt")

I/O Contract

TextConfig Fields

Name | Type | Default | Description
features | Optional[datasets.Features] | None | Explicit schema for the output dataset. If None, a single "text" string column is used.
encoding | str | "utf-8" | Character encoding used to read the text files.
encoding_errors | Optional[str] | None | How to handle encoding errors (e.g., "strict", "ignore", "replace"). Passed to Python's open().
chunksize | int | 10485760 (10 MB) | Number of bytes to read per chunk when sampling by line or paragraph.
keep_linebreaks | bool | False | Whether to preserve trailing newline characters in line mode.
sample_by | str | "line" | Sampling unit: "line", "paragraph", or "document".
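
To make the role of chunksize concrete, here is a minimal sketch of chunked line reading. It assumes a read-then-finish-the-line strategy so that no line is cut in half at a chunk boundary; the function name iter_line_batches is hypothetical, and because the sketch reads from an in-memory text stream, its chunksize counts characters.

```python
import io
from typing import Iterator, List


def iter_line_batches(f: io.TextIOBase, chunksize: int) -> Iterator[List[str]]:
    """Yield lists of complete lines, reading about `chunksize` characters at a time."""
    while True:
        chunk = f.read(chunksize)
        if not chunk:
            break
        # The chunk may stop mid-line, so read on to the end of the current line.
        chunk += f.readline()
        yield chunk.splitlines()


stream = io.StringIO("one\ntwo\nthree\nfour\n")
for batch in iter_line_batches(stream, chunksize=5):
    print(batch)
# ['one', 'two']
# ['three']
# ['four']
```

Each batch holds only whole lines, so memory use is bounded by roughly one chunk plus one line, regardless of the file size.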

Inputs

Name | Type | Required | Description
data_files | str, List[str], or Dict[str, Union[str, List[str]]] | Yes | Path(s) to the text file(s) to load. A dict maps split names to paths.
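
The three accepted shapes of data_files can be thought of as different spellings of one mapping from split name to file list. The sketch below is a hypothetical helper, not a function of the datasets library, and it assumes that paths given without a split name belong to a "train" split, which is consistent with the examples on this page that pass a bare path and select split="train".

```python
from typing import Dict, List, Union

DataFiles = Union[str, List[str], Dict[str, Union[str, List[str]]]]


def normalize_data_files(data_files: DataFiles, default_split: str = "train") -> Dict[str, List[str]]:
    """Hypothetical helper: coerce any accepted shape to {split_name: [paths]}."""
    if isinstance(data_files, str):
        return {default_split: [data_files]}
    if isinstance(data_files, list):
        return {default_split: list(data_files)}
    # Dict case: each value may be a single path or a list of paths.
    return {
        split: [paths] if isinstance(paths, str) else list(paths)
        for split, paths in data_files.items()
    }


print(normalize_data_files("corpus.txt"))
# {'train': ['corpus.txt']}
print(normalize_data_files({"train": ["doc1.txt", "doc2.txt"], "test": "doc3.txt"}))
# {'train': ['doc1.txt', 'doc2.txt'], 'test': ['doc3.txt']}
```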

Outputs

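Name | Type | Description
dataset | Dataset or DatasetDict | Arrow-backed dataset with a "text" column (or custom features if specified). load_dataset returns a Dataset when split is given and a DatasetDict keyed by split name otherwise.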

Usage Examples

Basic Line-by-Line Loading

from datasets import load_dataset

# Load a text file line-by-line
ds = load_dataset("text", data_files="corpus.txt", split="train")
print(ds[0])  # {"text": "First line of the corpus"}

Paragraph Sampling

from datasets import load_dataset

# Load text split by paragraphs (double newlines)
ds = load_dataset("text", data_files="book.txt", sample_by="paragraph", split="train")
print(ds[0])  # {"text": "First paragraph content..."}

Document Mode with Custom Encoding

from datasets import load_dataset

# Load each file as a single document, with latin-1 encoding.
# No split is requested, so the result is a DatasetDict keyed by split name ("train" here).
ds = load_dataset(
    "text",
    data_files={"train": ["doc1.txt", "doc2.txt"]},
    sample_by="document",
    encoding="latin-1",
)
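
Why the encoding and encoding_errors settings matter can be seen with plain Python codecs, independent of the builder. The sketch below decodes one latin-1 byte string in several ways; it assumes the error-handler names behave as they do for Python's built-in bytes.decode, which accepts the same "strict", "ignore", and "replace" values.

```python
raw = "caf\u00e9".encode("latin-1")  # b'caf\xe9': valid latin-1, invalid UTF-8

as_latin1 = raw.decode("latin-1")                 # 'café', the intended text
ignored = raw.decode("utf-8", errors="ignore")    # 'caf', the bad byte is dropped
replaced = raw.decode("utf-8", errors="replace")  # 'caf\ufffd', U+FFFD marks the bad byte

try:
    raw.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    # "strict" refuses to guess and raises instead.
    strict_failed = True

print(strict_failed)  # True
```

Choosing the right encoding is the only option that recovers the original text; "ignore" and "replace" merely keep loading from failing.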

Keeping Linebreaks

from datasets import load_dataset

# Preserve newline characters in the text
ds = load_dataset("text", data_files="code.py", keep_linebreaks=True, split="train")
