Implementation:Huggingface Datasets Text Builder
| Knowledge Sources | |
|---|---|
| Domains | Data_Loading, NLP |
| Last Updated | 2026-02-14 18:00 GMT |
Overview
Packaged dataset builder for loading plain text files with line, paragraph, or document sampling.
Description
Text is an ArrowBasedBuilder subclass that loads plain text files into HuggingFace Datasets. It is one of the built-in packaged modules, meaning users can invoke it directly via load_dataset("text", ...) without writing a custom builder script. The builder is configured through TextConfig, a BuilderConfig dataclass that controls encoding, chunking behavior, linebreak handling, and the sampling unit.
The builder supports three sampling modes via the sample_by parameter:
- "line" (default): Each line of the text file becomes a separate example. Lines are read in configurable chunks (default 10 MB) for memory efficiency.
- "paragraph": Text is split on double newlines (\n\n), producing one example per paragraph.
- "document": The entire file content is loaded as a single example.
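The three sampling units can be pictured with a small pure-Python sketch. This is a simplified illustration and not the builder's actual code: the function name `split_samples` is hypothetical, and dropping empty paragraphs is an assumption made here for clarity.

```python
from typing import List


def split_samples(text: str, sample_by: str = "line", keep_linebreaks: bool = False) -> List[str]:
    """Illustrative only: split text into examples by line, paragraph, or document."""
    if sample_by == "line":
        # splitlines(keepends=...) keeps or drops each line terminator
        return text.splitlines(keepends=keep_linebreaks)
    if sample_by == "paragraph":
        # Paragraphs are separated by a double newline; empty pieces are dropped (assumption)
        return [p for p in text.split("\n\n") if p]
    if sample_by == "document":
        # The whole text is a single example
        return [text]
    raise ValueError(f"unsupported sample_by: {sample_by!r}")


sample = "alpha\nbeta\n\ngamma\n"
lines = split_samples(sample, "line")            # four examples, one of them empty
paragraphs = split_samples(sample, "paragraph")  # two examples
documents = split_samples(sample, "document")    # one example
```

Note that in this sketch the blank line between paragraphs shows up as an empty example in line mode, which is worth keeping in mind when choosing a sampling unit for paragraph-formatted text.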
When keep_linebreaks is False (the default), trailing newline characters are stripped in line mode. The builder supports custom features via the features parameter; if not provided, a single "text" string column is used. Feature casting supports both cheap Arrow schema casts and more expensive storage casts (e.g., string to Audio).
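Chunked reading in line mode, together with keep_linebreaks, might look roughly like the sketch below. It assumes the common pattern of reading a fixed-size chunk and then extending it to the end of the current line so that no line is split across batches; the helper name `iter_line_batches` is hypothetical and the real builder additionally wraps each batch in an Arrow table.

```python
import io
from typing import Iterator, List, TextIO


def iter_line_batches(f: TextIO, chunksize: int, keep_linebreaks: bool = False) -> Iterator[List[str]]:
    """Illustrative only: yield lists of complete lines, one list per chunk."""
    while True:
        batch = f.read(chunksize)
        if not batch:
            break
        # Finish the current line so the chunk never ends mid-line
        batch += f.readline()
        yield batch.splitlines(keepends=keep_linebreaks)


# A tiny chunksize makes the batching visible; the documented default is 10 << 20
stream = io.StringIO("one\ntwo\nthree\nfour\n")
batches = list(iter_line_batches(stream, chunksize=5))
```

With a chunk size of 5, the first read stops in the middle of "two", and the extra readline() completes that line before the batch is emitted.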
Usage
Use Text when you need to load plain text files (e.g., corpora, log files, raw text data) into a HuggingFace Dataset. It is typically invoked indirectly via load_dataset("text", data_files=...). Choose sample_by="line" for line-delimited data, "paragraph" for paragraph-delimited documents, or "document" to treat each file as a single example.
Code Reference
Source Location
- Repository: datasets
- File: src/datasets/packaged_modules/text/text.py
- Lines: 1-120
Signature
@dataclass
class TextConfig(datasets.BuilderConfig):
    """BuilderConfig for text files."""

    features: Optional[datasets.Features] = None
    encoding: str = "utf-8"
    encoding_errors: Optional[str] = None
    chunksize: int = 10 << 20  # 10MB
    keep_linebreaks: bool = False
    sample_by: str = "line"


class Text(datasets.ArrowBasedBuilder):
    BUILDER_CONFIG_CLASS = TextConfig

    def _info(self): ...
    def _split_generators(self, dl_manager): ...
    def _cast_table(self, pa_table: pa.Table) -> pa.Table: ...
    def _generate_shards(self, base_files, files_iterables): ...
    def _generate_tables(self, base_files, files_iterables): ...
Import
# Typically used via load_dataset, not imported directly
from datasets import load_dataset
ds = load_dataset("text", data_files="path/to/file.txt")
I/O Contract
TextConfig Fields
| Name | Type | Default | Description |
|---|---|---|---|
| features | Optional[datasets.Features] | None | Explicit schema for the output dataset. If None, a single "text" string column is used. |
| encoding | str | "utf-8" | Character encoding used to read the text files. |
| encoding_errors | Optional[str] | None | How to handle encoding errors (e.g., "strict", "ignore", "replace"). Passed to Python's open(). |
| chunksize | int | 10485760 (10 MB) | Number of bytes to read per chunk when sampling by line or paragraph. |
| keep_linebreaks | bool | False | Whether to preserve trailing newline characters in line mode. |
| sample_by | str | "line" | Sampling unit: "line", "paragraph", or "document". |
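Since encoding and encoding_errors are described as being passed to Python's open(), their effect can be previewed with the standard library alone. The sketch below writes a temporary file containing a byte that is valid Latin-1 but invalid UTF-8 and reads it back under different error handlers; the helper `read_with` and the temporary file are illustrative and not part of the builder.

```python
import os
import tempfile

# "café" encoded as Latin-1 contains the byte 0xE9, which is not valid UTF-8 here
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("café\n".encode("latin-1"))
    path = tmp.name


def read_with(errors):
    """Read the file as UTF-8 using the given error handler."""
    with open(path, encoding="utf-8", errors=errors) as f:
        return f.read()


replaced = read_with("replace")  # undecodable byte becomes U+FFFD
ignored = read_with("ignore")    # undecodable byte is dropped

try:
    read_with("strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Reading with the matching encoding recovers the original text
with open(path, encoding="latin-1") as f:
    decoded = f.read()

os.remove(path)
```

In practice, passing the correct encoding is usually preferable to relaxing encoding_errors, because "ignore" and "replace" silently lose information.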
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data_files | str, List[str], or Dict[str, str/List[str]] | Yes | Path(s) to the text file(s) to load. |
Outputs
| Name | Type | Description |
|---|---|---|
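| dataset | Dataset or DatasetDict | Arrow-backed dataset with a "text" column (or custom features if specified). load_dataset returns a Dataset when split is passed and a DatasetDict keyed by split name otherwise. |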
Usage Examples
Basic Line-by-Line Loading
from datasets import load_dataset
# Load a text file line-by-line
ds = load_dataset("text", data_files="corpus.txt", split="train")
print(ds[0]) # {"text": "First line of the corpus"}
Paragraph Sampling
from datasets import load_dataset
# Load text split by paragraphs (double newlines)
ds = load_dataset("text", data_files="book.txt", sample_by="paragraph", split="train")
print(ds[0]) # {"text": "First paragraph content..."}
Document Mode with Custom Encoding
from datasets import load_dataset
# Load each file as a single document, with latin-1 encoding.
# No split is passed, so the result is a DatasetDict with a "train" key.
ds = load_dataset(
    "text",
    data_files={"train": ["doc1.txt", "doc2.txt"]},
    sample_by="document",
    encoding="latin-1",
)
Keeping Linebreaks
from datasets import load_dataset
# Preserve the trailing newline of each line (useful when whitespace matters, e.g. source code)
ds = load_dataset("text", data_files="code.py", keep_linebreaks=True, split="train")