Implementation:Datajuicer Data juicer TextFormatter
| Knowledge Sources | |
|---|---|
| Domains | Data_Loading, Formatting |
| Last Updated | 2026-02-14 16:00 GMT |
Overview
A concrete tool, provided by Data-Juicer, for loading text files, source code files, PDFs, and DOCX files and formatting them as datasets.
Description
TextFormatter loads and formats plain text files, source code files, PDF files, and DOCX files into Data-Juicer datasets, treating each file as a single text sample. It supports more than 50 file extensions, including .txt, .md, .py, .java, .pdf, and .docx, along with many other programming-language source files. For PDF files, it uses pdfplumber to extract text while removing tables and page numbers. For DOCX files, it uses python-docx to extract paragraph text. The extracted text is cached to disk, and all files are then loaded through HuggingFace's load_dataset with sample_by='document', so that one file yields one sample. PDF and DOCX extraction runs in parallel via multiprocessing.Pool.
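The page-number removal mentioned above can be illustrated with a small, library-free sketch. It assumes that a page number, when present, is the last thing in a page's extracted text and equals the 1-based page index; the helper names `strip_trailing_page_number` and `join_pages` are hypothetical and are not taken from the Data-Juicer source.

```python
from typing import List


def strip_trailing_page_number(page_text: str, page_number: int) -> str:
    """Drop a page number that sits at the very end of a page's text.

    Assumption: the number is the final token on the page. Text that
    legitimately ends in the same digits would also be trimmed.
    """
    stripped = page_text.rstrip()
    marker = str(page_number)
    if stripped.endswith(marker):
        stripped = stripped[: -len(marker)]
    return stripped


def join_pages(pages: List[str]) -> str:
    """Clean each page and join the non-empty ones into one document string."""
    cleaned = []
    for number, text in enumerate(pages, start=1):
        body = strip_trailing_page_number(text, number).strip()
        if body:  # skip pages that held nothing but a page number
            cleaned.append(body)
    return "\n".join(cleaned)
```

A suffix check of this kind is cheap but heuristic: a page whose real content ends with digits matching its page number loses those digits, which is usually an acceptable trade-off for bulk text extraction.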
Usage
Use it when ingesting raw text from plain text files, source code, PDFs, or Word documents for training-data preparation. With 50+ supported extensions, it is the most versatile formatter in the system.
Code Reference
Source Location
- Repository: Datajuicer_Data_juicer
- File: data_juicer/format/text_formatter.py
Signature
@FORMATTERS.register_module()
class TextFormatter(LocalFormatter):
    SUFFIXES = [".docx", ".pdf", ".txt", ".md", ".tex", ".py", ".java", ...]  # 50+ extensions

    def __init__(self, dataset_path, suffixes=None, add_suffix=False, **kwargs):
    def load_dataset(self, num_proc: Optional[int] = None, global_cfg=None) -> Dataset:

def extract_txt_from_docx(fn, tgt_path):
def extract_txt_from_pdf(fn, tgt_path):
Import
from data_juicer.format.text_formatter import TextFormatter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset_path | str | Yes | Path to a text file or directory containing text/code/PDF/DOCX files |
| suffixes | list | No | File suffixes to be processed. Default: 50+ supported extensions |
| add_suffix | bool | No | Whether to add file suffix to dataset meta info. Default: False |
| num_proc | int | No | Number of processes for parallel loading and PDF/DOCX extraction (argument to load_dataset, not the constructor) |
| **kwargs | Any | No | Extra arguments passed to the parent LocalFormatter |
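To make the role of the suffixes input concrete, here is a minimal standard-library sketch of suffix-based file discovery. It is not the Data-Juicer implementation: the function name `find_files_by_suffix`, the shortened `DEFAULT_SUFFIXES` list, and the case-insensitive matching are all assumptions made for illustration.

```python
import os
from typing import Dict, List, Optional

# Illustrative subset only; the real formatter lists 50+ extensions.
DEFAULT_SUFFIXES = [".txt", ".md", ".py", ".pdf", ".docx"]


def find_files_by_suffix(
    path: str, suffixes: Optional[List[str]] = None
) -> Dict[str, List[str]]:
    """Group files under `path` by suffix, keeping only the wanted ones.

    `path` may be a single file or a directory that is walked recursively.
    Matching is case-insensitive here, which is an assumption of this sketch.
    """
    wanted = {s.lower() for s in (suffixes or DEFAULT_SUFFIXES)}
    if os.path.isfile(path):
        candidates = [path]
    else:
        candidates = [
            os.path.join(root, name)
            for root, _, names in os.walk(path)
            for name in names
        ]
    grouped: Dict[str, List[str]] = {}
    for file_path in sorted(candidates):
        suffix = os.path.splitext(file_path)[1].lower()
        if suffix in wanted:
            grouped.setdefault(suffix, []).append(file_path)
    return grouped
```

Grouping by suffix keeps PDF and DOCX paths separate from plain-text paths, so that only the former need an extraction step before loading.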
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | A unified HuggingFace Dataset where each file becomes one text sample |
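The "one file, one sample" output contract can be sketched without any third-party library. The function `files_to_samples` and the `"suffix"` key below are hypothetical stand-ins: the real formatter returns a HuggingFace Dataset, and the field it uses for suffix metadata may be named differently.

```python
import os
from typing import Dict, List


def files_to_samples(paths: List[str], add_suffix: bool = False) -> List[Dict[str, str]]:
    """Read each file whole, so that one file yields exactly one sample."""
    samples = []
    for file_path in paths:
        with open(file_path, encoding="utf-8") as handle:
            sample = {"text": handle.read()}  # entire file, not split by line
        if add_suffix:
            # Hypothetical metadata key, for illustration only.
            sample["suffix"] = os.path.splitext(file_path)[1]
        samples.append(sample)
    return samples
```

Reading whole files rather than lines preserves document boundaries, which matters for source code and long-form text where a single line carries little meaning on its own.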
Usage Examples
from data_juicer.format.text_formatter import TextFormatter

# Load text files from a directory
formatter = TextFormatter(dataset_path="/path/to/documents/")
dataset = formatter.load_dataset(num_proc=4)

# Load only Python source code files
formatter = TextFormatter(
    dataset_path="/path/to/code/",
    suffixes=[".py"],
    add_suffix=True
)
dataset = formatter.load_dataset(num_proc=8)

# Load PDF files (auto-extracts text)
formatter = TextFormatter(
    dataset_path="/path/to/pdfs/",
    suffixes=[".pdf"]
)
dataset = formatter.load_dataset(num_proc=4)