Implementation:Huggingface Datatrove BaseFormatter
| Knowledge Sources | |
|---|---|
| Domains | Data Processing, Text Formatting |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
BaseFormatter is the abstract base class for all text formatting pipeline steps in the Datatrove framework, defining the interface for in-place text transformation of documents.
Description
BaseFormatter extends both PipelineStep and Python's ABC, establishing the contract that every concrete formatter must implement the format method. Unlike filters, which make binary keep/drop decisions, formatters transform document text in place: the format method receives a string and returns a modified string, and the run method assigns that result back to each document's text field.
The class is intentionally minimal at 23 lines. The run method iterates over the incoming DocumentsPipeline, tracks document counts via StatHints.total, wraps each format call in track_time for performance monitoring, replaces doc.text with the formatted result, and yields the document downstream. Every document passes through; formatters never drop documents.
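The loop described above can be sketched without the library. This is a minimal illustration, not the Datatrove source: `Doc`, `SketchFormatter`, and `StripFormatter` are hypothetical stand-ins, and the `total` and `elapsed` attributes only approximate what StatHints.total and track_time record in the real class.

```python
from abc import ABC, abstractmethod
from contextlib import contextmanager
from dataclasses import dataclass
from time import perf_counter
from typing import Iterable, Iterator


@dataclass
class Doc:
    """Hypothetical stand-in for a Datatrove document; only `text` matters here."""
    text: str


class SketchFormatter(ABC):
    """Illustrative stand-in for BaseFormatter, not the library class."""

    def __init__(self) -> None:
        self.total = 0      # assumption: plays the role of the StatHints.total counter
        self.elapsed = 0.0  # assumption: plays the role of the time tracked by track_time

    @contextmanager
    def track_time(self) -> Iterator[None]:
        # Accumulate wall-clock time spent inside the `with` block.
        start = perf_counter()
        try:
            yield
        finally:
            self.elapsed += perf_counter() - start

    @abstractmethod
    def format(self, text: str) -> str:
        ...

    def run(self, data: Iterable[Doc], rank: int = 0, world_size: int = 1) -> Iterator[Doc]:
        for doc in data:
            self.total += 1                       # count every incoming document
            with self.track_time():               # time only the format call
                doc.text = self.format(doc.text)  # replace the text in place
            yield doc                             # every document is passed on


class StripFormatter(SketchFormatter):
    """Hypothetical concrete formatter: trims surrounding whitespace."""

    def format(self, text: str) -> str:
        return text.strip()


formatter = StripFormatter()
docs = [Doc("  hello "), Doc("world\n")]
result = [doc.text for doc in formatter.run(docs)]
```

Because run is a generator, the counter and timer only advance as the downstream consumer pulls documents through.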
This architecture cleanly separates the what (the text transformation logic in subclasses) from the how (the pipeline iteration, statistics tracking, and timing in the base class). Subclasses only need to implement a pure function from string to string, making them easy to test and compose.
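Since the only subclass-specific logic is a string-to-string method, it can be tested and chained without running a pipeline. The sketch below assumes plain classes (`Lowercase`, `CollapseWhitespace`) in place of real BaseFormatter subclasses, and `compose` is a hypothetical helper, not part of Datatrove.

```python
from typing import Callable, Protocol


class HasFormat(Protocol):
    """Anything exposing the same format(text) -> str method as a formatter."""

    def format(self, text: str) -> str: ...


class Lowercase:
    def format(self, text: str) -> str:
        return text.lower()


class CollapseWhitespace:
    def format(self, text: str) -> str:
        # Split on any whitespace run and rejoin with single spaces.
        return " ".join(text.split())


def compose(*steps: HasFormat) -> Callable[[str], str]:
    """Hypothetical helper: apply each step's format method in order."""

    def apply(text: str) -> str:
        for step in steps:
            text = step.format(text)
        return text

    return apply


clean = compose(Lowercase(), CollapseWhitespace())
cleaned = clean("  Hello   WORLD ")
```

In an actual pipeline the same effect comes from listing the formatter steps one after another; the helper here only shows that the transformations themselves are independent of the pipeline machinery.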
Usage
Use BaseFormatter as the parent class when implementing any custom text formatting step in a Datatrove pipeline. A subclass only has to implement the format method; iteration, statistics, and timing are inherited from run. Because format is abstract, BaseFormatter itself should never be instantiated directly.
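The restriction on direct instantiation follows from standard Python ABC behavior, which can be shown with a small stand-in. `FormatterContract` and `Upper` are hypothetical names used only for this illustration; the sketch assumes the real class enforces its abstract method the same way.

```python
from abc import ABC, abstractmethod


class FormatterContract(ABC):
    """Stand-in with the same abstract method as BaseFormatter."""

    @abstractmethod
    def format(self, text: str) -> str:
        ...


try:
    FormatterContract()  # abstract method not implemented
    instantiated = True
except TypeError:
    instantiated = False  # Python refuses to build the abstract class


class Upper(FormatterContract):
    """Concrete subclass: implementing format makes it instantiable."""

    def format(self, text: str) -> str:
        return text.upper()


shouted = Upper().format("ok")
```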
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/formatters/base.py
- Lines: 1-23
Signature
class BaseFormatter(PipelineStep, ABC):
    type = "✂️ - FORMAT"

    def __init__(self):
        ...

    @abstractmethod
    def format(self, text: str) -> str:
        ...

    def run(self, data: DocumentsPipeline, rank: int = 0, world_size: int = 1) -> DocumentsPipeline:
        ...
Import
from datatrove.pipeline.formatters.base import BaseFormatter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data | DocumentsPipeline | Yes | Incoming stream of documents to format (passed to the run method) |
Outputs
| Name | Type | Description |
|---|---|---|
| data | DocumentsPipeline (generator) | Yields all documents with their text field replaced by the formatted version |
Usage Examples
Basic Usage
from datatrove.pipeline.formatters.base import BaseFormatter


class LowercaseFormatter(BaseFormatter):
    name = "Lowercase"

    # Only the string transformation is defined here; run() is inherited.
    def format(self, text: str) -> str:
        return text.lower()