Implementation:Huggingface Datatrove FtfyFormatter
| Knowledge Sources | |
|---|---|
| Domains | Data Processing, Text Formatting, Text Encoding |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
FTFYFormatter is a text formatter that uses the ftfy (fixes text for you) library to repair encoding errors and mojibake in document text, while intentionally avoiding strict normalization that would reduce character diversity.
Description
FTFYFormatter extends BaseFormatter to apply the ftfy library's text repair capabilities to each document in the pipeline. The formatter is configured with a deliberate philosophy: fixing unreadable or wrong encoding is good, but enforcing a specific or strict formatting is not. This reflects the goal of training language models that can recognize a wide variety of characters and formats, rather than mapping everything to a narrow canonical form.
By default the formatter enables the encoding-related fixes fix_encoding, restore_byte_a0, replace_lossy_sequences, decode_inconsistent_utf8, fix_c1_controls, fix_surrogates, and remove_control_chars. It also enables two display-related fixes: unescape_html (in its "auto" mode) and remove_terminal_escapes. It explicitly disables the normalization-oriented options: fix_latin_ligatures (would reduce ligature diversity), fix_character_width (would enforce the wrong punctuation width for CJK text), uncurl_quotes (would prevent models from learning curly quotes), and fix_line_breaks (a borderline case, left off). Unicode normalization is also off: normalization defaults to None, so none of NFC, NFD, NFKC, or NFKD is applied.
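The difference between repairing an encoding and normalizing text can be illustrated with the Python standard library alone. The sketch below is not ftfy's implementation: it reproduces one classic mojibake case by hand, then uses unicodedata's NFKC form as a rough stand-in for what options such as fix_latin_ligatures and fix_character_width would collapse (ftfy's own options are narrower than full NFKC). All variable names are illustrative.

```python
import unicodedata

# --- Encoding repair: recovers text that was decoded with the wrong codec ---
original = "café"
# Mojibake arises when UTF-8 bytes are misread as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")
# Reversing the mistaken decode step recovers the intended text.
repaired = garbled.encode("latin-1").decode("utf-8")

# --- Normalization: maps valid, distinct characters onto a canonical form ---
ligature = "\ufb01"   # LATIN SMALL LIGATURE FI
fullwidth = "\uff01"  # FULLWIDTH EXCLAMATION MARK, common in CJK text
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
nfkc_fullwidth = unicodedata.normalize("NFKC", fullwidth)

print(repr(garbled), "->", repr(repaired))
print(repr(ligature), "->", repr(nfkc_ligature))
print(repr(fullwidth), "->", repr(nfkc_fullwidth))
```

The first transformation restores information that was damaged, while the second discards distinctions that were present in valid text, which is the kind of change the default configuration avoids.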
The ftfy TextFixerConfig is constructed once at initialization time from the configured options, and the format method simply calls ftfy.fix_text with this config. TextFixerConfig is imported and instantiated inside __init__, while the ftfy module used for the actual fixing is imported lazily inside the format method; neither import runs at module load time.
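The structure described above can be sketched without the datatrove base class. This is a minimal sketch, not the library's source: SketchFTFYFormatter and FTFY_DEFAULTS are hypothetical names, the defaults are copied from the signature documented on this page, and the code assumes an ftfy release that provides TextFixerConfig and accepts a config argument in fix_text.

```python
# Defaults mirrored from the FTFYFormatter signature documented on this page.
FTFY_DEFAULTS = {
    "unescape_html": "auto",
    "remove_terminal_escapes": True,
    "fix_encoding": True,
    "restore_byte_a0": True,
    "replace_lossy_sequences": True,
    "decode_inconsistent_utf8": True,
    "fix_c1_controls": True,
    "fix_latin_ligatures": False,
    "fix_character_width": False,
    "uncurl_quotes": False,
    "fix_line_breaks": False,
    "fix_surrogates": True,
    "remove_control_chars": True,
    "normalization": None,
}


class SketchFTFYFormatter:
    """Hypothetical stand-in for FTFYFormatter (no datatrove dependency)."""

    def __init__(self, **overrides):
        # Deferred import: ftfy is needed only once an instance is created.
        from ftfy import TextFixerConfig

        # Build the config a single time; format() reuses it for every text.
        self.config = TextFixerConfig(**{**FTFY_DEFAULTS, **overrides})

    def format(self, text: str) -> str:
        # Deferred import of the module that performs the actual repair.
        import ftfy

        return ftfy.fix_text(text, config=self.config)
```

With ftfy installed, SketchFTFYFormatter().format("âœ” No problems") should repair the mojibake check mark, as in the example from ftfy's own documentation. Building the config once keeps the per-document work down to a single fix_text call.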
Usage
Use FTFYFormatter as an early-stage text repair step in data processing pipelines to fix encoding issues in web-crawled or heterogeneous text corpora. It is typically placed before other formatting or filtering steps so that downstream processing operates on cleanly encoded text.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/formatters/ftfy.py
- Lines: 1-61
Signature
class FTFYFormatter(BaseFormatter):
    name = "😎 FTFY"
    _requires_dependencies = ["ftfy"]

    def __init__(
        self,
        unescape_html: str | bool = "auto",
        remove_terminal_escapes: bool = True,
        fix_encoding: bool = True,
        restore_byte_a0: bool = True,
        replace_lossy_sequences: bool = True,
        decode_inconsistent_utf8: bool = True,
        fix_c1_controls: bool = True,
        fix_latin_ligatures: bool = False,
        fix_character_width: bool = False,
        uncurl_quotes: bool = False,
        fix_line_breaks: bool = False,
        fix_surrogates: bool = True,
        remove_control_chars: bool = True,
        normalization: Literal["NFC", "NFD", "NFKC", "NFKD"] | None = None,
    ):
        ...

    def format(self, text: str) -> str:
        ...
Import
from datatrove.pipeline.formatters.ftfy import FTFYFormatter
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| unescape_html | str or bool | No | Whether to unescape HTML entities (default: "auto") |
| remove_terminal_escapes | bool | No | Remove terminal/ANSI escape sequences (default: True) |
| fix_encoding | bool | No | Fix mojibake and encoding errors (default: True) |
| restore_byte_a0 | bool | No | Restore byte 0xA0 (non-breaking space) (default: True) |
| replace_lossy_sequences | bool | No | Replace lossy byte sequences (default: True) |
| decode_inconsistent_utf8 | bool | No | Decode inconsistent UTF-8 sequences (default: True) |
| fix_c1_controls | bool | No | Fix C1 control characters (default: True) |
| fix_latin_ligatures | bool | No | Decompose Latin ligatures (default: False) |
| fix_character_width | bool | No | Replace fullwidth and halfwidth characters with their standard-width forms (default: False) |
| uncurl_quotes | bool | No | Convert curly quotes to straight quotes (default: False) |
| fix_line_breaks | bool | No | Normalize line break characters (default: False) |
| fix_surrogates | bool | No | Replace UTF-16 surrogate codepoint sequences with the characters they were meant to encode (default: True) |
| remove_control_chars | bool | No | Remove control characters (default: True) |
| normalization | str or None | No | Unicode normalization form: NFC, NFD, NFKC, NFKD, or None (default: None) |
Outputs
| Name | Type | Description |
|---|---|---|
| data | DocumentsPipeline (generator) | Yields all documents with encoding-repaired text |
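The output contract (every input document is yielded, with its text passed through format) can be pictured with a small stand-alone sketch. Everything here is an assumption for illustration: Doc is a hypothetical minimal document type, run_formatter is not datatrove's API, in-place mutation of the text is assumed rather than confirmed, and toy_fix is a hand-written stand-in for the ftfy-backed format method.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Doc:
    """Hypothetical minimal document; the real Document type has more fields."""

    text: str
    id: str


def run_formatter(docs: Iterable[Doc], format_fn: Callable[[str], str]) -> Iterator[Doc]:
    # Formatters transform text but never drop documents: one output per input.
    for doc in docs:
        doc.text = format_fn(doc.text)
        yield doc


def toy_fix(text: str) -> str:
    # Stand-in repair: undo UTF-8 bytes misread as Latin-1, else leave text alone.
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


docs = [Doc(text="caf\u00c3\u00a9", id="0"), Doc(text="plain", id="1")]
out = list(run_formatter(docs, toy_fix))
print([d.text for d in out])
```

Because the function is a generator, documents are repaired one at a time as the downstream step consumes them, rather than being collected in memory first.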
Usage Examples
Basic Usage
from datatrove.pipeline.formatters.ftfy import FTFYFormatter
# Use default settings (encoding repair enabled, strict normalization disabled)
formatter = FTFYFormatter()
# Enable additional normalization for specific use cases
strict_formatter = FTFYFormatter(
    fix_latin_ligatures=True,
    uncurl_quotes=True,
    normalization="NFC",
)