Implementation:Huggingface Datatrove FtfyFormatter

Knowledge Sources	Huggingface_Datatrove
Domains	Data Processing, Text Formatting, Text Encoding
Last Updated	2026-02-14 17:00 GMT

Overview

FTFYFormatter is a text formatter that uses the ftfy (fixes text for you) library to repair encoding errors and mojibake in document text, while intentionally avoiding strict normalization that would reduce character diversity.

Description

FTFYFormatter extends BaseFormatter to apply the ftfy library's text repair capabilities to each document in the pipeline. The formatter is configured with a deliberate philosophy: fixing unreadable or wrong encoding is good, but enforcing a specific or strict formatting is not. This reflects the goal of training language models that can recognize a wide variety of characters and formats, rather than mapping everything to a narrow canonical form.
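To make the distinction concrete, the sketch below uses only the Python standard library (not ftfy) to contrast a reversible encoding error with a lossy normalization. The helper names are hypothetical, and the single Windows-1252 round trip is far cruder than ftfy's detection heuristics; it only illustrates the two categories of change.

```python
import unicodedata


def make_mojibake(text: str) -> str:
    """Simulate a common encoding error: UTF-8 bytes misread as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def undo_mojibake(text: str) -> str:
    """Reverse that specific error. This sketch assumes the error is present;
    a real repair library has to detect when the reversal is appropriate."""
    return text.encode("cp1252").decode("utf-8")


original = "caf\u00e9"                 # "café"
broken = make_mojibake(original)       # unreadable text: worth repairing
repaired = undo_mojibake(broken)       # the original text is fully recovered

ligature = "\ufb01ne"                  # "fine" written with the one-character fi ligature
# Compatibility normalization replaces the ligature with the letters "f" and "i",
# so the output no longer contains a character the input had.
flattened = unicodedata.normalize("NFKC", ligature)
```

The first transformation restores information that was damaged; the second discards a valid character, which is the kind of change the formatter's defaults avoid.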

The formatter enables the encoding-related fixes by default: fix_encoding, restore_byte_a0, replace_lossy_sequences, decode_inconsistent_utf8, fix_c1_controls, fix_surrogates, and remove_control_chars. It also enables two display-related fixes, unescape_html and remove_terminal_escapes. However, it explicitly disables the normalization-oriented options, which ftfy itself enables by default: fix_latin_ligatures (would reduce ligature diversity), fix_character_width (would enforce the wrong punctuation width for CJK text), uncurl_quotes (would prevent models from learning curly quotes), and fix_line_breaks (a borderline case, left off). Unicode normalization is also off: normalization defaults to None, so none of NFC, NFD, NFKC, or NFKD is applied.

The ftfy TextFixerConfig is built once, in __init__, from the configured options; TextFixerConfig itself is imported there as well. The format method then imports the ftfy module lazily and simply calls ftfy.fix_text with the stored config. Because both imports live inside methods, importing the formatter module does not by itself require ftfy to be installed.

Usage

Use FTFYFormatter as an early-stage text repair step in data processing pipelines to fix encoding issues in web-crawled or heterogeneous text corpora. It is typically placed before other formatting or filtering steps so that downstream processing operates on cleanly encoded text.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/pipeline/formatters/ftfy.py
Lines: 1-61

Signature

class FTFYFormatter(BaseFormatter):
    name = "😎 FTFY"
    _requires_dependencies = ["ftfy"]

    def __init__(
        self,
        unescape_html: str | bool = "auto",
        remove_terminal_escapes: bool = True,
        fix_encoding: bool = True,
        restore_byte_a0: bool = True,
        replace_lossy_sequences: bool = True,
        decode_inconsistent_utf8: bool = True,
        fix_c1_controls: bool = True,
        fix_latin_ligatures: bool = False,
        fix_character_width: bool = False,
        uncurl_quotes: bool = False,
        fix_line_breaks: bool = False,
        fix_surrogates: bool = True,
        remove_control_chars: bool = True,
        normalization: Literal["NFC", "NFD", "NFKC", "NFKD"] | None = None,
    ):
        ...

    def format(self, text: str) -> str:
        ...

Import

from datatrove.pipeline.formatters.ftfy import FTFYFormatter

I/O Contract

Inputs

Name	Type	Required	Description
unescape_html	str or bool	No	Whether to unescape HTML entities; "auto" unescapes them unless the text contains a literal "<", which suggests real HTML (default: "auto")
remove_terminal_escapes	bool	No	Remove terminal/ANSI escape sequences (default: True)
fix_encoding	bool	No	Fix mojibake and encoding errors (default: True)
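restore_byte_a0	bool	No	Allow a plain space to be read as byte 0xA0 (non-breaking space) when that makes a mojibake sequence decodable (default: True)
replace_lossy_sequences	bool	No	Replace mojibake that has been partly destroyed (for example by "?" or U+FFFD) with a single U+FFFD replacement character (default: True)
decode_inconsistent_utf8	bool	No	Decode sequences that clearly look like UTF-8 mojibake even when the whole string has no single consistent encoding (default: True)
fix_c1_controls	bool	No	Replace C1 control characters with their Windows-1252 equivalents (default: True)
fix_latin_ligatures	bool	No	Decompose Latin ligatures such as "ﬁ" into their component letters (default: False)
fix_character_width	bool	No	Convert fullwidth Latin characters and halfwidth Katakana to their standard widths (default: False)
uncurl_quotes	bool	No	Convert curly quotes to straight quotes (default: False)
fix_line_breaks	bool	No	Convert other line break characters to "\n" (default: False)
fix_surrogates	bool	No	Replace properly paired surrogate code points with the character they represent (default: True)
remove_control_chars	bool	No	Remove control characters that are unlikely to be useful in text (default: True)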
normalization	str or None	No	Unicode normalization form: NFC, NFD, NFKC, NFKD, or None (default: None)

Outputs

Name	Type	Description
data	DocumentsPipeline (generator)	Yields all documents with encoding-repaired text
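The output contract (every document is yielded, with only its text rewritten) can be sketched with standard-library stand-ins. Doc, apply_formatter, and toy_repair below are hypothetical names for illustration; they are not datatrove classes, and toy_repair is a crude placeholder for the ftfy-based repair.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Doc:
    """Hypothetical stand-in for a pipeline document; only the text field matters here."""
    text: str
    id: str = ""


def apply_formatter(docs: Iterable[Doc], format_fn: Callable[[str], str]) -> Iterator[Doc]:
    """Rewrite each document's text in place and yield every document, dropping none."""
    for doc in docs:
        doc.text = format_fn(doc.text)
        yield doc


def toy_repair(text: str) -> str:
    """Placeholder repair: undo UTF-8 misread as Windows-1252 when that is possible."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not that kind of error: leave the text unchanged


docs = [Doc("caf\u00c3\u00a9", id="a"), Doc("plain ascii", id="b")]
fixed = list(apply_formatter(docs, toy_repair))
```

Because the function is a generator, documents are processed one at a time as the next stage consumes them, and a formatter never filters documents out.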

Usage Examples

Basic Usage

from datatrove.pipeline.formatters.ftfy import FTFYFormatter

# Use default settings (encoding repair enabled, strict normalization disabled)
formatter = FTFYFormatter()

# Enable additional normalization for specific use cases
strict_formatter = FTFYFormatter(
    fix_latin_ligatures=True,
    uncurl_quotes=True,
    normalization="NFC",
)
