Implementation:Huggingface Datatrove FtfyFormatter

From Leeroopedia
Knowledge Sources
Domains Data Processing, Text Formatting, Text Encoding
Last Updated 2026-02-14 17:00 GMT

Overview

FTFYFormatter is a text formatter that uses the ftfy (fixes text for you) library to repair encoding errors and mojibake in document text, while intentionally avoiding strict normalization that would reduce character diversity.

Description

FTFYFormatter extends BaseFormatter to apply the ftfy library's text repair capabilities to each document in the pipeline. The formatter is configured with a deliberate philosophy: fixing unreadable or wrong encoding is good, but enforcing a specific or strict formatting is not. This reflects the goal of training language models that can recognize a wide variety of characters and formats, rather than mapping everything to a narrow canonical form.

The formatter enables the encoding-related fixes by default: fix_encoding, restore_byte_a0, replace_lossy_sequences, decode_inconsistent_utf8, fix_c1_controls, fix_surrogates, and remove_control_chars. Two display-related fixes are also active: remove_terminal_escapes is enabled, and unescape_html defaults to "auto". The normalization-oriented options, however, are explicitly disabled: fix_latin_ligatures (would reduce ligature diversity), fix_character_width (would enforce the wrong punctuation width for CJK text), uncurl_quotes (would prevent models from learning curly quotes), fix_line_breaks (a borderline case), and Unicode normalization (normalization defaults to None, so none of NFC, NFD, NFKC, or NFKD is applied).
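The distinction between repairing encoding and enforcing normalization can be illustrated with the standard library alone. The sketch below does not call ftfy or FTFYFormatter; the sample strings and variable names are illustrative assumptions. It shows how mojibake arises and is reversed (the kind of repair fix_encoding is meant to automate), and how compatibility normalization such as NFKC folds distinct characters together, which is the loss of diversity the disabled options avoid.

```python
import unicodedata

# Mojibake: UTF-8 bytes decoded with the wrong codec (Latin-1 here).
original = "caf\u00e9"                                  # 'café'
garbled = original.encode("utf-8").decode("latin-1")    # 'cafÃ©'

# Reversing the wrong decode recovers the text. This manual round trip is
# only an illustration of the kind of damage that gets repaired.
repaired = garbled.encode("latin-1").decode("utf-8")

# Compatibility normalization maps distinct characters onto plain ASCII.
ligature = "\ufb01"                  # LATIN SMALL LIGATURE FI
fullwidth = "\uff21\uff22\uff23"     # FULLWIDTH 'ABC'
folded_ligature = unicodedata.normalize("NFKC", ligature)    # 'fi'
folded_width = unicodedata.normalize("NFKC", fullwidth)      # 'ABC'

# Canonical normalization (NFC) leaves the ligature untouched.
nfc_ligature = unicodedata.normalize("NFC", ligature)

print(garbled, repaired, folded_ligature, folded_width)
```

In the first half the repaired text is strictly better than the garbled one. In the second half the folded output is readable but no longer contains the original characters, which is why such steps are opt-in here.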

The ftfy TextFixerConfig is constructed once, at initialization time, from the configured options, and the format method simply calls ftfy.fix_text with that config. TextFixerConfig is imported and instantiated in __init__, while the ftfy module used for the actual fixing is imported lazily inside the format method.

Usage

Use FTFYFormatter as an early-stage text repair step in data processing pipelines to fix encoding issues in web-crawled or heterogeneous text corpora. It is typically placed before other formatting or filtering steps so that downstream processing operates on cleanly encoded text.

Code Reference

Source Location

Signature

class FTFYFormatter(BaseFormatter):
    name = "😎 FTFY"
    _requires_dependencies = ["ftfy"]

    def __init__(
        self,
        unescape_html: str | bool = "auto",
        remove_terminal_escapes: bool = True,
        fix_encoding: bool = True,
        restore_byte_a0: bool = True,
        replace_lossy_sequences: bool = True,
        decode_inconsistent_utf8: bool = True,
        fix_c1_controls: bool = True,
        fix_latin_ligatures: bool = False,
        fix_character_width: bool = False,
        uncurl_quotes: bool = False,
        fix_line_breaks: bool = False,
        fix_surrogates: bool = True,
        remove_control_chars: bool = True,
        normalization: Literal["NFC", "NFD", "NFKC", "NFKD"] | None = None,
    ):
        ...

    def format(self, text: str) -> str:
        ...

Import

from datatrove.pipeline.formatters.ftfy import FTFYFormatter

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| unescape_html | str or bool | No | Whether to unescape HTML entities (default: "auto") |
| remove_terminal_escapes | bool | No | Remove terminal/ANSI escape sequences (default: True) |
| fix_encoding | bool | No | Fix mojibake and encoding errors (default: True) |
| restore_byte_a0 | bool | No | Restore byte 0xA0 (non-breaking space) (default: True) |
| replace_lossy_sequences | bool | No | Replace lossy byte sequences (default: True) |
| decode_inconsistent_utf8 | bool | No | Decode inconsistent UTF-8 sequences (default: True) |
| fix_c1_controls | bool | No | Fix C1 control characters (default: True) |
| fix_latin_ligatures | bool | No | Decompose Latin ligatures (default: False) |
| fix_character_width | bool | No | Normalize character widths (default: False) |
| uncurl_quotes | bool | No | Convert curly quotes to straight quotes (default: False) |
| fix_line_breaks | bool | No | Normalize line break characters (default: False) |
| fix_surrogates | bool | No | Fix surrogate characters (default: True) |
| remove_control_chars | bool | No | Remove control characters (default: True) |
| normalization | str or None | No | Unicode normalization form: NFC, NFD, NFKC, NFKD, or None (default: None) |

Outputs

| Name | Type | Description |
|---|---|---|
| data | DocumentsPipeline (generator) | Yields all documents with encoding-repaired text |

Usage Examples

Basic Usage

from datatrove.pipeline.formatters.ftfy import FTFYFormatter

# Use default settings (encoding repair enabled, strict normalization disabled)
formatter = FTFYFormatter()

# Enable additional normalization for specific use cases
strict_formatter = FTFYFormatter(
    fix_latin_ligatures=True,
    uncurl_quotes=True,
    normalization="NFC",
)
