Implementation:Huggingface Datatrove SymbolLinesFormatter

From Leeroopedia
Knowledge Sources
Domains Data Processing, Text Formatting, Text Cleaning
Last Updated 2026-02-14 17:00 GMT

Overview

SymbolLinesFormatter is a text formatter that removes lines consisting exclusively of punctuation and symbol characters, cleaning up decorative separators and symbol-only noise from document text.

Description

SymbolLinesFormatter extends BaseFormatter to address a common problem in web-crawled and extracted text: lines that contain nothing but punctuation or symbol characters (e.g., "==========", "---***---", "######"). These lines are typically decorative separators, formatting artifacts, or noise that provides no semantic value and can interfere with downstream text processing.

The formatter processes text line-by-line. For each line, it checks whether the line is non-empty and consists entirely of characters from a configurable set of symbols (plus spaces). Lines that meet this criterion are considered "symbol lines" and are either removed entirely or replaced with a configurable replacement character. By default, the symbols_to_remove set is populated from Datatrove's PUNCTUATION_SET constant, and the replace_char is an empty string (meaning symbol lines are simply dropped).
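
The per-line check described above can be sketched in a few lines of plain Python. This is an illustration, not the Datatrove source: the helper name is_symbol_line is hypothetical, and string.punctuation is assumed as a stand-in for PUNCTUATION_SET, whose exact contents are not listed on this page.

```python
import string

# Assumption: string.punctuation stands in for Datatrove's PUNCTUATION_SET.
DEFAULT_SYMBOLS = frozenset(string.punctuation)


def is_symbol_line(line: str, symbols=DEFAULT_SYMBOLS) -> bool:
    """Return True if the line is non-blank and holds only symbols and spaces."""
    # Empty and whitespace-only lines are never symbol lines.
    if line.strip() == "":
        return False
    # Every character must be a configured symbol or a plain space.
    return all(c in symbols or c == " " for c in line)
```

Passing a smaller set, for example symbols={"=", "-"}, narrows what counts as a symbol line, which mirrors the role of the symbols_to_remove parameter.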

An important detail is the span collapsing behavior: consecutive symbol lines are treated as a single removed span. Only the first symbol line in a consecutive group triggers the optional replacement character; subsequent symbol lines in the same span are silently removed. This prevents a block of multiple separator lines from producing multiple replacement characters. Lines that consist entirely of whitespace are not treated as symbol lines and are preserved as-is.
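
The span collapsing rule is easiest to see in a small standalone sketch. The function below is a hypothetical reimplementation written from the description above, not the library code; it again assumes string.punctuation in place of PUNCTUATION_SET, and remove_symbol_lines is an illustrative name.

```python
import string

# Assumption: string.punctuation stands in for Datatrove's PUNCTUATION_SET.
SYMBOLS = frozenset(string.punctuation)


def remove_symbol_lines(text: str, replace_char: str = "", symbols=SYMBOLS) -> str:
    """Drop symbol-only lines, emitting replace_char once per consecutive run."""
    kept = []
    in_removed_span = False  # True while inside a run of symbol lines
    for line in text.splitlines():
        symbol_line = line.strip() != "" and all(
            c in symbols or c == " " for c in line
        )
        if symbol_line:
            # Only the first line of a run may emit the replacement.
            if not in_removed_span and replace_char != "":
                kept.append(replace_char)
            in_removed_span = True
        else:
            # Ordinary and whitespace-only lines are kept and end the run.
            kept.append(line)
            in_removed_span = False
    return "\n".join(kept)


sample = "Title\n=====\n-----\nBody\n\n***\nEnd"
dropped = remove_symbol_lines(sample)
marked = remove_symbol_lines(sample, replace_char="~")
```

In this sketch the two separator lines under "Title" form one span, so marked contains a single "~" there rather than two, and the blank line after "Body" is kept in both results.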

Usage

Use SymbolLinesFormatter to clean up documents that contain decorative line separators, repeated punctuation lines, or other symbol-only noise. It is particularly useful when processing text extracted from web pages, PDFs, or other formatted sources where visual separators are common.

Code Reference

Source Location

  • Repository: Huggingface_Datatrove
  • File: src/datatrove/pipeline/formatters/symbol_lines_remover.py
  • Lines: 1-36

Signature

class SymbolLinesFormatter(BaseFormatter):
    name = " ⚞ Symbol Lines Remover"

    def __init__(
        self,
        symbols_to_remove: list[str] | None = None,
        replace_char: str = "",
    ):
        ...

    def format(self, text: str) -> str:
        ...

Import

from datatrove.pipeline.formatters.symbol_lines_remover import SymbolLinesFormatter

I/O Contract

Inputs

Name | Type | Required | Description
symbols_to_remove | list[str] or None | No | Set of characters to treat as symbols; defaults to PUNCTUATION_SET
replace_char | str | No | Character to replace symbol line spans with; empty string to remove entirely (default: "")

Outputs

Name | Type | Description
data | DocumentsPipeline (generator) | Yields all documents with symbol-only lines removed or replaced

Usage Examples

Basic Usage

from datatrove.pipeline.formatters.symbol_lines_remover import SymbolLinesFormatter

# Remove all punctuation-only lines (default behavior)
formatter = SymbolLinesFormatter()

# Replace each run of consecutive symbol lines with a single newline (paragraph break) instead of dropping it
paragraph_formatter = SymbolLinesFormatter(replace_char="\n")

# Custom set of symbols to detect
custom_formatter = SymbolLinesFormatter(symbols_to_remove=["=", "-", "*", "#"])
