Implementation:Huggingface Datatrove SymbolLinesFormatter
| Knowledge Sources | |
|---|---|
| Domains | Data Processing, Text Formatting, Text Cleaning |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
SymbolLinesFormatter is a text formatter that removes lines consisting exclusively of punctuation and symbol characters, cleaning up decorative separators and symbol-only noise from document text.
Description
SymbolLinesFormatter extends BaseFormatter to address a common problem in web-crawled and extracted text: lines that contain nothing but punctuation or symbol characters (e.g., "==========", "---***---", "######"). These lines are typically decorative separators, formatting artifacts, or noise that provides no semantic value and can interfere with downstream text processing.
The formatter processes text line-by-line. For each line, it checks whether the line is non-empty and consists entirely of characters from a configurable set of symbols (plus spaces). Lines that meet this criterion are considered "symbol lines" and are either removed entirely or replaced with a configurable replacement character. By default, the symbols_to_remove set is populated from Datatrove's PUNCTUATION_SET constant, and the replace_char is an empty string (meaning symbol lines are simply dropped).
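The per-line check described above can be sketched as a small predicate. This is an illustration only, not Datatrove's code: `is_symbol_line` and `SYMBOLS` are hypothetical names, and `string.punctuation` is used as an assumed stand-in for `PUNCTUATION_SET`, whose actual contents may differ.

```python
import string

# Assumption: ASCII punctuation stands in for Datatrove's PUNCTUATION_SET.
SYMBOLS = frozenset(string.punctuation)


def is_symbol_line(line: str, symbols: frozenset = SYMBOLS) -> bool:
    """Return True if the line contains only symbols and spaces, with at least one symbol."""
    if not line.strip():
        # Empty and whitespace-only lines are never symbol lines.
        return False
    return all(ch in symbols or ch == " " for ch in line)
```

Under these assumptions, `is_symbol_line("---***---")` is `True`, while `is_symbol_line("Hello, world!")` and `is_symbol_line("   ")` are both `False`.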
An important detail is the span collapsing behavior: consecutive symbol lines are treated as a single removed span. Only the first symbol line in a consecutive group triggers the optional replacement character; subsequent symbol lines in the same span are silently removed. This prevents a block of multiple separator lines from producing multiple replacement characters. Lines that consist entirely of whitespace are not treated as symbol lines and are preserved as-is.
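The span collapsing can be sketched with a single flag that records whether the previous line was a symbol line. This is a minimal sketch under stated assumptions, not the library's implementation: `remove_symbol_lines` is a hypothetical name, `string.punctuation` stands in for `PUNCTUATION_SET`, and the sketch assumes that any non-symbol line (including a blank one) ends the current span.

```python
import string

# Assumption: ASCII punctuation stands in for Datatrove's PUNCTUATION_SET.
SYMBOLS = frozenset(string.punctuation)


def remove_symbol_lines(text: str, replace_char: str = "") -> str:
    """Drop symbol-only lines, emitting replace_char once per consecutive span."""
    kept = []
    in_span = False  # True while we are inside a run of symbol lines
    for line in text.split("\n"):
        is_symbol = bool(line.strip()) and all(
            ch in SYMBOLS or ch == " " for ch in line
        )
        if is_symbol:
            # Only the first line of a span may emit the replacement.
            if not in_span and replace_char:
                kept.append(replace_char)
            in_span = True
        else:
            # Ordinary and whitespace-only lines are kept unchanged.
            kept.append(line)
            in_span = False
    return "\n".join(kept)


sample = "Title\n=====\n-----\nBody\n\nEnd"
print(repr(remove_symbol_lines(sample)))       # 'Title\nBody\n\nEnd'
print(repr(remove_symbol_lines(sample, "¶")))  # 'Title\n¶\nBody\n\nEnd'
```

The two separator lines form one span, so the second call emits a single `¶` rather than two, and the blank line before `End` survives in both cases.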
Usage
Use SymbolLinesFormatter to clean up documents that contain decorative line separators, repeated punctuation lines, or other symbol-only noise. It is particularly useful when processing text extracted from web pages, PDFs, or other formatted sources where visual separators are common.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/formatters/symbol_lines_remover.py
- Lines: 1-36
Signature
```python
class SymbolLinesFormatter(BaseFormatter):
    name = " ⚞ Symbol Lines Remover"

    def __init__(
        self,
        symbols_to_remove: list[str] | None = None,
        replace_char: str = "",
    ):
        ...

    def format(self, text: str) -> str:
        ...
```
Import
```python
from datatrove.pipeline.formatters.symbol_lines_remover import SymbolLinesFormatter
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| symbols_to_remove | list[str] or None | No | Characters to treat as symbols; when None, the contents of PUNCTUATION_SET are used (default: None) |
| replace_char | str | No | String emitted once for each span of consecutive symbol lines; an empty string removes the span entirely (default: "") |
Outputs
| Name | Type | Description |
|---|---|---|
| data | DocumentsPipeline (generator) | Yields every input document with symbol-only lines in its text removed or replaced; format(text) itself returns the cleaned string |
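The relationship between the string-level `format` method and the document-level generator output can be sketched as follows. Everything here is hypothetical and for illustration: `Doc` is a minimal stand-in rather than Datatrove's document class, `apply_formatter` only assumes that a formatter step rewrites each document's text and yields the document, and `drop_equals_lines` is a toy formatter.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Doc:
    """Hypothetical minimal document; not Datatrove's Document class."""
    id: str
    text: str


def apply_formatter(docs: Iterable[Doc], fmt: Callable[[str], str]) -> Iterator[Doc]:
    """Lazily rewrite each document's text and yield every document."""
    for doc in docs:
        doc.text = fmt(doc.text)
        yield doc


def drop_equals_lines(text: str) -> str:
    """Toy formatter: drops lines made only of '=' characters."""
    return "\n".join(
        line for line in text.split("\n") if not (line and set(line) == {"="})
    )


docs = [Doc("a", "Head\n====\nBody"), Doc("b", "plain")]
for doc in apply_formatter(docs, drop_equals_lines):
    print(doc.id, repr(doc.text))
```

No document is filtered out in this sketch: the step changes text only, which matches the "yields all documents" contract in the table above.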
Usage Examples
Basic Usage
```python
from datatrove.pipeline.formatters.symbol_lines_remover import SymbolLinesFormatter

# Remove all punctuation-only lines (default behavior)
formatter = SymbolLinesFormatter()

# Replace symbol lines with a newline (paragraph break) instead of removing them
paragraph_formatter = SymbolLinesFormatter(replace_char="\n")

# Custom set of symbols to detect
custom_formatter = SymbolLinesFormatter(symbols_to_remove=["=", "-", "*", "#"])
```