Implementation:NVIDIA NeMo Curator MarkdownRemover

From Leeroopedia
Knowledge Sources
Domains: Data_Curation, Text_Cleaning, Markdown_Processing
Last Updated: 2026-02-14 00:00 GMT

Overview

MarkdownRemover is a document modifier that strips Markdown formatting syntax from text. It keeps the inner text of bold, italic, and underline markup and replaces each link with its URL.

Description

MarkdownRemover extends DocumentModifier and uses regular expression substitutions to remove four types of Markdown formatting from each line of a document:

  • Bold: **text** is replaced with text (regex: \*\*(.*?)\*\*)
  • Italic: *text* is replaced with text (regex: \*(.*?)\*)
  • Underline: _text_ is replaced with text (regex: _(.*?)_)
  • Links: [text](url) is replaced with url (regex: \[.*?\]\((.*?)\))

The regexes are applied in a fixed order: bold, italic, underline, then links. Bold runs before italic because both use *, so ** pairs are consumed as bold before the single-* italic pattern sees them. Each regex uses non-greedy matching (.*?), so a match ends at the nearest closing delimiter and several marked-up spans on one line are handled separately. For links, the URL is preserved while the display text and the surrounding brackets and parentheses are removed. Processing is line by line: the text is split on newlines, each line is passed through all four substitutions, and the lines are rejoined.
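The procedure described above can be sketched with Python's standard re module. This is a minimal standalone illustration, not the NeMo Curator source: the names strip_markdown and PATTERNS_IN_ORDER are hypothetical, and the sketch assumes each substitution keeps only the first capture group.

```python
import re

# Patterns copied from the module-level constants documented on this page.
MARKDOWN_BOLD_REGEX = r"\*\*(.*?)\*\*"
MARKDOWN_ITALIC_REGEX = r"\*(.*?)\*"
MARKDOWN_UNDERLINE_REGEX = r"_(.*?)_"
MARKDOWN_LINK_REGEX = r"\[.*?\]\((.*?)\)"

# Hypothetical name: the documented order (bold, italic, underline, link).
PATTERNS_IN_ORDER = [
    MARKDOWN_BOLD_REGEX,
    MARKDOWN_ITALIC_REGEX,
    MARKDOWN_UNDERLINE_REGEX,
    MARKDOWN_LINK_REGEX,
]


def strip_markdown(text: str) -> str:
    """Hypothetical stand-in for modify_document(); an assumption, not library code."""
    cleaned_lines = []
    for line in text.split("\n"):  # process each line independently
        for pattern in PATTERNS_IN_ORDER:
            # Replace the whole match with its first capture group.
            line = re.sub(pattern, r"\1", line)
        cleaned_lines.append(line)
    return "\n".join(cleaned_lines)


sample = "# Title\nThis is **bold**, *italic*, and _underlined_.\nSee [docs](https://docs.example.com)."
print(strip_markdown(sample))
```

Keeping the patterns in an ordered list makes the application order explicit, which is the one part of the procedure that affects the result.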

Usage

Use MarkdownRemover when documents contain Markdown formatting and downstream stages require plain text without markup. This is common for web-scraped data and for content extracted from Markdown-based content management systems.

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/text/modifiers/markdown_remover.py
  • Lines: 1-44

Signature

class MarkdownRemover(DocumentModifier):
    def __init__(self):
        ...

    def modify_document(self, text: str) -> str:
        ...

Module-Level Constants

MARKDOWN_BOLD_REGEX = r"\*\*(.*?)\*\*"
MARKDOWN_ITALIC_REGEX = r"\*(.*?)\*"
MARKDOWN_UNDERLINE_REGEX = r"_(.*?)_"
MARKDOWN_LINK_REGEX = r"\[.*?\]\((.*?)\)"
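These patterns are purely textual and have no notion of Markdown context, so paired underscores in identifiers or URLs are treated as markup as well. The sketch below shows this using only the re module and the constants as listed; the helper names strip_underline and strip_underline_then_link are hypothetical and are not part of NeMo Curator.

```python
import re

# Patterns copied from the constants listed above.
MARKDOWN_UNDERLINE_REGEX = r"_(.*?)_"
MARKDOWN_LINK_REGEX = r"\[.*?\]\((.*?)\)"


def strip_underline(line: str) -> str:
    # Hypothetical helper: applies only the underline substitution.
    return re.sub(MARKDOWN_UNDERLINE_REGEX, r"\1", line)


def strip_underline_then_link(line: str) -> str:
    # Hypothetical helper: underline first, then link, as in the documented order.
    return re.sub(MARKDOWN_LINK_REGEX, r"\1", strip_underline(line))


# A single underscore has no partner, so nothing matches.
single = strip_underline("snake_case")
# Two underscores in one identifier look like _..._ markup and are removed.
paired = strip_underline("call my_helper_fn here")
# Adjacent underscores match with an empty capture group.
dunder = strip_underline("__init__")
# Underscores inside a link URL are stripped before the link pattern runs.
url = strip_underline_then_link("[a](https://example.com/a_b_c)")

for value in (single, paired, dunder, url):
    print(value)
```

If a corpus contains code snippets or URLs with underscores, it may be worth spot-checking a sample of modified documents before applying the modifier at scale.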

Import

from nemo_curator.stages.text.modifiers.markdown_remover import MarkdownRemover

I/O Contract

Inputs

Name | Type | Required | Description
text | str | Yes | The document text containing Markdown formatting, passed to modify_document().

Outputs

Name | Type | Description
return value | str | The document text with Markdown formatting removed. Bold, italic, and underline markers are stripped, and link syntax is replaced with the URL.

Usage Examples

Basic Usage

from nemo_curator.stages.text.modifiers.markdown_remover import MarkdownRemover

modifier = MarkdownRemover()

text = "This is **bold** and *italic* text with a [link](https://example.com)."
result = modifier.modify_document(text)
# Returns "This is bold and italic text with a https://example.com."

Multi-Line Document

from nemo_curator.stages.text.modifiers.markdown_remover import MarkdownRemover

modifier = MarkdownRemover()

text = """# Heading
This has **bold** words.
And _underlined_ text too.
See [docs](https://docs.example.com) for details."""

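result = modifier.modify_document(text)
# Each line is processed independently. Given the four regexes listed above, result is:
#   # Heading
#   This has bold words.
#   And underlined text too.
#   See https://docs.example.com for details.
# Note: heading markers (#) are NOT removed by this modifier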
