Implementation:NVIDIA NeMo Curator MarkdownRemover
| Knowledge Sources | |
|---|---|
| Domains | Data_Curation, Text_Cleaning, Markdown_Processing |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
MarkdownRemover is a document modifier that strips Markdown formatting syntax from text. It keeps the inner text of bold, italic, and underline spans and replaces each link with its URL.
Description
MarkdownRemover extends DocumentModifier and uses regular expression substitutions to remove four types of Markdown formatting from each line of a document:
- Bold: **text** is replaced with text (regex: \*\*(.*?)\*\*)
- Italic: *text* is replaced with text (regex: \*(.*?)\*)
- Underline: _text_ is replaced with text (regex: _(.*?)_)
- Links: [text](url) is replaced with url (regex: \[.*?\]\((.*?)\))
The regexes are applied in the order listed, with bold before italic so that the overlapping * syntax is handled correctly. Each regex uses non-greedy matching (.*?), so a match ends at the nearest closing delimiter and two formatted spans on the same line are handled separately. For links, the URL is preserved while the display text and the surrounding brackets and parentheses are removed. Processing is done line by line: the text is split on newlines, each line is passed through all four substitutions, and the lines are rejoined with newlines. As a result, markup that opens on one line and closes on another is left unchanged.
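The steps above can be sketched with the standard re module. This is a minimal stand-alone approximation, not the library source: the function name strip_markdown is hypothetical, and the sketch assumes each pattern is substituted with its first capture group (\1), which is consistent with the replacements described in the list.

```python
import re

# Patterns as listed on this page.
MARKDOWN_BOLD_REGEX = r"\*\*(.*?)\*\*"
MARKDOWN_ITALIC_REGEX = r"\*(.*?)\*"
MARKDOWN_UNDERLINE_REGEX = r"_(.*?)_"
MARKDOWN_LINK_REGEX = r"\[.*?\]\((.*?)\)"

# Applied in this order: bold before italic, then underline, then links.
_PATTERNS = (
    MARKDOWN_BOLD_REGEX,
    MARKDOWN_ITALIC_REGEX,
    MARKDOWN_UNDERLINE_REGEX,
    MARKDOWN_LINK_REGEX,
)


def strip_markdown(text: str) -> str:
    """Hypothetical helper mirroring the described line-by-line behavior."""
    cleaned = []
    for line in text.split("\n"):
        for pattern in _PATTERNS:
            # Assumption: keep only the first capture group of each match.
            line = re.sub(pattern, r"\1", line)
        cleaned.append(line)
    return "\n".join(cleaned)


print(strip_markdown("**bold** and *italic*\nsee [docs](https://example.com)"))
# bold and italic
# see https://example.com
```

Keeping the patterns in an ordered tuple makes the application order explicit, which matters here because the bold and italic patterns share the * delimiter.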
Usage
Use MarkdownRemover when documents contain Markdown formatting and downstream stages require plain text without formatting markup. This is common in web-scraped data or content extracted from Markdown-based content management systems.
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/text/modifiers/markdown_remover.py
- Lines: 1-44
Signature
class MarkdownRemover(DocumentModifier):
    def __init__(self):
        ...
    def modify_document(self, text: str) -> str:
        ...
Module-Level Constants
MARKDOWN_BOLD_REGEX = r"\*\*(.*?)\*\*"
MARKDOWN_ITALIC_REGEX = r"\*(.*?)\*"
MARKDOWN_UNDERLINE_REGEX = r"_(.*?)_"
MARKDOWN_LINK_REGEX = r"\[.*?\]\((.*?)\)"
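To see how these patterns behave on their own, they can be run directly through re.sub. The snippet below is an illustrative sketch only: it assumes a \1 replacement, the sample strings are made up, and the greedy variant is shown purely for contrast and is not part of the module. It also shows that the underline pattern matches any pair of underscores on a line, so snake_case identifiers would lose their underscores.

```python
import re

MARKDOWN_BOLD_REGEX = r"\*\*(.*?)\*\*"
MARKDOWN_UNDERLINE_REGEX = r"_(.*?)_"

# Non-greedy matching: each bold span on the line is matched separately.
two_spans = re.sub(MARKDOWN_BOLD_REGEX, r"\1", "**a** and **b**")
print(two_spans)  # a and b

# For contrast, a greedy variant (not used by the module) matches from
# the first ** to the last ** and leaves the inner markers behind.
greedy = re.sub(r"\*\*(.*)\*\*", r"\1", "**a** and **b**")
print(greedy)  # a** and **b

# The underline pattern does not distinguish Markdown from identifiers.
snake = re.sub(MARKDOWN_UNDERLINE_REGEX, r"\1", "call my_helper_function here")
print(snake)  # call myhelperfunction here
```

If the input mixes prose with code identifiers or URLs that contain underscores, it may be worth checking a sample of the output before applying the modifier at scale.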
Import
from nemo_curator.stages.text.modifiers.markdown_remover import MarkdownRemover
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| text | str | Yes | The document text containing Markdown formatting, passed to modify_document(). |
Outputs
| Name | Type | Description |
|---|---|---|
| return value | str | The document text with Markdown formatting removed. Bold, italic, and underline markers are stripped, and link syntax is replaced with the URL. |
Usage Examples
Basic Usage
from nemo_curator.stages.text.modifiers.markdown_remover import MarkdownRemover
modifier = MarkdownRemover()
text = "This is **bold** and *italic* text with a [link](https://example.com)."
result = modifier.modify_document(text)
# Returns "This is bold and italic text with a https://example.com."
Multi-Line Document
from nemo_curator.stages.text.modifiers.markdown_remover import MarkdownRemover
modifier = MarkdownRemover()
text = """# Heading
This has **bold** words.
And _underlined_ text too.
See [docs](https://docs.example.com) for details."""
result = modifier.modify_document(text)
# Each line is processed independently
# Bold, underline, and link syntax are removed
# Note: heading markers (#) are NOT removed by this modifier
Related Pages
- NVIDIA_NeMo_Curator_DocumentModifier — Base class that this modifier extends
- NVIDIA_NeMo_Curator_UrlRemover — Complementary modifier that removes URLs entirely rather than preserving them
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base