Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Neuml Txtai HTMLToMarkdown

From Leeroopedia


Knowledge Sources
Domains Text_Processing, Data_Extraction
Last Updated 2026-02-09 17:00 GMT

Overview

HTMLToMarkdown is a pipeline that converts HTML content into clean, structured Markdown text with support for headings, blockquotes, lists, code blocks, tables, and inline formatting.

Description

The HTMLToMarkdown class inherits from Pipeline and uses BeautifulSoup to parse HTML and transform it into well-formatted Markdown. It intelligently locates the most relevant content node by searching for article, main, or body tags, filtering out scripts, styles, navigation, and other non-content elements. The pipeline handles Markdown formatting for headings (h1-h6), blockquotes, ordered and unordered lists, code blocks, tables, bold, italic, and links. It also extracts metadata (title and description) and supports optional paragraph spacing and section break modes.

Usage

Use HTMLToMarkdown when you need to extract readable text from HTML documents for downstream processing such as indexing, summarization, or RAG pipelines. It is particularly suited for converting web articles, documentation pages, and structured HTML into Markdown that preserves the semantic structure of the original content.

Code Reference

Source Location

Signature

class HTMLToMarkdown(Pipeline):
    def __init__(self, paragraphs=False, sections=False):
        """
        Create a new HTMLToMarkdown instance.

        Args:
            paragraphs: True if paragraph parsing enabled, False otherwise
            sections: True if section parsing enabled, False otherwise
        """

    def __call__(self, html):
        """
        Transforms input HTML into Markdown formatted text.

        Args:
            html: input html

        Returns:
            markdown formatted text
        """

Import

from txtai.pipeline import HTMLToMarkdown

I/O Contract

Inputs

Name Type Required Description
paragraphs bool No Enables paragraph-level spacing with double newlines after paragraphs and extra spacing after blockquotes and code blocks. Defaults to False.
sections bool No Enables section break mode using form feed characters (\f) at headings and page break nodes. Useful for splitting content into logical sections. Defaults to False.
html str Yes Raw HTML string to convert. The pipeline parses the HTML, finds the best content node, and transforms it to Markdown.

Outputs

Name Type Description
result str Markdown-formatted text extracted from the input HTML. Includes metadata (title, description) if present, followed by the converted content with proper Markdown syntax for headings, lists, tables, code blocks, and inline formatting.

Usage Examples

Basic Usage

from txtai.pipeline import HTMLToMarkdown

# Create pipeline
md = HTMLToMarkdown()

# Convert HTML to Markdown
html = """
<html>
<head><title>Sample Page</title></head>
<body>
<article>
    <h1>Introduction</h1>
    <p>This is a <b>bold</b> statement with an <a href="https://example.com">example link</a>.</p>
    <h2>Details</h2>
    <ul>
        <li>First item</li>
        <li>Second item</li>
    </ul>
    <blockquote>An important quote.</blockquote>
</article>
</body>
</html>
"""

result = md(html)
print(result)

Section Parsing

from txtai.pipeline import HTMLToMarkdown

# Enable section breaks for splitting content by headings
md = HTMLToMarkdown(sections=True, paragraphs=True)

html = """
<article>
    <h1>Chapter 1</h1>
    <p>Content of chapter one.</p>
    <h1>Chapter 2</h1>
    <p>Content of chapter two.</p>
</article>
"""

result = md(html)
# Sections are separated by form feed characters (\f)
sections = result.split("\f")
for i, section in enumerate(sections):
    print(f"Section {i}: {section.strip()[:50]}")

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment