Implementation:Neuml Txtai HTMLToMarkdown

Knowledge Sources	Neuml_Txtai
Domains	Text_Processing, Data_Extraction
Last Updated	2026-02-09 17:00 GMT

Overview

HTMLToMarkdown is a pipeline that converts HTML content into clean, structured Markdown text with support for headings, blockquotes, lists, code blocks, tables, and inline formatting.

Description

The HTMLToMarkdown class inherits from Pipeline and uses BeautifulSoup to parse HTML and convert it to Markdown. It selects the content node by searching for article, main, or body tags, and filters out scripts, styles, navigation, and other non-content elements. The pipeline emits Markdown for headings (h1-h6), blockquotes, ordered and unordered lists, code blocks, tables, bold, italic, and links. It also extracts page metadata (title and description) and supports two optional modes: paragraph spacing and section breaks.
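
The content-selection idea described above can be illustrated with a small standalone sketch. This is not the txtai implementation: it swaps BeautifulSoup for the standard library's html.parser so it runs without extra dependencies, and the names ContentPicker, pick_content, CANDIDATES, SKIPPED, and VOID are hypothetical. It assumes a simple preference order (article, then main, then body) and a fixed set of skipped tags.

```python
from html.parser import HTMLParser

# Hypothetical names for this sketch; txtai itself parses with BeautifulSoup.
CANDIDATES = ("article", "main", "body")  # preferred content containers, in order
SKIPPED = {"script", "style", "nav"}      # subtrees treated as non-content
VOID = {"br", "hr", "img", "input", "link", "meta"}  # tags without a closing tag


class ContentPicker(HTMLParser):
    """Collects visible text for each candidate container."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.text = {name: [] for name in CANDIDATES}

    def handle_starttag(self, tag, attrs):
        # Void tags never close, so they are kept off the open-tag stack
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Ignore text that sits inside a skipped subtree
        if SKIPPED.intersection(self.stack):
            return
        for name in CANDIDATES:
            if name in self.stack:
                self.text[name].append(data)


def pick_content(html):
    """Return (container name, text) for the first candidate with text."""
    picker = ContentPicker()
    picker.feed(html)
    picker.close()
    for name in CANDIDATES:
        text = " ".join(" ".join(picker.text[name]).split())
        if text:
            return name, text
    return None, ""
```

Calling pick_content on a page that has both a nav and an article returns the article text only; a page without article or main falls back to the body text.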

Usage

Use HTMLToMarkdown when you need to extract readable text from HTML documents for downstream processing such as indexing, summarization, or RAG pipelines. It is particularly suited for converting web articles, documentation pages, and structured HTML into Markdown that preserves the semantic structure of the original content.

Code Reference

Source Location

Repository: Neuml_Txtai
File: src/python/txtai/pipeline/data/htmltomd.py
Lines: 1-414

Signature

class HTMLToMarkdown(Pipeline):
    def __init__(self, paragraphs=False, sections=False):
        """
        Create a new HTMLToMarkdown instance.

        Args:
            paragraphs: True if paragraph parsing enabled, False otherwise
            sections: True if section parsing enabled, False otherwise
        """

    def __call__(self, html):
        """
        Transforms input HTML into Markdown formatted text.

        Args:
            html: input html

        Returns:
            markdown formatted text
        """

Import

from txtai.pipeline import HTMLToMarkdown

I/O Contract

Inputs

Name	Type	Required	Description
paragraphs	bool	No	Enables paragraph-level spacing with double newlines after paragraphs and extra spacing after blockquotes and code blocks. Defaults to False.
sections	bool	No	Enables section break mode using form feed characters (`\f`) at headings and page break nodes. Useful for splitting content into logical sections. Defaults to False.
html	str	Yes	Raw HTML string to convert. The pipeline parses the HTML, finds the best content node, and transforms it to Markdown.
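
As a rough illustration of what the two flags mean for a downstream consumer, the sketch below splits a Markdown string into sections on form feeds and into paragraphs on blank lines. The helper name split_output is hypothetical, and the exact placement of separators in real pipeline output is an assumption here, so the helper tolerates leading, trailing, and repeated separators.

```python
def split_output(markdown):
    """Split Markdown text into sections, each a list of paragraphs.

    Assumes sections are delimited by form feeds (sections=True) and
    paragraphs by blank lines (paragraphs=True).
    """
    sections = []
    for section in markdown.split("\f"):
        # Paragraphs are separated by a blank line; drop empty fragments
        paragraphs = [p.strip() for p in section.split("\n\n") if p.strip()]
        if paragraphs:
            sections.append(paragraphs)
    return sections
```

With both flags left at False, the same helper returns a single section, since the text then contains no form feeds.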

Outputs

Name	Type	Description
result	str	Markdown-formatted text extracted from the input HTML. Includes metadata (title, description) if present, followed by the converted content with proper Markdown syntax for headings, lists, tables, code blocks, and inline formatting.

Usage Examples

Basic Usage

from txtai.pipeline import HTMLToMarkdown

# Create pipeline
md = HTMLToMarkdown()

# Convert HTML to Markdown
html = """
<html>
<head><title>Sample Page</title></head>
<body>
<article>
    <h1>Introduction</h1>
    <p>This is a <b>bold</b> statement with an <a href="https://example.com">example link</a>.</p>
    <h2>Details</h2>
    <ul>
        <li>First item</li>
        <li>Second item</li>
    </ul>
    <blockquote>An important quote.</blockquote>
</article>
</body>
</html>
"""

result = md(html)
print(result)

Section Parsing

from txtai.pipeline import HTMLToMarkdown

# Enable section breaks for splitting content by headings
md = HTMLToMarkdown(sections=True, paragraphs=True)

html = """
<article>
    <h1>Chapter 1</h1>
    <p>Content of chapter one.</p>
    <h1>Chapter 2</h1>
    <p>Content of chapter two.</p>
</article>
"""

result = md(html)
# Sections are separated by form feed characters (\f); skip empty pieces
sections = [section for section in result.split("\f") if section.strip()]
for i, section in enumerate(sections):
    print(f"Section {i}: {section.strip()[:50]}")
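
For indexing, it can help to turn the split sections into records with a title and a body. The sketch below is one possible approach, not part of txtai: section_records and HEADING are hypothetical names, and it assumes each section may start with an ATX-style heading line (one to six # characters followed by a space).

```python
import re

# Matches an ATX-style heading such as "## Part A" (assumed output style)
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def section_records(markdown):
    """Build one record per form-feed-delimited section."""
    records = []
    for section in markdown.split("\f"):
        section = section.strip()
        if not section:
            continue
        # The first line is treated as the title when it is a heading
        first, _, rest = section.partition("\n")
        match = HEADING.match(first)
        if match:
            records.append({
                "level": len(match.group(1)),
                "title": match.group(2).strip(),
                "text": rest.strip(),
            })
        else:
            # No heading: keep the whole section as untitled text
            records.append({"level": 0, "title": None, "text": section})
    return records
```

Each record can then be passed to an index with the title kept as metadata and the text as the searchable body.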

Related Pages

Principle:Neuml_Txtai_Content_Conversion
