Implementation:Neuml Txtai HTMLToMarkdown
| Knowledge Sources | |
|---|---|
| Domains | Text_Processing, Data_Extraction |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
HTMLToMarkdown is a pipeline that converts HTML content into clean, structured Markdown text with support for headings, blockquotes, lists, code blocks, tables, and inline formatting.
Description
The HTMLToMarkdown class inherits from Pipeline and uses BeautifulSoup to parse HTML and transform it into Markdown. It selects the most relevant content node by searching for article, main, or body tags, and filters out scripts, styles, navigation, and other non-content elements. The pipeline applies Markdown formatting to headings (h1-h6), blockquotes, ordered and unordered lists, code blocks, tables, bold, italic, and links. It also extracts metadata (title and description) and supports optional paragraph spacing and section break modes.
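The container search described above can be illustrated with a minimal sketch. This is not the library's code: it uses the standard library `html.parser` instead of BeautifulSoup, and the names `PREFERRED`, `TagCollector`, and `pick_container` are hypothetical helpers introduced only for this illustration. It assumes the preference order is article, then main, then body, as listed in the description.

```python
from html.parser import HTMLParser

# Assumed preference order, taken from the description: article, main, body
PREFERRED = ("article", "main", "body")


class TagCollector(HTMLParser):
    """Records which candidate container tags appear in a document."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase
        if tag in PREFERRED:
            self.seen.add(tag)


def pick_container(html):
    """Returns the first preferred tag name found in html, or None."""
    collector = TagCollector()
    collector.feed(html)
    return next((tag for tag in PREFERRED if tag in collector.seen), None)


page = "<html><body><nav>Menu</nav><main><p>Text</p></main></body></html>"
print(pick_container(page))
```

Here `main` is chosen over `body` because it appears earlier in the preference order, even though `body` encloses it in the document.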
Usage
Use HTMLToMarkdown when you need to extract readable text from HTML documents for downstream processing such as indexing, summarization, or RAG pipelines. It is particularly suited for converting web articles, documentation pages, and structured HTML into Markdown that preserves the semantic structure of the original content.
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: src/python/txtai/pipeline/data/htmltomd.py
- Lines: 1-414
Signature
class HTMLToMarkdown(Pipeline):
    def __init__(self, paragraphs=False, sections=False):
        """
        Create a new HTMLToMarkdown instance.

        Args:
            paragraphs: True if paragraph parsing enabled, False otherwise
            sections: True if section parsing enabled, False otherwise
        """

    def __call__(self, html):
        """
        Transforms input HTML into Markdown formatted text.

        Args:
            html: input html

        Returns:
            markdown formatted text
        """
Import
from txtai.pipeline import HTMLToMarkdown
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| paragraphs | bool | No | Enables paragraph-level spacing with double newlines after paragraphs and extra spacing after blockquotes and code blocks. Defaults to False. |
| sections | bool | No | Enables section break mode using form feed characters (\f) at headings and page break nodes. Useful for splitting content into logical sections. Defaults to False. |
| html | str | Yes | Raw HTML string to convert. The pipeline parses the HTML, finds the best content node, and transforms it to Markdown. |
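To make the effect of the sections flag concrete, the following sketch formats a single heading. It is an illustration under stated assumptions, not the pipeline's implementation: `heading_to_markdown` is a hypothetical helper, and the use of a plain newline as the prefix when sections is disabled is an assumption. Only the form feed prefix in section mode comes from the table above.

```python
import re


def heading_to_markdown(tag, text, sections=False):
    """Formats a heading tag name such as 'h2' as a Markdown heading line."""
    match = re.fullmatch(r"h([1-6])", tag)
    if not match:
        raise ValueError(f"not a heading tag: {tag}")

    # Section mode marks the heading with a form feed.
    # The newline used otherwise is an assumption made for this sketch.
    prefix = "\f" if sections else "\n"
    level = "#" * int(match.group(1))
    return f"{prefix}{level} {text.strip()}"


print(repr(heading_to_markdown("h2", "Details")))
print(repr(heading_to_markdown("h1", "Chapter 1", sections=True)))
```

Printing with `repr` makes the otherwise invisible `\f` and `\n` prefixes visible.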
Outputs
| Name | Type | Description |
|---|---|---|
| result | str | Markdown-formatted text extracted from the input HTML. Includes metadata (title, description) if present, followed by the converted content with proper Markdown syntax for headings, lists, tables, code blocks, and inline formatting. |
Usage Examples
Basic Usage
from txtai.pipeline import HTMLToMarkdown
# Create pipeline
md = HTMLToMarkdown()
# Convert HTML to Markdown
html = """
<html>
<head><title>Sample Page</title></head>
<body>
<article>
<h1>Introduction</h1>
<p>This is a <b>bold</b> statement with an <a href="https://example.com">example link</a>.</p>
<h2>Details</h2>
<ul>
<li>First item</li>
<li>Second item</li>
</ul>
<blockquote>An important quote.</blockquote>
</article>
</body>
</html>
"""
result = md(html)
print(result)
Section Parsing
from txtai.pipeline import HTMLToMarkdown
# Enable section breaks for splitting content by headings
md = HTMLToMarkdown(sections=True, paragraphs=True)
html = """
<article>
<h1>Chapter 1</h1>
<p>Content of chapter one.</p>
<h1>Chapter 2</h1>
<p>Content of chapter two.</p>
</article>
"""
result = md(html)
# Sections are separated by form feed characters (\f)
sections = result.split("\f")
for i, section in enumerate(sections):
    print(f"Section {i}: {section.strip()[:50]}")
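Splitting on `\f` can leave empty or whitespace-only entries, for example when a break falls at the very start of the text. The sketch below shows one way to post-process such a string. The `split_sections` helper is hypothetical, and `sample` is a hand-written stand-in that mimics form feed delimited Markdown; it is not actual pipeline output.

```python
def split_sections(markdown):
    """Splits form feed delimited text into stripped, non-empty sections."""
    return [part.strip() for part in markdown.split("\f") if part.strip()]


# Hand-written stand-in for section-mode output, not real HTMLToMarkdown output
sample = "\f# Chapter 1\n\nContent of chapter one.\n\n\f# Chapter 2\n\nContent of chapter two."

for section in split_sections(sample):
    # The first line of each section is treated as its heading
    title, _, body = section.partition("\n")
    print(title.lstrip("# "), "->", body.strip())
```

Filtering before further processing keeps empty sections from reaching an index or a summarizer.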