Implementation:Neuml Txtai FileToHTML
| Knowledge Sources | |
|---|---|
| Domains | Data Processing, Document Conversion, HTML |
| Last Updated | 2026-02-10 01:00 GMT |
Overview
A concrete txtai pipeline for converting document files to HTML.
Description
FileToHTML is a pipeline that converts document files (such as PDF, DOCX, and PPTX) into HTML. It supports two extraction backends: Apache Tika (a Java-based document parser) and Docling (a Python-based document converter). By default the pipeline selects the first available backend, checking Tika before Docling; a specific backend can also be requested by name. The Tika backend requires a Java runtime and skips plain text and HTML files, returning None for them. The Docling backend skips files it detects as HTML; for all other files it normalizes the output by wrapping content in body tags, removing bullets from list items, and adding spacing between paragraphs. The resulting HTML can be passed to the HTMLToMarkdown pipeline for text extraction.
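The selection rule described above can be sketched as a small pure function. This is an illustrative reading of the documented behavior, not txtai's source: `select_backend` and its two boolean parameters are hypothetical names introduced here, and the availability checks are assumed to have been done by the caller.

```python
from typing import Optional


def select_backend(
    requested: Optional[str],
    tika_available: bool,
    docling_available: bool,
) -> Optional[str]:
    """Hypothetical helper: resolve a backend name using the documented rule."""
    name = requested.lower() if requested else None

    if name == "available":
        # Tika is checked first, Docling second; None when neither is usable
        if tika_available:
            return "tika"
        if docling_available:
            return "docling"
        return None

    # An explicit request is passed through as-is; unknown names resolve to None.
    # (Assumption: availability of an explicitly requested backend is checked elsewhere.)
    return name if name in ("tika", "docling") else None
```

With both backends present, `select_backend("available", True, True)` resolves to `"tika"`; with only Docling installed it resolves to `"docling"`.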
Usage
Use FileToHTML when you need to extract content from document files such as PDFs, Word documents, PowerPoint presentations, or other office formats. This is the first step in a content extraction pipeline that can be followed by HTMLToMarkdown for converting the resulting HTML to clean text. It is useful for document indexing, content search, and document processing workflows.
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: src/python/txtai/pipeline/data/filetohtml.py
Signature
class FileToHTML(Pipeline):
    def __init__(self, backend="available")
    def __call__(self, path)
Import
from txtai.pipeline.data.filetohtml import FileToHTML
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| backend | str | No | Backend to use for content extraction. Supports "tika", "docling", or "available" (default). When "available", the first available backend is selected (Tika is preferred over Docling). |
| path | str | Yes | File path to the document to convert to HTML. |
Outputs
| Name | Type | Description |
|---|---|---|
| result | str or None | HTML content extracted from the file. Returns None if no backend is available, if the file is already plain text/HTML (Tika backend), or if the file is detected as HTML (Docling backend). |
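Because the result may be None, callers usually need a fallback. The sketch below shows one possible approach, assuming that a None result for a text or HTML file means the file can simply be read as-is. `html_or_raw` and `TEXT_SUFFIXES` are hypothetical names, and `convert` stands for any callable with the `path -> str or None` contract in the table above.

```python
from pathlib import Path
from typing import Callable, Optional

# Assumption: these extensions identify files a backend may skip as text/HTML
TEXT_SUFFIXES = {".txt", ".html", ".htm", ".xhtml"}


def html_or_raw(convert: Callable[[str], Optional[str]], path: str) -> Optional[str]:
    """Return the converter's HTML, or the file's own text for skipped text/HTML input."""
    html = convert(path)
    if html is not None:
        return html

    # None for a text/HTML file: the content is already usable, so read it directly
    if Path(path).suffix.lower() in TEXT_SUFFIXES:
        return Path(path).read_text(encoding="utf-8", errors="replace")

    # None for any other file type most likely means no backend was available
    return None
```

A `FileToHTML` instance can be passed as `convert`, since the pipeline is called with a single path argument. The suffix check is a deliberate simplification; it avoids reading binary formats such as PDF as text when no backend is installed.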
Usage Examples
from txtai.pipeline import FileToHTML
# Create a FileToHTML pipeline (auto-selects available backend)
converter = FileToHTML()
# Convert a PDF to HTML
html = converter("document.pdf")
# Use a specific backend
converter_tika = FileToHTML(backend="tika")
html = converter_tika("report.docx")
converter_docling = FileToHTML(backend="docling")
html = converter_docling("presentation.pptx")
# Chain with HTMLToMarkdown for full text extraction
from txtai.pipeline import HTMLToMarkdown
converter = FileToHTML()
md = HTMLToMarkdown()
html = converter("document.pdf")
if html:
    text = md(html)
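For indexing workflows, the same call pattern extends to a whole directory. The following is a minimal sketch under stated assumptions: `convert_directory` is a hypothetical helper, the default suffix list is only an example, and `convert` is any callable that takes a path and returns HTML or None (a `FileToHTML` instance fits that shape).

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


def convert_directory(
    convert: Callable[[str], Optional[str]],
    root: str,
    suffixes: Iterable[str] = (".pdf", ".docx", ".pptx"),
) -> Dict[str, str]:
    """Hypothetical helper: convert every matching file under root, skipping empty results."""
    wanted = {s.lower() for s in suffixes}
    results: Dict[str, str] = {}

    # Walk the tree in a stable order so repeated runs produce the same mapping
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in wanted:
            html = convert(str(path))
            # None (or empty output) means the file was skipped or no backend exists
            if html:
                results[str(path)] = html

    return results
```

The returned mapping of path to HTML can then be fed to HTMLToMarkdown one entry at a time, as in the chained example above.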