Implementation:Neuml Txtai FileToHTML

Knowledge Sources	Neuml_Txtai
Domains	Data Processing, Document Conversion, HTML
Last Updated	2026-02-10 01:00 GMT

Overview

A concrete txtai pipeline for converting document files to HTML.

Description

FileToHTML is a pipeline that converts document files (such as PDF, DOCX, PPTX, and other formats) into HTML content. It supports two extraction backends: Apache Tika (a Java-based document parser) and Docling (a Python-based document converter). By default, the pipeline selects the first available backend, with Tika preferred over Docling; a specific backend can also be requested by name. The Tika backend requires a Java runtime and skips plain text and HTML files, returning None for them. The Docling backend detects and skips HTML files, also returning None. For other files, it normalizes its output by wrapping the content in body tags, removing bullets from list items, and adding spacing between paragraphs. The resulting HTML can be passed to the HTMLToMarkdown pipeline for text extraction.
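The selection rule described above can be illustrated with a small standalone sketch. This is not txtai's code: the function name `select_backend` and the `tika_ok` / `docling_ok` flags are hypothetical stand-ins for whatever availability checks the real pipeline performs, and the handling of unrecognized backend names is an assumption.

```python
def select_backend(requested="available", tika_ok=False, docling_ok=False):
    """Return the backend name to use, or None when nothing fits.

    Hypothetical helper: the availability flags stand in for the real
    checks (for example, locating a Java runtime for Tika).
    """
    if requested == "available":
        # Tika is preferred over Docling when both are present
        if tika_ok:
            return "tika"
        if docling_ok:
            return "docling"
        return None

    # Assumption: an explicit request is honored as-is, anything else maps to None
    return requested if requested in ("tika", "docling") else None


print(select_backend("available", tika_ok=True, docling_ok=True))   # tika
print(select_backend("available", tika_ok=False, docling_ok=True))  # docling
print(select_backend("available"))                                  # None
```

The last call mirrors the case where no backend is installed, in which the pipeline has nothing to delegate to and a call returns None.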

Usage

Use FileToHTML when you need to extract content from document files such as PDFs, Word documents, PowerPoint presentations, or other office formats. It is typically the first step in a content extraction pipeline and can be followed by HTMLToMarkdown to convert the resulting HTML to clean text. Typical applications include document indexing, content search, and document processing workflows.

Code Reference

Source Location

Repository: Neuml_Txtai
File: src/python/txtai/pipeline/data/filetohtml.py

Signature

class FileToHTML(Pipeline):
    def __init__(self, backend="available")
    def __call__(self, path)

Import

from txtai.pipeline.data.filetohtml import FileToHTML

I/O Contract

Inputs

Name	Type	Required	Description
backend	str	No	Backend to use for content extraction. Supports "tika", "docling", or "available" (default). When "available", the first available backend is selected (Tika is preferred over Docling).
path	str	Yes	File path to the document to convert to HTML.

Outputs

Name	Type	Description
result	str or None	HTML content extracted from the file. Returns None if no backend is available, if the file is already plain text/HTML (Tika backend), or if the file is detected as HTML (Docling backend).
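Because the result can be None for several different reasons, callers usually need an explicit branch for it. The sketch below shows one possible way to handle that, assuming the caller already knows the skipped file is plain text or HTML and wants its raw contents. The helper name `html_or_raw` is hypothetical, and `converter` can be any callable that takes a path and returns str or None, such as a FileToHTML instance.

```python
from pathlib import Path


def html_or_raw(converter, path):
    """Return the converter output, or the file's own text when it yields None.

    Hypothetical helper, not part of txtai.
    """
    html = converter(path)
    if html is not None:
        return html

    # None can mean "already plain text/HTML" or "no backend available".
    # Reading the file directly only makes sense in the first case.
    return Path(path).read_text(encoding="utf-8", errors="replace")
```

With a real pipeline this might be called as `html_or_raw(FileToHTML(), "notes.html")`. If None could also mean that no backend is installed, it is safer to check for that condition separately rather than fall back to raw file contents.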

Usage Examples

from txtai.pipeline import FileToHTML

# Create a FileToHTML pipeline (auto-selects available backend)
converter = FileToHTML()

# Convert a PDF to HTML
html = converter("document.pdf")

# Use a specific backend
converter_tika = FileToHTML(backend="tika")
html = converter_tika("report.docx")

converter_docling = FileToHTML(backend="docling")
html = converter_docling("presentation.pptx")

# Chain with HTMLToMarkdown for full text extraction
from txtai.pipeline import HTMLToMarkdown

converter = FileToHTML()
md = HTMLToMarkdown()

html = converter("document.pdf")
if html:
    text = md(html)
