Implementation:Neuml Txtai FileToHTML

From Leeroopedia


Knowledge Sources
Domains Data Processing, Document Conversion, HTML
Last Updated 2026-02-10 01:00 GMT

Overview

FileToHTML is a concrete tool provided by txtai for converting files to HTML.

Description

FileToHTML is a pipeline that converts document files (PDF, DOCX, PPTX, and other formats) into HTML. It supports two extraction backends: Apache Tika, a Java-based document parser, and Docling, a Python-based document converter. By default the pipeline selects the first available backend, checking Tika before Docling; a specific backend can also be requested by name. The Tika backend requires a Java runtime and returns None for files that are already plain text or HTML. The Docling backend detects HTML files and skips them, also returning None; for other files it normalizes the output by wrapping the content in body tags, removing bullet characters from list items, and adding spacing between paragraphs. The resulting HTML can be passed to the HTMLToMarkdown pipeline for text extraction.
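The "first available backend" rule described above can be sketched in a few lines. This is an illustrative sketch only, not txtai's implementation: `PREFERENCE`, `detect_installed`, and `select_backend` are hypothetical names, and the availability check is reduced to "the Python package can be found", whereas the real Tika backend also needs a Java runtime.

```python
from importlib.util import find_spec

# Preference order stated on this page: Tika first, then Docling.
PREFERENCE = ("tika", "docling")


def detect_installed():
    """Assumption: a backend counts as installed if its Python package is found."""
    return {name for name in PREFERENCE if find_spec(name) is not None}


def select_backend(requested="available", installed=None):
    """Hypothetical helper: resolve a backend name, or None if nothing fits."""
    if installed is None:
        installed = detect_installed()
    if requested == "available":
        # Pick the first backend in preference order that is installed.
        return next((name for name in PREFERENCE if name in installed), None)
    # Assumption: an explicitly requested backend must also be installed.
    return requested if requested in installed else None


print(select_backend("available", installed={"docling"}))  # docling
```

Passing `installed` explicitly keeps the selection logic testable without either package present; how txtai itself reacts to an explicitly requested but missing backend is not stated on this page.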

Usage

Use FileToHTML to extract content from document files such as PDFs, Word documents, PowerPoint presentations, or other office formats. It is the first step in a content extraction pipeline and can be followed by HTMLToMarkdown to convert the resulting HTML to clean text. Typical uses are document indexing, content search, and document processing workflows.

Code Reference

Source Location

  • Repository: Neuml_Txtai
  • File: src/python/txtai/pipeline/data/filetohtml.py

Signature

class FileToHTML(Pipeline):
    def __init__(self, backend="available")
    def __call__(self, path)

Import

from txtai.pipeline.data.filetohtml import FileToHTML

I/O Contract

Inputs

Name | Type | Required | Description
backend | str | No | Backend to use for content extraction. Supports "tika", "docling", or "available" (default). When "available", the first available backend is selected (Tika is preferred over Docling).
path | str | Yes | File path to the document to convert to HTML.

Outputs

Name | Type | Description
result | str or None | HTML content extracted from the file. Returns None if no backend is available, if the file is already plain text/HTML (Tika backend), or if the file is detected as HTML (Docling backend).
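Because the result may be None, callers that process many files will usually want to separate converted files from skipped ones. The following is a minimal sketch under stated assumptions: `convert_all` is a hypothetical helper that accepts any callable with the call shape described above (path in, HTML string or None out), and `stub_converter` stands in for a real FileToHTML instance so the example runs without a backend installed.

```python
def convert_all(converter, paths):
    """Hypothetical helper: split paths into converted HTML and skipped files."""
    converted, skipped = {}, []
    for path in paths:
        html = converter(path)
        if html is None:
            # No backend available, or the backend skipped this file type.
            skipped.append(path)
        else:
            converted[path] = html
    return converted, skipped


def stub_converter(path):
    """Stand-in for FileToHTML: treats .txt and .html inputs as skipped."""
    if path.endswith((".txt", ".html")):
        return None
    return f"<html><body>{path}</body></html>"


converted, skipped = convert_all(stub_converter, ["a.pdf", "notes.txt", "b.docx"])
print(sorted(converted))  # ['a.pdf', 'b.docx']
print(skipped)            # ['notes.txt']
```

With a real pipeline, `stub_converter` would be replaced by a FileToHTML instance; the skipped list then shows which inputs need a different handling path, such as reading plain text files directly.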

Usage Examples

from txtai.pipeline import FileToHTML

# Create a FileToHTML pipeline (auto-selects available backend)
converter = FileToHTML()

# Convert a PDF to HTML
html = converter("document.pdf")

# Use a specific backend
converter_tika = FileToHTML(backend="tika")
html = converter_tika("report.docx")

converter_docling = FileToHTML(backend="docling")
html = converter_docling("presentation.pptx")

# Chain with HTMLToMarkdown for full text extraction
from txtai.pipeline import HTMLToMarkdown

converter = FileToHTML()
md = HTMLToMarkdown()

html = converter("document.pdf")
if html:
    text = md(html)
