Implementation:PrefectHQ Prefect Parse Article Task

From Leeroopedia


Metadata
Sources: Prefect
Domains: Web_Scraping
Last Updated: 2026-02-09 00:00 GMT

Overview

Concrete task for extracting article text from HTML using BeautifulSoup within a Prefect pipeline.

Description

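The parse_article task parses the HTML with BeautifulSoup using Python's built-in html.parser backend. It selects the main content container (the first article element, falling back to the first main element) and returns an empty string if neither exists. It then removes every pre and code element inside that container, including inline code, and collects the stripped text of h1, h2, h3, p, ul, ol, and li elements in document order. Headings are uppercased and framed above and below by a rule of 80 "=" characters; every other element's text is followed by a blank line. Because ul/ol containers and their li children are both matched, list text appears more than once in the output.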
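The container selection and code-stripping steps can be tried in isolation. The following is a minimal sketch, not the task itself: it assumes beautifulsoup4 is installed, and the sample HTML and the variable names (sample_html, container, texts) are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: no <article>, so <main> is used as the fallback.
sample_html = """
<html><body>
<nav><p>Site menu</p></nav>
<main>
  <h1>Intro</h1>
  <p>Run <code>pip install prefect</code> first.</p>
  <pre>print("hi")</pre>
</main>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Prefer <article>, fall back to <main>; the <nav> paragraph lies outside both.
container = soup.find("article") or soup.find("main")

# Remove block-level and inline code elements from the tree.
for node in container.find_all(["pre", "code"]):
    node.decompose()

# Collect the stripped text of the remaining headings and paragraphs.
texts = [el.get_text().strip() for el in container.find_all(["h1", "h2", "h3", "p"])]
print(texts)
```

Note that removing inline code leaves a gap in the surrounding sentence: the paragraph comes out as "Run  first." with two spaces where the code element used to be.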

Code Reference

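from bs4 import BeautifulSoup
from prefect import task


@task
def parse_article(html: str) -> str:
    """Extract article text from HTML, skipping code blocks."""
    soup = BeautifulSoup(html, "html.parser")
    # Prefer <article>; fall back to <main>.
    article = soup.find("article") or soup.find("main")
    if not article:
        return ""
    # Drop block-level and inline code so it does not end up in the text.
    for code in article.find_all(["pre", "code"]):
        code.decompose()
    content = []
    for elem in article.find_all(["h1", "h2", "h3", "p", "ul", "ol", "li"]):
        text = elem.get_text().strip()
        if not text:
            continue
        if elem.name.startswith("h"):
            # Headings: uppercased and framed by 80-character rules.
            content.extend(["\n" + "=" * 80, text.upper(), "=" * 80 + "\n"])
        else:
            # Paragraphs and list elements: text followed by a blank line.
            content.extend([text, ""])
    return "\n".join(content)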
  • Import: from prefect import task; from bs4 import BeautifulSoup

I/O Contract

Inputs

  • html (str, required) — Raw HTML string

Outputs

  • str — Extracted article text with formatted headings; empty string if no article/main found
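
To make the shape of the returned string concrete, the formatting rule can be reproduced without any HTML parsing. This is an illustrative sketch only: format_blocks is a hypothetical helper, and the (tag, text) pairs stand in for the elements the task would find.

```python
def format_blocks(blocks):
    """Format (tag_name, text) pairs the way the task formats parsed elements.

    Hypothetical stand-in for illustration; the real task walks a
    BeautifulSoup tree instead of a list of pairs.
    """
    lines = []
    for tag, text in blocks:
        text = text.strip()
        if not text:
            continue  # whitespace-only elements are skipped
        if tag.startswith("h"):
            # Heading: uppercase text framed by two 80-character rules.
            lines.extend(["\n" + "=" * 80, text.upper(), "=" * 80 + "\n"])
        else:
            # Any other element: its text, then a blank line.
            lines.extend([text, ""])
    return "\n".join(lines)


result = format_blocks([
    ("h1", "Getting Started"),
    ("p", "Install the package."),
    ("p", "   "),
])
print(result)
```

The result starts with a blank line, then a rule, GETTING STARTED, a second rule, a blank line, and finally the paragraph followed by a trailing newline.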
