Implementation:PrefectHQ Prefect Parse Article Task

Metadata
Sources	Prefect
Domains	Web_Scraping
Last Updated	2026-02-09 00:00 GMT

Overview

Concrete task for extracting article text from HTML using BeautifulSoup within a Prefect pipeline.

Description

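The parse_article task parses the HTML with BeautifulSoup (using the built-in "html.parser" backend) and selects the first article element or, if there is none, the first main element. If neither exists it returns an empty string. It then removes every pre and code element inside that container, including inline code within paragraphs, and collects the stripped text of each h1, h2, h3, p, ul, ol, and li element in document order. Headings are uppercased and framed by two 80-character "=" rules; all other elements are emitted as plain text followed by a blank line. Because ul/ol elements and their nested li elements are all matched, list-item text is emitted once for the list and again for each item.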

Code Reference

Repository: https://github.com/PrefectHQ/prefect
File: examples/simple_web_scraper.py (L55-83)
Signature:

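@task
def parse_article(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Prefer <article>; fall back to <main>
    article = soup.find("article") or soup.find("main")
    if not article:
        return ""
    # Drop code blocks and inline code so only prose remains
    for code in article.find_all(["pre", "code"]):
        code.decompose()
    content = []
    # Lists and their items are both matched, so list text is emitted more than once
    for elem in article.find_all(["h1", "h2", "h3", "p", "ul", "ol", "li"]):
        text = elem.get_text().strip()
        if not text:
            continue
        if elem.name.startswith("h"):
            # Headings: uppercase text framed by two 80-character rules
            content.extend(["\n" + "=" * 80, text.upper(), "=" * 80 + "\n"])
        else:
            # Body text: one entry followed by a blank line
            content.extend([text, ""])
    return "\n".join(content)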

Import: from prefect import task; from bs4 import BeautifulSoup
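
As a minimal sketch of the behaviour of the code above, the snippet below copies the task body into a plain function and runs it on a small HTML sample. The name parse_article_plain and the sample markup are hypothetical and not part of the Prefect example; the sketch omits the @task decorator and assumes the beautifulsoup4 package is installed.

```python
from bs4 import BeautifulSoup


def parse_article_plain(html: str) -> str:
    """Hypothetical copy of the parse_article body, without the Prefect @task decorator."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("article") or soup.find("main")
    if not article:
        return ""
    # Remove code blocks and inline code before extracting text
    for code in article.find_all(["pre", "code"]):
        code.decompose()
    content = []
    for elem in article.find_all(["h1", "h2", "h3", "p", "ul", "ol", "li"]):
        text = elem.get_text().strip()
        if not text:
            continue
        if elem.name.startswith("h"):
            content.extend(["\n" + "=" * 80, text.upper(), "=" * 80 + "\n"])
        else:
            content.extend([text, ""])
    return "\n".join(content)


# Made-up sample page: only the <article> subtree should contribute text
SAMPLE_HTML = """
<html><body>
<nav><p>Site menu</p></nav>
<article>
<h1>Getting started</h1>
<p>Install the package.</p>
<pre>pip install example</pre>
<ul><li>First step</li></ul>
</article>
</body></html>
"""

result = parse_article_plain(SAMPLE_HTML)
no_container = parse_article_plain("<div><p>No article or main tag here.</p></div>")
print(result)
```

In this sample the nav text and the pre block are dropped, the heading is uppercased, and "First step" appears twice because both the ul and its li are matched. The second call returns an empty string since the input has neither an article nor a main element.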

I/O Contract

Inputs

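html (str, required): Raw HTML of the page to parse

Outputs

str: Extracted article text, newline-joined, with each heading uppercased between two 80-character "=" rules and each non-heading element followed by a blank line; empty string if the HTML contains no article or main element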
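
The line layout of the returned string can be sketched without BeautifulSoup. The helper below (render_blocks, a hypothetical name that is not part of the Prefect example) applies the same formatting rules to already-extracted (tag name, text) pairs, so the structure of the output is easy to inspect.

```python
RULE = "=" * 80  # same rule width as in parse_article


def render_blocks(blocks: list[tuple[str, str]]) -> str:
    """Hypothetical helper: format (tag_name, text) pairs the way parse_article does."""
    content: list[str] = []
    for tag_name, text in blocks:
        if tag_name.startswith("h"):
            # Heading: blank line, rule, uppercased text, rule, blank line
            content.extend(["\n" + RULE, text.upper(), RULE + "\n"])
        else:
            # Body text followed by a blank line
            content.extend([text, ""])
    return "\n".join(content)


output = render_blocks([("h2", "Setup"), ("p", "Run the installer.")])
lines = output.split("\n")
print(output)
```

Splitting the result on newlines shows that a heading occupies five lines (blank, rule, text, rule, blank), so a caller that needs to detect headings can look for a line of 80 "=" characters.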

Related Pages

Principle:PrefectHQ_Prefect_HTML_Parsing
