Implementation:PrefectHQ Prefect Parse Article Task
| Metadata | |
|---|---|
| Sources | Prefect |
| Domains | Web_Scraping |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete task for extracting article text from HTML using BeautifulSoup within a Prefect pipeline.
Description
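The parse_article task parses the HTML with BeautifulSoup's built-in "html.parser" backend and locates the main content container: the first article tag, falling back to the first main tag. If neither exists, it returns an empty string. Otherwise it removes all pre and code elements, then collects the stripped text of every h1, h2, h3, p, ul, ol, and li element inside the container. Heading text is upper-cased and framed above and below by a line of 80 "=" characters; text from all other elements is emitted as-is, followed by a blank line. Elements with no text are skipped.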
Code Reference
- Repository: https://github.com/PrefectHQ/prefect
- File: examples/simple_web_scraper.py (L55-83)
- Signature:
    @task
    def parse_article(html: str) -> str:
        soup = BeautifulSoup(html, "html.parser")

        # Locate the main content container; give up if there is none.
        article = soup.find("article") or soup.find("main")
        if not article:
            return ""

        # Drop code blocks so only prose remains.
        for code in article.find_all(["pre", "code"]):
            code.decompose()

        content = []
        for elem in article.find_all(["h1", "h2", "h3", "p", "ul", "ol", "li"]):
            text = elem.get_text().strip()
            if not text:
                continue
            if elem.name.startswith("h"):
                # Headings: upper-case, framed by rules of 80 "=" characters.
                content.extend(["\n" + "=" * 80, text.upper(), "=" * 80 + "\n"])
            else:
                # Paragraphs and lists: plain text followed by a blank line.
                content.extend([text, ""])
        return "\n".join(content)
- Import: from prefect import task; from bs4 import BeautifulSoup
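One detail of the find_all call above is worth checking before reusing the function: the tag list contains both the list containers (ul, ol) and their items (li), so each of them is matched separately. The sketch below is not part of the Prefect example. It uses a made-up HTML snippet and only illustrates which elements that tag list selects and what text each one yields.

```python
from bs4 import BeautifulSoup

# Hypothetical sample input, used only for illustration.
html = "<main><ul><li>alpha</li><li>beta</li></ul></main>"
soup = BeautifulSoup(html, "html.parser")
main = soup.find("main")

# Same tag list as in parse_article: the container and its items all match.
matched = [
    (elem.name, elem.get_text().strip())
    for elem in main.find_all(["h1", "h2", "h3", "p", "ul", "ol", "li"])
]

if __name__ == "__main__":
    for name, text in matched:
        print(name, repr(text))
```

Here the ul element contributes the combined text of its items, and each li then contributes its own text again, so list content can appear twice in the extracted output. Whether that repetition matters depends on how the text is used downstream.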
I/O Contract
Inputs
- html (str, required) — Raw HTML string
Outputs
- str — Extracted article text with formatted headings; empty string if no article/main found
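The contract above can be exercised without a running Prefect pipeline. The following is a minimal sketch under stated assumptions: parse_article_plain is a hypothetical, undecorated copy of the task body (so only bs4 is needed), and SAMPLE_HTML is invented for the example.

```python
from bs4 import BeautifulSoup


def parse_article_plain(html: str) -> str:
    """Undecorated copy of the parse_article logic (hypothetical helper name)."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("article") or soup.find("main")
    if not article:
        return ""
    for code in article.find_all(["pre", "code"]):
        code.decompose()
    content = []
    for elem in article.find_all(["h1", "h2", "h3", "p", "ul", "ol", "li"]):
        text = elem.get_text().strip()
        if not text:
            continue
        if elem.name.startswith("h"):
            content.extend(["\n" + "=" * 80, text.upper(), "=" * 80 + "\n"])
        else:
            content.extend([text, ""])
    return "\n".join(content)


# Invented sample page: navigation outside the article, one code block inside it.
SAMPLE_HTML = """
<html><body>
  <nav><p>Site navigation</p></nav>
  <article>
    <h1>Intro</h1>
    <p>First paragraph.</p>
    <pre>print("hidden")</pre>
    <p>Second paragraph.</p>
  </article>
</body></html>
"""

text = parse_article_plain(SAMPLE_HTML)

# No article or main tag: the contract says the result is an empty string.
no_container = parse_article_plain("<div><p>No article or main tag here.</p></div>")

if __name__ == "__main__":
    print(text)
    print(repr(no_container))
```

In this sketch the heading comes back upper-cased between two rules of "=" characters, the pre block and the navigation paragraph are absent, and the input without an article or main tag yields an empty string. With Prefect installed, the original undecorated function should also be reachable as parse_article.fn, which avoids keeping a separate copy.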