# Principle: PrefectHQ Prefect HTTP Data Extraction
| Metadata | |
|---|---|
| Sources | |
| Domains | |
| Last Updated | 2026-02-09 00:00 GMT |
## Overview
A pattern for reliably extracting data from REST APIs with pagination support, automatic retries on transient failures, and structured error handling.
## Description
HTTP Data Extraction is the "Extract" phase of ETL pipelines. It involves:
- Fetching data from external REST APIs -- making HTTP GET requests to retrieve structured data
- Handling pagination -- iterating through multiple pages of results to collect complete datasets
- Managing rate limits -- respecting API rate limits through retry delays and backoff
- Recovering from transient network failures -- automatically retrying failed requests without re-fetching successful pages
In Prefect, this is implemented by wrapping HTTP client calls in @task with retries, allowing each page fetch to be independently retried without re-fetching pages that already succeeded. This is a critical design choice: if page 5 of 10 fails, only page 5 is retried -- pages 1-4 do not need to be re-fetched.
The typical extraction pattern follows this structure:
```python
import httpx
from prefect import flow, task

@task(retries=3, retry_delay_seconds=[2, 5, 15])
def fetch_page(page: int, api_base: str, per_page: int) -> list[dict]:
    response = httpx.get(
        f"{api_base}/endpoint",
        params={"page": page, "per_page": per_page},
    )
    response.raise_for_status()
    return response.json()

@flow
def extract_all(api_base: str, total_pages: int, per_page: int) -> list[list[dict]]:
    results = []
    for page in range(1, total_pages + 1):
        results.append(fetch_page(page, api_base, per_page))
    return results
```
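The flow above assumes total_pages is known in advance. Many APIs do not report a total count; a common variant is to keep fetching until a page comes back short or empty. The sketch below uses a hypothetical extract_until_empty helper (plain Python, independent of Prefect) with an in-memory fake in place of a real HTTP call:

```python
from typing import Callable

def extract_until_empty(fetch: Callable[[int], list[dict]], per_page: int) -> list[dict]:
    """Collect pages until one comes back shorter than per_page (assumed to be the last page)."""
    collected: list[dict] = []
    page = 1
    while True:
        batch = fetch(page)
        collected.extend(batch)
        if len(batch) < per_page:  # short or empty page => no more data
            break
        page += 1
    return collected

# Demo: a fake API serving 7 records, 3 per page
records = [{"id": i} for i in range(7)]
fake_fetch = lambda page: records[(page - 1) * 3 : page * 3]
print(len(extract_until_empty(fake_fetch, per_page=3)))  # 7
```

In a Prefect flow the same loop would call the fetch_page task per iteration, preserving per-page retries.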
## Usage
Use this pattern when building data pipelines that consume data from REST APIs, especially when:
- The API returns paginated results that must be iterated through
- Network reliability is a concern (the API may be temporarily unavailable)
- Rate limiting is enforced by the API provider
- You need observability into which pages succeeded or failed
- The dataset is large enough that re-fetching everything on failure would be wasteful
Common use cases include:
- Ingesting articles or posts from content APIs (DEV.to, Medium, WordPress)
- Fetching records from SaaS APIs (Salesforce, HubSpot, Stripe)
- Collecting metrics from monitoring APIs (Datadog, CloudWatch)
- Pulling data from government or public data APIs
## Theoretical Basis
The pattern combines three fundamental principles:
1. Pagination
REST APIs typically limit the number of results per response. Pagination involves iterating through pages to collect the complete dataset.
```python
# Pseudocode: Paginated extraction
collected = []
for page in range(1, total_pages + 1):
    data = fetch_page(page)
    collected.append(data)
```
2. Retry with Backoff
Automatically retrying failed HTTP requests with increasing delays prevents overwhelming a struggling service while maximizing the chance of eventual success.
| Attempt | Delay | Cumulative Wait |
|---|---|---|
| 1st retry | 2 seconds | 2 seconds |
| 2nd retry | 5 seconds | 7 seconds |
| 3rd retry | 15 seconds | 22 seconds |
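The delay schedule in the table can be sketched as a plain retry loop. retry_with_backoff below is a hypothetical illustration, not Prefect's internal implementation (in Prefect the equivalent is declared with @task(retries=3, retry_delay_seconds=[2, 5, 15])); the sleep callable is injectable so the demo can record the delays instead of actually waiting:

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def retry_with_backoff(
    op: Callable[[], T],
    delays: Sequence[float] = (2, 5, 15),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run op; after each failure wait the next delay in `delays` and retry."""
    for delay in delays:
        try:
            return op()
        except Exception:
            sleep(delay)
    return op()  # final attempt; exceptions propagate to the caller

# Demo: an operation that fails twice, then succeeds
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

waited: list[float] = []
print(retry_with_backoff(flaky, sleep=waited.append))  # ok
print(waited)  # [2, 5] -- the delays before the two retries
```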
3. Idempotent Operations
Each page fetch is independent and idempotent -- fetching page 3 produces the same result regardless of whether pages 1 and 2 succeeded or failed. This property makes it safe to retry individual page fetches without side effects.
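A minimal illustration of this property: when the fetch is a pure function of its inputs (here a hypothetical fetch_page over an in-memory list standing in for the API), retrying any page yields the same result regardless of what happened to other pages:

```python
def fetch_page(page: int, data: list[int], per_page: int = 2) -> list[int]:
    """A pure function of its inputs: same page -> same slice, no side effects."""
    return data[(page - 1) * per_page : page * per_page]

data = [10, 20, 30, 40, 50, 60]
first = fetch_page(3, data)
# Retrying page 3, before or after any other page, returns the same slice
assert fetch_page(3, data) == first == [50, 60]
```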
Combined pseudocode:
```python
# Pseudocode: Pagination with per-page retries
collected = []
for page in range(1, total_pages + 1):
    data = retry(fetch_page(page))  # Each page retried independently
    collected.append(data)
```
This combination ensures that the extraction phase is both complete (all pages are eventually fetched) and efficient (only failed pages are retried).