# Principle: PrefectHQ Prefect HTTP Data Extraction
| Metadata | |
|---|---|
| Sources | |
| Domains | |
| Last Updated | 2026-02-09 00:00 GMT |
## Overview
A pattern for reliably extracting data from REST APIs with pagination support, automatic retries on transient failures, and structured error handling.
## Description
HTTP Data Extraction is the "Extract" phase of ETL pipelines. It involves:
- Fetching data from external REST APIs -- making HTTP GET requests to retrieve structured data
- Handling pagination -- iterating through multiple pages of results to collect complete datasets
- Managing rate limits -- respecting API rate limits through retry delays and backoff
- Recovering from transient network failures -- automatically retrying failed requests without re-fetching successful pages
In Prefect, this is implemented by wrapping HTTP client calls in @task with retries, allowing each page fetch to be independently retried without re-fetching pages that already succeeded. This is a critical design choice: if page 5 of 10 fails, only page 5 is retried -- pages 1-4 do not need to be re-fetched.
The typical extraction pattern follows this structure:
```python
import httpx
from prefect import flow, task

@task(retries=3, retry_delay_seconds=[2, 5, 15])
def fetch_page(page: int, api_base: str, per_page: int) -> list[dict]:
    response = httpx.get(
        f"{api_base}/endpoint",
        params={"page": page, "per_page": per_page},
    )
    response.raise_for_status()
    return response.json()

@flow
def extract_all(api_base: str, total_pages: int, per_page: int) -> list[list[dict]]:
    results = []
    for page in range(1, total_pages + 1):
        results.append(fetch_page(page, api_base, per_page))
    return results
```
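The flow above assumes total_pages is known in advance. Many APIs do not report a total count; a common variant is to keep fetching until a page comes back short or empty. The sketch below uses a hypothetical extract_until_empty helper (plain Python, independent of Prefect) with an in-memory fake in place of a real HTTP call:

```python
from typing import Callable

def extract_until_empty(fetch: Callable[[int], list[dict]], per_page: int) -> list[dict]:
    """Collect pages until one comes back shorter than per_page (assumed to be the last page)."""
    collected: list[dict] = []
    page = 1
    while True:
        batch = fetch(page)
        collected.extend(batch)
        if len(batch) < per_page:  # short or empty page => no more data
            break
        page += 1
    return collected

# Demo: a fake API serving 7 records, 3 per page
records = [{"id": i} for i in range(7)]
fake_fetch = lambda page: records[(page - 1) * 3 : page * 3]
print(len(extract_until_empty(fake_fetch, per_page=3)))  # 7
```

In a Prefect flow the same loop would call the fetch_page task per iteration, preserving per-page retries.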
## Usage
Use this pattern when building data pipelines that consume data from REST APIs, especially when:
- The API returns paginated results that must be iterated through
- Network reliability is a concern (the API may be temporarily unavailable)
- Rate limiting is enforced by the API provider
- You need observability into which pages succeeded or failed
- The dataset is large enough that re-fetching everything on failure would be wasteful
Common use cases include:
- Ingesting articles or posts from content APIs (DEV.to, Medium, WordPress)
- Fetching records from SaaS APIs (Salesforce, HubSpot, Stripe)
- Collecting metrics from monitoring APIs (Datadog, CloudWatch)
- Pulling data from government or public data APIs
## Theoretical Basis
The pattern combines three fundamental principles:
1. Pagination
REST APIs typically limit the number of results per response. Pagination involves iterating through pages to collect the complete dataset.
```python
# Pseudocode: Paginated extraction
collected = []
for page in range(1, total_pages + 1):
    data = fetch_page(page)
    collected.append(data)
```
2. Retry with Backoff
Automatically retrying failed HTTP requests with increasing delays prevents overwhelming a struggling service while maximizing the chance of eventual success.
| Attempt | Delay | Cumulative Wait |
|---|---|---|
| 1st retry | 2 seconds | 2 seconds |
| 2nd retry | 5 seconds | 7 seconds |
| 3rd retry | 15 seconds | 22 seconds |
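The delay schedule in the table can be sketched as a plain retry loop. retry_with_backoff below is a hypothetical illustration, not Prefect's internal implementation (in Prefect the equivalent is declared with @task(retries=3, retry_delay_seconds=[2, 5, 15])); the sleep callable is injectable so the demo can record the delays instead of actually waiting:

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def retry_with_backoff(
    op: Callable[[], T],
    delays: Sequence[float] = (2, 5, 15),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run op; after each failure wait the next delay in `delays` and retry."""
    for delay in delays:
        try:
            return op()
        except Exception:
            sleep(delay)
    return op()  # final attempt; exceptions propagate to the caller

# Demo: an operation that fails twice, then succeeds
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

waited: list[float] = []
print(retry_with_backoff(flaky, sleep=waited.append))  # ok
print(waited)  # [2, 5] -- the delays before the two retries
```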
3. Idempotent Operations
Each page fetch is independent and idempotent -- fetching page 3 produces the same result regardless of whether pages 1 and 2 succeeded or failed. This property makes it safe to retry individual page fetches without side effects.
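A minimal illustration of this property: when the fetch is a pure function of its inputs (here a hypothetical fetch_page over an in-memory list standing in for the API), retrying any page yields the same result regardless of what happened to other pages:

```python
def fetch_page(page: int, data: list[int], per_page: int = 2) -> list[int]:
    """A pure function of its inputs: same page -> same slice, no side effects."""
    return data[(page - 1) * per_page : page * per_page]

data = [10, 20, 30, 40, 50, 60]
first = fetch_page(3, data)
# Retrying page 3, before or after any other page, returns the same slice
assert fetch_page(3, data) == first == [50, 60]
```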
Combined pseudocode:
```python
# Pseudocode: Pagination with per-page retries
collected = []
for page in range(1, total_pages + 1):
    data = retry(fetch_page(page))  # Each page retried independently
    collected.append(data)
```
This combination ensures that the extraction phase is both complete (all pages are eventually fetched) and efficient (only failed pages are retried).