
Principle:PrefectHQ Prefect HTTP Data Extraction

From Leeroopedia


Metadata
Last Updated 2026-02-09 00:00 GMT

Overview

A pattern for reliably extracting data from REST APIs with pagination support, automatic retries on transient failures, and structured error handling.

Description

HTTP Data Extraction is the "Extract" phase of ETL pipelines. It involves:

  • Fetching data from external REST APIs -- making HTTP GET requests to retrieve structured data
  • Handling pagination -- iterating through multiple pages of results to collect complete datasets
  • Managing rate limits -- respecting API rate limits through retry delays and backoff
  • Recovering from transient network failures -- automatically retrying failed requests without re-fetching successful pages

In Prefect, this is implemented by wrapping HTTP client calls in @task with retries, allowing each page fetch to be independently retried without re-fetching pages that already succeeded. This is a critical design choice: if page 5 of 10 fails, only page 5 is retried -- pages 1-4 do not need to be re-fetched.

The typical extraction pattern follows this structure:

import httpx
from prefect import flow, task

@task(retries=3, retry_delay_seconds=[2, 5, 15])
def fetch_page(page: int, api_base: str, per_page: int) -> list[dict]:
    # Each page fetch is its own task, so a transient failure here
    # retries only this page, never pages that already succeeded.
    response = httpx.get(
        f"{api_base}/endpoint",
        params={"page": page, "per_page": per_page},
        timeout=30,
    )
    response.raise_for_status()  # raise on 4xx/5xx so Prefect triggers a retry
    return response.json()

@flow
def extract_all(api_base: str, total_pages: int, per_page: int):
    results = []
    for page in range(1, total_pages + 1):
        results.append(fetch_page(page, api_base, per_page))
    return results

Usage

Use this pattern when building data pipelines that consume data from REST APIs, especially when:

  • The API returns paginated results that must be iterated through
  • Network reliability is a concern (the API may be temporarily unavailable)
  • Rate limiting is enforced by the API provider
  • You need observability into which pages succeeded or failed
  • The dataset is large enough that re-fetching everything on failure would be wasteful

Common use cases include:

  • Ingesting articles or posts from content APIs (DEV.to, Medium, WordPress)
  • Fetching records from SaaS APIs (Salesforce, HubSpot, Stripe)
  • Collecting metrics from monitoring APIs (Datadog, CloudWatch)
  • Pulling data from government or public data APIs

Theoretical Basis

The pattern combines three fundamental principles:

1. Pagination

REST APIs typically limit the number of results per response. Pagination involves iterating through pages to collect the complete dataset.

# Pseudocode: Paginated extraction
collected = []
for page in range(1, total_pages + 1):
    data = fetch_page(page)
    collected.append(data)
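When the total page count is not known up front, a common variant of the loop above pages until the API returns an empty page. A minimal runnable sketch of that variant -- `PAGES` is a hypothetical in-memory stand-in for a real endpoint, not part of any API above:

```python
# Hypothetical "API": three pages of records, then an empty page.
PAGES = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}

def fetch_page(page: int) -> list[dict]:
    # An empty list signals that pagination is exhausted.
    return PAGES.get(page, [])

def extract_all() -> list[dict]:
    collected = []
    page = 1
    while True:
        data = fetch_page(page)
        if not data:  # stop at the first empty page
            break
        collected.extend(data)
        page += 1
    return collected
```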

2. Retry with Backoff

Automatically retrying failed HTTP requests with increasing delays prevents overwhelming a struggling service while maximizing the chance of eventual success.

Attempt     Delay       Cumulative Wait
1st retry   2 seconds   2 seconds
2nd retry   5 seconds   7 seconds
3rd retry   15 seconds  22 seconds
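The cumulative column follows directly from the `retry_delay_seconds=[2, 5, 15]` schedule used in the task decorator above; a one-line check:

```python
from itertools import accumulate

# Backoff schedule from @task(retry_delay_seconds=[2, 5, 15]).
retry_delays = [2, 5, 15]

# Running total of wait time after each retry: 2, 2+5, 2+5+15 seconds.
cumulative = list(accumulate(retry_delays))
```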

3. Idempotent Operations

Each page fetch is independent and idempotent -- fetching page 3 produces the same result regardless of whether pages 1 and 2 succeeded or failed. This property makes it safe to retry individual page fetches without side effects.
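A minimal sketch (without Prefect) of why idempotency makes per-page retry safe. `flaky_fetch` and `retry` are illustrative stand-ins: the fetch fails deterministically on its first attempt for page 3, and only that page is retried:

```python
calls = {"failures": 0}

def flaky_fetch(page: int) -> list[dict]:
    # Deterministic stand-in for a transient network error on page 3.
    if page == 3 and calls["failures"] == 0:
        calls["failures"] += 1
        raise ConnectionError("transient failure")
    return [{"page": page}]

def retry(fn, *args, attempts=3):
    # Retry a single call; earlier successful calls are untouched.
    for i in range(attempts):
        try:
            return fn(*args)
        except ConnectionError:
            if i == attempts - 1:
                raise

results = []
for page in range(1, 5):
    results.append(retry(flaky_fetch, page))
```

Because each fetch is idempotent, the second attempt for page 3 returns the same result the first would have, and pages 1, 2, and 4 are fetched exactly once.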

Combined pseudocode:

for each page in range(1, total_pages + 1):
    data = retry(fetch_page(page))  # Each page retried independently
    collected.append(data)

This combination ensures that the extraction phase is both complete (all pages are eventually fetched) and efficient (only failed pages are retried).
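The combined loop can be sketched without Prefect. `fetch_with_retry` and the injectable `sleep` below are illustrative stand-ins for what `@task(retries=..., retry_delay_seconds=...)` provides, not Prefect APIs:

```python
import time

def fetch_with_retry(fetch, page, delays=(2, 5, 15), sleep=time.sleep):
    # One initial attempt plus one retry per entry in `delays`,
    # backing off before each retry. `sleep` is injectable so the
    # schedule can be exercised without real waiting.
    for attempt, delay in enumerate([0, *delays]):
        if delay:
            sleep(delay)  # back off before re-trying
        try:
            return fetch(page)
        except ConnectionError:
            if attempt == len(delays):
                raise  # retries exhausted

def extract_all(fetch, total_pages, **kwargs):
    # Each page is retried independently; successful pages are never re-fetched.
    return [fetch_with_retry(fetch, p, **kwargs)
            for p in range(1, total_pages + 1)]
```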
