Workflow:PrefectHQ Prefect API Sourced ETL

Knowledge Sources	Prefect Prefect Docs httpx Docs pandas Docs
Domains	Data_Engineering, ETL
Last Updated	2026-02-09 22:00 GMT

Overview

End-to-end process for building an ETL pipeline that fetches JSON from a public REST API, transforms it into a structured pandas DataFrame, and loads the result to CSV using Prefect tasks with automatic retries and observability.

Description

This workflow demonstrates the standard Extract-Transform-Load pattern using Prefect. It fetches paginated JSON data from a REST API, normalises the raw records into a tidy pandas DataFrame by selecting and flattening relevant fields, and persists the output as a CSV file. Each step is wrapped in a Prefect task with configurable retries and exponential backoff, providing resilience against transient network failures. The flow orchestrator coordinates the tasks sequentially and surfaces all print statements as structured logs.

Key outputs:

A CSV file containing cleaned, analytics-ready data
Full observability of each extraction, transformation, and load step in the Prefect UI

Scope:

From paginated API JSON responses to a single CSV on disk
Handles multiple pages of data with per-page retry logic

Usage

Execute this workflow when you need to pull structured data from a REST API on a recurring basis and produce a flat-file output for downstream analytics or BI tools. It is suitable when you have a paginated JSON API endpoint and need a reliable, retryable pipeline that anyone can run locally or schedule in Prefect Cloud.

Execution Steps

Step 1: Configure Pipeline Parameters

Define the target API base URL, the number of pages to fetch, the page size, and the output file path. These parameters are passed to the flow as arguments, enabling reuse across different API endpoints and configurations.

Key considerations:

Respect API rate limits by adjusting page count and size
Use a configurable output path for environment portability

Step 2: Extract Data from API

Fetch each page of JSON data from the REST API endpoint. Each page request is an independent task invocation with automatic retries (3 attempts with escalating delays of 2, 5, and 15 seconds). The HTTP client enforces a 30-second timeout per request.

Key considerations:

Each page fetch is a separate Prefect task run for granular observability
Retries with exponential backoff handle transient network failures
The task raises on non-2xx status codes to trigger retry logic

Step 3: Transform to DataFrame

Combine all fetched pages into a single list of records, then normalise the nested JSON into a flat pandas DataFrame. Select only the columns relevant to the analysis (e.g., identifiers, timestamps, engagement metrics, author information).

Key considerations:

Use pandas json_normalize for consistent flattening of nested structures
Column selection ensures only relevant fields are carried forward
This step runs as a single task since it operates on in-memory data

Step 4: Load to CSV

Persist the transformed DataFrame to a CSV file on disk and log a preview of the first rows. This step confirms successful pipeline completion and provides a quick data quality check via the logged preview.

Key considerations:

The output path is configurable for different environments
A preview is logged to help catch data quality issues early
The CSV is written without an index column for clean downstream consumption

Execution Diagram

GitHub URL

Workflow Repository