Workflow:PrefectHQ Prefect API Sourced ETL
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, ETL |
| Last Updated | 2026-02-09 22:00 GMT |
Overview
End-to-end process for building an ETL pipeline that fetches JSON from a public REST API, transforms it into a structured pandas DataFrame, and loads the result to CSV using Prefect tasks with automatic retries and observability.
Description
This workflow demonstrates the standard Extract-Transform-Load pattern using Prefect. It fetches paginated JSON data from a REST API, normalises the raw records into a tidy pandas DataFrame by selecting and flattening relevant fields, and persists the output as a CSV file. Each step is wrapped in a Prefect task with configurable retries and exponential backoff, providing resilience against transient network failures. The flow orchestrator coordinates the tasks sequentially and surfaces all print statements as structured logs.
Key outputs:
- A CSV file containing cleaned, analytics-ready data
- Full observability of each extraction, transformation, and load step in the Prefect UI
Scope:
- From paginated API JSON responses to a single CSV on disk
- Handles multiple pages of data with per-page retry logic
Usage
Execute this workflow when you need to pull structured data from a REST API on a recurring basis and produce a flat-file output for downstream analytics or BI tools. It is suitable when you have a paginated JSON API endpoint and need a reliable, retryable pipeline that anyone can run locally or schedule in Prefect Cloud.
Execution Steps
Step 1: Configure Pipeline Parameters
Define the target API base URL, the number of pages to fetch, the page size, and the output file path. These parameters are passed to the flow as arguments, enabling reuse across different API endpoints and configurations.
Key considerations:
- Respect API rate limits by adjusting page count and size
- Use a configurable output path for environment portability
Step 2: Extract Data from API
Fetch each page of JSON data from the REST API endpoint. Each page request is an independent task invocation with automatic retries (3 attempts with escalating delays of 2, 5, and 15 seconds). The HTTP client enforces a 30-second timeout per request.
Key considerations:
- Each page fetch is a separate Prefect task run for granular observability
- Retries with exponential backoff handle transient network failures
- The task raises on non-2xx status codes to trigger retry logic
Step 3: Transform to DataFrame
Combine all fetched pages into a single list of records, then normalise the nested JSON into a flat pandas DataFrame. Select only the columns relevant to the analysis (e.g., identifiers, timestamps, engagement metrics, author information).
Key considerations:
- Use pandas json_normalize for consistent flattening of nested structures
- Column selection ensures only relevant fields are carried forward
- This step runs as a single task since it operates on in-memory data
Step 4: Load to CSV
Persist the transformed DataFrame to a CSV file on disk and log a preview of the first rows. This step confirms successful pipeline completion and provides a quick data quality check via the logged preview.
Key considerations:
- The output path is configurable for different environments
- A preview is logged to help catch data quality issues early
- The CSV is written without an index column for clean downstream consumption