Principle:Spotify Luigi External API Integration
| Knowledge Sources | |
|---|---|
| Domains | Data_Integration, API |
| Last Updated | 2026-02-10 08:00 GMT |
Overview
Connecting pipeline steps to external service APIs in order to extract data from, and load data into, third-party systems.
Description
External API integration is the practice of incorporating calls to third-party web service APIs as first-class steps within a data pipeline. Modern data workflows frequently need to pull data from SaaS platforms (CRM systems, marketing tools, cloud services) or push processed results back into external systems. Rather than handling this through ad-hoc scripts outside the pipeline, API integration tasks are managed by the orchestrator like any other pipeline step, complete with dependency tracking, retry logic, and completion checking. This approach treats external API endpoints as data sources or sinks, wrapping the complexities of authentication (OAuth, API keys), pagination, rate limiting, and error handling within a reusable task abstraction.
Usage
Use external API integration when the pipeline needs to extract data from SaaS platforms such as CRM systems, advertising platforms, or cloud service APIs, or when pipeline results must be pushed into external systems. It is appropriate whenever data ingestion or export involves HTTP-based service communication with authentication and structured request/response patterns.
Theoretical Basis
External API integration in data pipelines follows the extract-transform-load (ETL) pattern applied to web service endpoints:
1. Authentication -- Establish an authenticated session with the external service. Common mechanisms include:
* API Key -- A static token passed in request headers or query parameters
* OAuth 2.0 -- Token-based authentication requiring an initial authorization flow and periodic token refresh
* Session-based -- Username/password authentication that returns a session token
2. Data Extraction -- Retrieve data from the API through one or more requests. Key considerations include:
* Pagination -- APIs typically return large result sets in pages. The extractor must iterate through all pages, following cursor tokens or advancing offset parameters until the API signals that no further pages remain.
* Rate Limiting -- Many APIs impose request rate limits. The extractor must respect these by backing off when throttled:
IF response status = 429 (Too Many Requests) THEN wait for the duration given in the Retry-After header (or a default delay if the header is absent) AND retry
* Incremental Extraction -- Where supported, request only data that has changed since the last extraction using timestamps or change tokens, reducing data volume and API load.
3. Data Normalization -- Transform API responses (typically JSON or XML) into a format suitable for downstream pipeline processing. This may involve flattening nested structures, type conversion, and schema alignment.
4. Idempotency -- API calls should be designed to be safely retried. For read operations this is usually inherent; for write operations the API should support idempotency keys or upsert semantics so that a retried request does not create duplicate records.
5. Error Handling -- Transient errors (network timeouts, HTTP 5xx server errors) trigger a bounded number of retries with exponential backoff. Permanent errors (authentication failures such as 401 or 403, invalid requests such as 400) cause immediate task failure with diagnostic information.
6. Completion Tracking -- The pipeline records successful extraction or loading as a target marker, preventing duplicate API calls on re-runs.
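The extraction and error-handling steps above can be sketched as a small retry loop around a paginated fetch. This is a minimal illustration under stated assumptions, not a definitive implementation: the `fetch` callable, the `(status, headers, body)` response tuple, and the `items` / `next_cursor` field names are hypothetical and would need to be adapted to the actual API and HTTP client in use. It also assumes `Retry-After` is given in seconds, although the header may carry an HTTP date instead.

```python
import random
import time
from typing import Callable, Dict, Iterator, Optional, Tuple

# Hypothetical response shape: (status code, headers, parsed JSON body).
Response = Tuple[int, Dict[str, str], dict]
Fetch = Callable[[Optional[str]], Response]


class PermanentAPIError(Exception):
    """Raised for errors that retrying cannot fix (e.g. 400, 401, 403)."""


class RetriesExhausted(Exception):
    """Raised when transient errors persist past the retry budget."""


def fetch_with_retry(fetch: Fetch, cursor: Optional[str],
                     max_attempts: int = 5, base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch(cursor), retrying on 429 and 5xx responses."""
    for attempt in range(max_attempts):
        status, headers, body = fetch(cursor)
        if status == 200:
            return body
        if status == 429:
            # Assumption: Retry-After is in seconds; fall back to backoff.
            delay = float(headers.get("Retry-After", base_delay * 2 ** attempt))
        elif 500 <= status < 600:
            # Exponential backoff with a little jitter for transient errors.
            delay = base_delay * 2 ** attempt + random.uniform(0, 0.1)
        else:
            # Anything else is treated as permanent: fail immediately.
            raise PermanentAPIError(f"unrecoverable status {status}")
        sleep(delay)
    raise RetriesExhausted(f"gave up after {max_attempts} attempts")


def iter_records(fetch: Fetch, **retry_options) -> Iterator[dict]:
    """Yield every record, following cursor tokens until none remains."""
    cursor = None
    while True:
        body = fetch_with_retry(fetch, cursor, **retry_options)
        yield from body["items"]
        cursor = body.get("next_cursor")
        if cursor is None:
            break
```

Passing `sleep` as a parameter keeps the loop testable without real delays; a caller would wrap its HTTP client in a function matching the `Fetch` signature and consume `iter_records(fetch)` like any other iterator.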
The fundamental design principle is encapsulation: the complexities of API communication are hidden behind the task interface, presenting a clean data-in/data-out abstraction to the rest of the pipeline.
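As a rough illustration of this encapsulation, the sketch below wraps extraction, normalization, and completion tracking behind a small task object. It deliberately uses only the standard library: `ApiExtractTask`, `flatten`, and the `extract` callable are hypothetical names introduced here, and the `complete()` / `run()` pair only loosely mirrors the way an orchestrator such as Luigi decides whether a task's output already exists.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted keys: {"a": {"b": 1}} -> {"a.b": 1}."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


class ApiExtractTask:
    """Hypothetical task: pulls records from an API and writes JSON lines."""

    def __init__(self, extract: Callable[[], Iterable[dict]], output_path: Path):
        # `extract` hides authentication, pagination, and retries.
        self.extract = extract
        self.output_path = Path(output_path)

    def complete(self) -> bool:
        # The output file doubles as the completion marker.
        return self.output_path.exists()

    def run(self) -> None:
        if self.complete():
            return  # Re-runs make no further API calls.
        tmp_path = self.output_path.with_suffix(".tmp")
        with tmp_path.open("w", encoding="utf-8") as handle:
            for record in self.extract():
                handle.write(json.dumps(flatten(record)) + "\n")
        # Rename into place so a partial file is never seen as complete.
        tmp_path.replace(self.output_path)
```

Writing to a temporary path and renaming at the end means a crash mid-extraction leaves no marker behind, so the next run repeats the extraction instead of treating partial output as finished.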