Principle:DataTalksClub Data engineering zoomcamp Loading Path Selection
| Page Metadata | |
|---|---|
| Knowledge Sources | dlt docs: dlt Documentation, Design Patterns: Strategy Pattern |
| Domains | Data_Engineering, Data_Ingestion |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Loading path selection is the application of the strategy pattern to data ingestion, allowing runtime selection between alternative data loading approaches based on infrastructure availability and performance requirements.
Description
Data pipelines often need to support multiple methods for moving data from source to destination. The choice between methods depends on factors that are not known at development time -- available cloud infrastructure, network constraints, data volume, and operational preferences. Loading path selection provides a mechanism for the operator to choose the most appropriate loading strategy at runtime without modifying the pipeline code.
In the context of data ingestion, two common strategies are:
- Staged loading -- Data is first downloaded from the source and uploaded to an intermediate storage layer (such as a cloud object store); the loading framework then reads it from that location and loads it into the final destination. This approach provides checkpointing, enables retries from the staging layer, and can leverage optimized bulk-load connectors between cloud storage and data warehouses.
- Direct streaming -- Data is fetched from the source and loaded directly into the destination without an intermediate staging layer. This approach reduces storage costs and latency but offers fewer recovery options if a load fails partway through.
The strategy pattern encapsulates each loading approach behind a common interface (in this case, a generator function that yields data records). The pipeline execution layer does not need to know which strategy was selected -- it simply consumes records from whatever source is provided.
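The shared interface described above can be sketched in Python. This is a minimal illustration, not the course's or dlt's implementation: `staged_records`, `direct_records`, `run_pipeline`, and the `fetch_page` callable are hypothetical names, and in-memory lists stand in for cloud storage and network calls.

```python
from typing import Callable, Iterable, Iterator

Record = dict  # assumption: each record is a plain dictionary


def staged_records(staged_batches: Iterable[list[Record]]) -> Iterator[Record]:
    """Hypothetical staged strategy: yield records from batches already in staging."""
    for batch in staged_batches:
        yield from batch


def direct_records(
    fetch_page: Callable[[int], list[Record]], page_count: int
) -> Iterator[Record]:
    """Hypothetical direct strategy: fetch each page and yield its records at once."""
    for page in range(page_count):
        yield from fetch_page(page)


def run_pipeline(records: Iterable[Record]) -> int:
    """Consume any iterable of records and return how many were seen."""
    loaded = 0
    for _record in records:
        loaded += 1  # a real pipeline would write the record to the destination
    return loaded
```

Because both strategies return an iterator of records, `run_pipeline` can be called with either one and never inspects which strategy produced its input.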
Key design considerations include:
- Interface consistency -- All strategies must produce output in a format the pipeline can consume
- Resource allocation -- Different strategies may require different credentials, permissions, or infrastructure
- Error handling -- Each strategy has its own failure modes and retry semantics
- Performance characteristics -- Staged loading may be faster for very large datasets due to bulk-load optimizations, while direct streaming avoids the overhead of intermediate storage
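The resource allocation point can be made concrete with a small check that runs before any data is moved. The sketch below assumes each strategy declares the settings it needs; the strategy names and setting keys (`bucket_name`, `storage_credentials`) are illustrative placeholders, not names from dlt or any cloud SDK.

```python
# Hypothetical mapping from strategy name to the settings that strategy needs.
REQUIRED_SETTINGS: dict[str, set[str]] = {
    "staged": {"bucket_name", "storage_credentials"},
    "direct": set(),  # direct streaming needs no staging configuration
}


def missing_settings(strategy: str, config: dict) -> set[str]:
    """Return the required settings that are absent from config."""
    if strategy not in REQUIRED_SETTINGS:
        raise ValueError(f"Unsupported loading strategy: {strategy!r}")
    return REQUIRED_SETTINGS[strategy] - set(config)
```

Running such a check at selection time surfaces a missing credential as a clear configuration error rather than as a failure partway through a load.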
Usage
Use loading path selection when:
- The pipeline must support multiple deployment environments with different infrastructure
- Operators need flexibility to choose between cost, speed, and reliability tradeoffs
- The data source supports multiple access patterns (direct download, cloud storage, API)
- The team wants to experiment with different loading strategies without branching the codebase
Theoretical Basis
The strategy pattern for loading path selection follows this conceptual structure:
    FUNCTION create_data_source(strategy, source_urls, config):
        IF strategy == STAGED:
            stage = upload_to_intermediate_storage(source_urls, config.storage)
            source = create_reader_from_storage(stage)
        ELSE IF strategy == DIRECT:
            source = create_streaming_reader(source_urls)
        ELSE:
            RAISE invalid_strategy_error
        RETURN source

    -- At runtime:
    selected_strategy = get_user_selection()
    data_source = create_data_source(selected_strategy, urls, config)
    pipeline.run(data_source)
The essential property of this pattern is that the pipeline.run() call is strategy-agnostic. The pipeline receives a data source that conforms to an expected interface (a generator of records) regardless of whether that source reads from cloud storage or streams directly from the web. This decoupling means new strategies can be added in the future (e.g., reading from a local file cache, reading from a message queue) without modifying the pipeline execution logic.
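One possible Python rendering of the conceptual structure is shown below. It is a sketch under stated assumptions: a dictionary stands in for the intermediate storage layer, `fetch` is a stand-in for a real download, and every name (`LoadStrategy`, `staged_source`, `direct_source`, `run_pipeline`) is hypothetical rather than part of dlt.

```python
from enum import Enum
from typing import Iterable, Iterator


class LoadStrategy(Enum):
    STAGED = "staged"
    DIRECT = "direct"


def fetch(url: str) -> list[dict]:
    """Stand-in for a network download: returns one fake record per URL."""
    return [{"url": url}]


def staged_source(urls: list[str], storage: dict) -> Iterator[dict]:
    """Copy every URL's data into storage first, then yield records from storage."""
    for url in urls:
        storage[url] = fetch(url)  # the "upload" step, here a dictionary write
    for url in urls:
        yield from storage[url]


def direct_source(urls: list[str]) -> Iterator[dict]:
    """Yield records straight from the source, with no intermediate copy."""
    for url in urls:
        yield from fetch(url)


def create_data_source(
    strategy: LoadStrategy, urls: list[str], storage: dict
) -> Iterator[dict]:
    if strategy is LoadStrategy.STAGED:
        return staged_source(urls, storage)
    if strategy is LoadStrategy.DIRECT:
        return direct_source(urls)
    raise ValueError(f"Unsupported loading strategy: {strategy!r}")


def run_pipeline(source: Iterable[dict]) -> list[dict]:
    """Strategy-agnostic consumer: it only iterates over records."""
    return list(source)
```

A user's text selection can be converted with `LoadStrategy("staged")`, which raises `ValueError` for unknown values, so invalid input is rejected before any source is built.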
The branching point where the strategy is selected should be as early as possible in the pipeline, before any strategy-specific resources are allocated. This prevents unnecessary initialization (e.g., creating a cloud storage client when direct streaming is selected).
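Early branching can be sketched with a registry of builder functions, so that strategy-specific resources are created only inside the branch that needs them. Everything here is hypothetical: `FakeStorageClient` stands in for a real cloud storage client, and `created_clients` exists only to make the allocation observable.

```python
from typing import Callable, Iterator

created_clients: list[str] = []  # records which fake clients were constructed


class FakeStorageClient:
    """Stand-in for a cloud storage client that is costly to create."""

    def __init__(self) -> None:
        created_clients.append("storage")

    def read(self, urls: list[str]) -> Iterator[dict]:
        return ({"url": url, "via": "staging"} for url in urls)


def build_staged(urls: list[str]) -> Iterator[dict]:
    client = FakeStorageClient()  # allocated only when staged loading is chosen
    return client.read(urls)


def build_direct(urls: list[str]) -> Iterator[dict]:
    return ({"url": url, "via": "direct"} for url in urls)


# Registry of strategies; adding a strategy means adding one entry here.
BUILDERS: dict[str, Callable[[list[str]], Iterator[dict]]] = {
    "staged": build_staged,
    "direct": build_direct,
}


def create_data_source(strategy: str, urls: list[str]) -> Iterator[dict]:
    try:
        builder = BUILDERS[strategy]
    except KeyError:
        raise ValueError(f"Unsupported loading strategy: {strategy!r}") from None
    return builder(urls)
```

Selecting `"direct"` never constructs `FakeStorageClient`, which is the behavior the paragraph above recommends. The registry is one design choice among several; an if/else chain gives the same guarantee as long as resources are created inside the branches.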