Principle:DataTalksClub Data engineering zoomcamp Loading Path Selection

From Leeroopedia


Page Metadata
Knowledge Sources: dlt docs (dlt Documentation); Design Patterns (Strategy Pattern)
Domains: Data_Engineering, Data_Ingestion
Last Updated: 2026-02-09 14:00 GMT

Overview

Loading path selection applies the strategy pattern to data ingestion: the operator chooses between alternative data loading approaches at runtime, based on infrastructure availability and performance requirements.

Description

Data pipelines often need to support multiple methods for moving data from source to destination. The choice between methods depends on factors that are not known at development time -- available cloud infrastructure, network constraints, data volume, and operational preferences. Loading path selection provides a mechanism for the operator to choose the most appropriate loading strategy at runtime without modifying the pipeline code.

In the context of data ingestion, two common strategies are:

  • Staged loading -- Data is first downloaded from the source and uploaded to an intermediate storage layer (such as a cloud object store), then a framework reads from that intermediate location and loads it into the final destination. This approach provides checkpointing, enables retries from the staging layer, and can leverage optimized bulk-load connectors between cloud storage and data warehouses.
  • Direct streaming -- Data is fetched from the source and loaded directly into the destination without an intermediate staging layer. This approach avoids the storage cost and the extra transfer step of staging, but offers fewer recovery options if a load fails partway through, because a retry must fetch from the source again.

The strategy pattern encapsulates each loading approach behind a common interface (in this case, a generator function that yields data records). The pipeline execution layer does not need to know which strategy was selected -- it simply consumes records from whatever source is provided.
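The common interface described above can be sketched in Python as follows. This is a minimal illustration, not the course's implementation: the names staged_records, direct_records, and load are hypothetical, and an in-memory dict stands in for the intermediate storage layer.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


def staged_records(staging_area: Dict[str, List[Record]], key: str) -> Iterator[Record]:
    """Yield records previously written to an intermediate store.

    The dict is a hypothetical stand-in for a cloud object store.
    """
    yield from staging_area[key]


def direct_records(fetch: Callable[[], Iterable[Record]]) -> Iterator[Record]:
    """Yield records straight from the source, with no staging step.

    `fetch` is a hypothetical callable that reads from the source.
    """
    yield from fetch()


def load(records: Iterable[Record]) -> List[Record]:
    """Stand-in for the pipeline: it only iterates, so it cannot tell
    which strategy produced the records."""
    return [dict(record) for record in records]
```

Because both functions are generators over the same record shape, the consumer needs no branching on the selected strategy.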

Key design considerations include:

  • Interface consistency -- All strategies must produce output in a format the pipeline can consume
  • Resource allocation -- Different strategies may require different credentials, permissions, or infrastructure
  • Error handling -- Each strategy has its own failure modes and retry semantics
  • Performance characteristics -- Staged loading may be faster for very large datasets due to bulk-load optimizations, while direct streaming avoids the overhead of intermediate storage
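The difference in retry semantics between the two strategies can be illustrated with a small simulation. Everything here is an assumption made for illustration: FlakyDestination, run_staged, and run_direct are hypothetical names, and a list in memory plays the role of the staging layer.

```python
from typing import Callable, Dict, List

Record = Dict[str, int]


class FlakyDestination:
    """Hypothetical destination that fails on its first load attempt."""

    def __init__(self) -> None:
        self.attempts = 0
        self.rows: List[Record] = []

    def load(self, records: List[Record]) -> None:
        self.attempts += 1
        if self.attempts == 1:
            raise ConnectionError("simulated transient failure")
        self.rows = list(records)


def run_staged(fetch: Callable[[], List[Record]], dest: FlakyDestination,
               max_attempts: int = 3) -> None:
    staged = list(fetch())  # source is read once; the copy acts as a checkpoint
    for _ in range(max_attempts):
        try:
            dest.load(staged)  # retries reuse the staged copy
            return
        except ConnectionError:
            continue
    raise RuntimeError("staged load failed after retries")


def run_direct(fetch: Callable[[], List[Record]], dest: FlakyDestination,
               max_attempts: int = 3) -> None:
    for _ in range(max_attempts):
        try:
            dest.load(list(fetch()))  # every retry fetches from the source again
            return
        except ConnectionError:
            continue
    raise RuntimeError("direct load failed after retries")
```

In this sketch a staged run touches the source once no matter how many load attempts are needed, while a direct run touches it once per attempt.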

Usage

Use loading path selection when:

  • The pipeline must support multiple deployment environments with different infrastructure
  • Operators need flexibility to choose between cost, speed, and reliability tradeoffs
  • The data source supports multiple access patterns (direct download, cloud storage, API)
  • The team wants to experiment with different loading strategies without branching the codebase

Theoretical Basis

The strategy pattern for loading path selection follows this conceptual structure:

FUNCTION create_data_source(strategy, source_urls, config):
    IF strategy == STAGED:
        stage = upload_to_intermediate_storage(source_urls, config.storage)
        source = create_reader_from_storage(stage)
    ELSE IF strategy == DIRECT:
        source = create_streaming_reader(source_urls)
    ELSE:
        RAISE invalid_strategy_error

    RETURN source

-- At runtime:
selected_strategy = get_user_selection()
data_source = create_data_source(selected_strategy, urls, config)
pipeline.run(data_source)

The essential property of this pattern is that the pipeline.run() call is strategy-agnostic. The pipeline receives a data source that conforms to an expected interface (a generator of records) regardless of whether that source reads from cloud storage or streams directly from the web. This decoupling means new strategies can be added in the future (e.g., reading from a local file cache, reading from a message queue) without modifying the pipeline execution logic.
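One way to render the pseudocode above in Python is with a registry that maps each strategy to a source factory. This is a sketch under stated assumptions: Strategy, FACTORIES, staged_source, direct_source, and run_pipeline are hypothetical names, and the staged branch only simulates the upload step.

```python
from enum import Enum
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, str]
SourceFactory = Callable[[List[str]], Iterator[Record]]


class Strategy(Enum):
    STAGED = "staged"
    DIRECT = "direct"


def staged_source(urls: List[str]) -> Iterator[Record]:
    # Stand-in for "upload to intermediate storage, then read it back".
    staged = [{"url": url, "via": "staging"} for url in urls]
    yield from staged


def direct_source(urls: List[str]) -> Iterator[Record]:
    # Stand-in for streaming each URL straight to the pipeline.
    for url in urls:
        yield {"url": url, "via": "direct"}


# Adding a strategy means adding an entry here; run_pipeline is untouched.
FACTORIES: Dict[Strategy, SourceFactory] = {
    Strategy.STAGED: staged_source,
    Strategy.DIRECT: direct_source,
}


def create_data_source(strategy: Strategy, urls: List[str]) -> Iterator[Record]:
    try:
        factory = FACTORIES[strategy]
    except KeyError:
        raise ValueError(f"unknown loading strategy: {strategy!r}") from None
    return factory(urls)


def run_pipeline(source: Iterable[Record]) -> List[Record]:
    """Strategy-agnostic consumer: it only iterates over records."""
    return list(source)
```

A user selection such as the string "direct" can be converted with Strategy("direct"), which raises ValueError for unrecognized values before any source is created.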

The branching point where the strategy is selected should be as early as possible in the pipeline, before any strategy-specific resources are allocated. This prevents unnecessary initialization (e.g., creating a cloud storage client when direct streaming is selected).
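The point about early branching can be made concrete with a small sketch in which the storage client is constructed only inside the staged branch. StorageClient is a hypothetical stand-in, not a real cloud SDK class, and its counter exists only to make the allocation visible.

```python
from typing import Iterator, List


class StorageClient:
    """Hypothetical stand-in for a cloud storage client.

    `created` counts constructions so the effect of branching is observable.
    """

    created = 0

    def __init__(self) -> None:
        StorageClient.created += 1
        self.objects: List[str] = []

    def upload(self, url: str) -> None:
        self.objects.append(url)


def create_data_source(strategy: str, urls: List[str]) -> Iterator[str]:
    # Branch first; allocate strategy-specific resources only afterwards.
    if strategy == "staged":
        client = StorageClient()  # created only when staging is selected
        for url in urls:
            client.upload(url)
        return iter(client.objects)
    if strategy == "direct":
        return iter(urls)  # no storage client is ever constructed
    raise ValueError(f"unknown loading strategy: {strategy!r}")
```

Constructing the client at module level or before the branch would force direct-streaming runs to supply storage credentials they never use.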
