
Principle:Marker Inc Korea AutoRAG Pipeline Runner Init

From Leeroopedia
Knowledge Sources
Domains: RAG Pipeline Deployment, Module Instantiation
Last Updated: 2026-02-12 00:00 GMT

Overview

Pipeline runner initialization converts a static pipeline configuration into a live, executable chain of module instances ready to process queries.

Description

Once the best configuration has been extracted from an evaluation trial, the pipeline must be materialized: each module specification in the YAML must be resolved to an actual Python object that can process data. Pipeline runner initialization performs this materialization step.

The initialization iterates over every node in every node line of the configuration. For each node, it enforces that exactly one module is specified (the deployment invariant established by best config extraction). It then extracts the module_type string and uses a support registry (get_support_modules) to dynamically look up the corresponding Python class. The class is instantiated with the project_dir and any additional module-specific parameters from the config. The resulting instances are stored in an ordered list that mirrors the pipeline's execution order.

This approach combines the Registry and Factory design patterns: module types are registered as strings, and a factory function resolves them to concrete classes at runtime. The pipeline configuration is therefore fully declarative: swapping a module means editing the config, with no change to Python code.
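The registry-and-factory idea can be sketched in a few lines of plain Python. This is a minimal illustration, not AutoRAG's implementation: the `register` decorator, the `lookup` function, the `BaseModule` class, and the two retriever classes are hypothetical names invented for the example, and `lookup` only plays the role the text assigns to get_support_modules.

```python
from typing import Any, Callable, Dict, List

# Hypothetical registry: maps module_type strings to classes.
_REGISTRY: Dict[str, type] = {}


def register(module_type: str) -> Callable[[type], type]:
    """Class decorator that records a class under a module_type string."""
    def decorator(cls: type) -> type:
        _REGISTRY[module_type] = cls
        return cls
    return decorator


def lookup(module_type: str) -> type:
    """Factory lookup: resolve a registered string to its class."""
    if module_type not in _REGISTRY:
        raise KeyError(f"unknown module_type: {module_type!r}")
    return _REGISTRY[module_type]


class BaseModule:
    """Hypothetical base class: stores the project dir and extra parameters."""
    def __init__(self, project_dir: str, **params: Any) -> None:
        self.project_dir = project_dir
        self.params = params


@register("keyword_retriever")
class KeywordRetriever(BaseModule):
    pass


@register("dense_retriever")
class DenseRetriever(BaseModule):
    pass


def build(config: Dict[str, Any], project_dir: str) -> List[BaseModule]:
    """Turn a declarative config into an ordered list of module instances."""
    instances: List[BaseModule] = []
    for node_line in config["node_lines"]:
        for node in node_line["nodes"]:
            spec = dict(node["modules"][0])  # copy so the config is not mutated
            module_cls = lookup(spec.pop("module_type"))
            instances.append(module_cls(project_dir=project_dir, **spec))
    return instances


config = {
    "node_lines": [
        {"nodes": [{"modules": [{"module_type": "keyword_retriever", "top_k": 5}]}]}
    ]
}
chain = build(config, "/tmp/project")

# Swapping the module is a config-only change; build() itself is untouched.
config["node_lines"][0]["nodes"][0]["modules"][0]["module_type"] = "dense_retriever"
swapped_chain = build(config, "/tmp/project")
```

Because the class is chosen by string at runtime, the second call to `build` produces a different module type from the same code path.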

Usage

Runner initialization is used whenever you need to deploy an AutoRAG pipeline for inference. There are three entry points:

  • Direct construction: Pass a config dictionary and project directory to the constructor.
  • From YAML: Use the from_yaml class method with a path to the extracted best config YAML file.
  • From trial folder: Use the from_trial_folder class method, which internally calls extract_best_config and then initializes the runner.
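The three entry points follow the common "alternate constructor" pattern, which can be sketched as below. Everything here is an assumption made for illustration: `SketchRunner` and `extract_best_config_stub` are hypothetical stand-ins, the argument names are guesses, and deriving the project directory from the parent of the trial folder is an assumption rather than documented behavior. The YAML path relies on the third-party PyYAML package.

```python
import os
from typing import Any, Dict


def extract_best_config_stub(trial_path: str) -> Dict[str, Any]:
    """Hypothetical stand-in for extract_best_config; returns a fixed config."""
    return {"node_lines": []}


class SketchRunner:
    """Illustrative runner showing three ways to obtain an instance."""

    def __init__(self, config: Dict[str, Any], project_dir: str) -> None:
        # Direct construction: the caller already holds a config dictionary.
        self.config = config
        self.project_dir = project_dir

    @classmethod
    def from_yaml(cls, yaml_path: str, project_dir: str) -> "SketchRunner":
        # Load the extracted best-config YAML, then delegate to __init__.
        import yaml  # PyYAML (third-party), imported lazily

        with open(yaml_path, "r", encoding="utf-8") as f:
            config = yaml.safe_load(f)
        return cls(config, project_dir)

    @classmethod
    def from_trial_folder(cls, trial_path: str) -> "SketchRunner":
        # Extract the best config first, then delegate to __init__.
        config = extract_best_config_stub(trial_path)
        # Assumption: the project directory is the parent of the trial folder.
        return cls(config, os.path.dirname(trial_path))


direct = SketchRunner({"node_lines": []}, project_dir="/tmp/project")
from_trial = SketchRunner.from_trial_folder("/tmp/project/0")
```

All three paths end in the same constructor, so the module chain is built in exactly one place regardless of where the config came from.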

Theoretical Basis

The initialization follows a linear module chain construction pattern:

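modules = []
for node_line in config["node_lines"]:
    for node in node_line["nodes"]:
        # Deployment invariant: exactly one module per node.
        if len(node["modules"]) != 1:
            raise ValueError("expected exactly one module per node")
        params = dict(node["modules"][0])        # copy, so the config is left untouched
        module_type = params.pop("module_type")  # remaining keys are module parameters
        module_cls = get_support_modules(module_type)
        modules.append(module_cls(project_dir=project_dir, **params))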

Key design constraints:

  • Single module per node: The runner enforces exactly one module per node. If the config still has multiple candidates (i.e., it was not extracted), a ValueError is raised.
  • Ordered execution: Modules are stored in a flat list preserving node line and node order, establishing the sequential execution chain.
  • Environment propagation: The PROJECT_DIR environment variable is set during initialization so that downstream modules can locate project resources (corpus data, vector stores, BM25 indices).
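The three constraints can be exercised together in a small self-contained sketch. The helper `validate_and_flatten` is a hypothetical name, the module type strings are illustrative, and the helper returns type strings instead of instantiating classes so that the example stays runnable without AutoRAG installed.

```python
import os
from typing import Any, Dict, List


def validate_and_flatten(config: Dict[str, Any], project_dir: str) -> List[str]:
    """Check the one-module-per-node rule and return module types in order."""
    # Environment propagation: expose the project dir to downstream code.
    os.environ["PROJECT_DIR"] = project_dir

    order: List[str] = []
    for node_line in config["node_lines"]:
        for node in node_line["nodes"]:
            modules = node["modules"]
            # Single module per node: reject configs that still hold candidates.
            if len(modules) != 1:
                raise ValueError(
                    f"expected exactly one module per node, found {len(modules)}"
                )
            # Ordered execution: append in node line order, then node order.
            order.append(modules[0]["module_type"])
    return order


extracted = {
    "node_lines": [
        {"nodes": [{"modules": [{"module_type": "bm25"}]}]},
        {"nodes": [
            {"modules": [{"module_type": "fstring"}]},
            {"modules": [{"module_type": "llama_index_llm"}]},
        ]},
    ]
}
order = validate_and_flatten(extracted, "/tmp/project")

# A config that was never extracted still lists several candidates per node.
unextracted = {
    "node_lines": [
        {"nodes": [{"modules": [{"module_type": "bm25"}, {"module_type": "vectordb"}]}]}
    ]
}
try:
    validate_and_flatten(unextracted, "/tmp/project")
    rejected = False
except ValueError:
    rejected = True
```

The flat list reflects the order in which nodes appear across node lines, and the multi-candidate config is rejected before anything would be instantiated.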
