Implementation:Marker Inc Korea AutoRAG Run Node Line
| Knowledge Sources | |
|---|---|
| Domains | Pipeline Orchestration, RAG Pipeline Optimization |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
run_node_line is the AutoRAG function that executes a sequence of pipeline nodes in order and selects the best module at each node.
Description
The run_node_line function is the core execution engine of AutoRAG's optimization loop. It takes an ordered list of Node objects and processes them sequentially, passing the output of each node's best module as input to the next node. If no previous result is provided, it loads the QA dataset from the project's data/qa.parquet file as the initial input.
For each node in the sequence, the function calls node.run(), which internally evaluates all configured module candidates, computes metrics, applies the selection strategy, and saves results. After each node completes, the function reads the node's summary.csv to extract the best module's metadata (filename, module name, parameters, and execution time) and appends it to a running summary list. Once all nodes have been processed, a node-line-level summary.csv is written to the node line directory, aggregating the best module selections from every node.
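The loop described above can be sketched in plain Python. This is a simplified illustration, not AutoRAG's code: `SketchNode`, `run_line_sketch`, and `add_column` are hypothetical names, lists of dicts stand in for DataFrames, a fixed score stands in for metric evaluation, and the summary files carry only a subset of the real columns.

```python
import csv
import os
import tempfile
from typing import Callable, Dict, List, Tuple

Rows = List[dict]  # plain-Python stand-in for the DataFrame passed between nodes
Module = Callable[[Rows], Tuple[Rows, float]]  # returns (output rows, score)


class SketchNode:
    """Hypothetical stand-in for a node: a node type plus named candidates."""

    def __init__(self, node_type: str, candidates: Dict[str, Module]):
        self.node_type = node_type
        self.candidates = candidates

    def run(self, previous: Rows, node_line_dir: str) -> Rows:
        # Evaluate every candidate and keep the one with the highest score.
        results = {name: module(previous) for name, module in self.candidates.items()}
        best = max(results, key=lambda name: results[name][1])
        node_dir = os.path.join(node_line_dir, self.node_type)
        os.makedirs(node_dir, exist_ok=True)
        with open(os.path.join(node_dir, "summary.csv"), "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["module_name", "is_best"])
            writer.writeheader()
            for name in results:
                writer.writerow({"module_name": name, "is_best": name == best})
        return results[best][0]


def run_line_sketch(nodes: List[SketchNode], node_line_dir: str, previous: Rows) -> Rows:
    summary = []
    for node in nodes:
        # The best module's output becomes the next node's input.
        previous = node.run(previous, node_line_dir)
        path = os.path.join(node_line_dir, node.node_type, "summary.csv")
        with open(path, newline="") as f:
            best_row = next(r for r in csv.DictReader(f) if r["is_best"] == "True")
        summary.append({"node_type": node.node_type, "best_module_name": best_row["module_name"]})
    # Aggregate one row per node into the node-line-level summary.
    with open(os.path.join(node_line_dir, "summary.csv"), "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["node_type", "best_module_name"])
        writer.writeheader()
        writer.writerows(summary)
    return previous


def add_column(column: str, value: str, score: float) -> Module:
    """Build a toy module that adds one column and reports a fixed score."""
    return lambda rows: ([dict(row, **{column: value}) for row in rows], score)


with tempfile.TemporaryDirectory() as tmp:
    nodes = [
        SketchNode("retrieval", {"a": add_column("retrieved", "from_a", 0.4),
                                 "b": add_column("retrieved", "from_b", 0.7)}),
        SketchNode("generation", {"c": add_column("answer", "from_c", 0.9)}),
    ]
    final = run_line_sketch(nodes, tmp, [{"query": "q1"}])
    with open(os.path.join(tmp, "summary.csv"), newline="") as f:
        line_summary = list(csv.DictReader(f))
```

In the toy run, the generation node receives rows that already carry the winning retrieval candidate's column, which is the hand-off behaviour the description refers to.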
Usage
Import and call run_node_line when you need to execute a complete node line within an optimization trial. This function is called by Evaluator.start_trial for each node line defined in the YAML configuration, and by Evaluator.restart_trial when resuming from a partially completed trial. It can also be called directly for programmatic pipeline evaluation.
Code Reference
Source Location
- Repository: AutoRAG
- File: autorag/node_line.py (lines 24-65)
Signature
    def run_node_line(
        nodes: List[Node],
        node_line_dir: str,
        previous_result: Optional[pd.DataFrame] = None,
    ):
        """
        Run the whole node line by running each node.

        :param nodes: A list of nodes.
        :param node_line_dir: This node line's directory.
        :param previous_result: A result of the previous node line.
            If None, it loads qa data from data/qa.parquet.
        :return: The final result of the node line.
        """
Import
from autorag.node_line import run_node_line
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| nodes | List[Node] | yes | Ordered list of Node objects representing the pipeline stages to execute. Each Node contains the node type, module candidates, strategy, and metrics. |
| node_line_dir | str | yes | Path to the directory where this node line's results will be stored. Subdirectories are created for each node type (e.g., node_line_dir/retrieval/, node_line_dir/generation/). |
| previous_result | Optional[pd.DataFrame] | no | The output DataFrame from a previous node line, used as input to the first node. If None, the QA dataset is loaded from project_dir/data/qa.parquet. |
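When previous_result is None, the QA file has to be located from node_line_dir alone. The sketch below shows one way that lookup could work. It assumes a `project_dir/<trial>/<node_line>` layout, which matches the example paths on this page but is not confirmed here as the library's exact logic; `default_qa_path` is a hypothetical helper name.

```python
import os
from pathlib import PurePath


def default_qa_path(node_line_dir: str) -> str:
    """Guess the QA dataset path for a node line directory.

    Assumption: the node line directory sits two levels below the project
    root, i.e. <project_dir>/<trial>/<node_line>.
    """
    project_dir = PurePath(node_line_dir).parent.parent
    return os.path.join(project_dir, "data", "qa.parquet")


qa_path = default_qa_path("my_project/0/pre_retrieve_node_line")
```

If a project uses a different directory depth, passing previous_result explicitly avoids depending on this kind of path inference.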
Outputs
| Name | Type | Description |
|---|---|---|
| result | pd.DataFrame | The output DataFrame from the best module of the last node in the line. This becomes the input to the next node line if one exists. |
| summary.csv | File (side effect) | A CSV file written to node_line_dir/summary.csv containing the best module selection for each node, with columns: node_type, best_module_filename, best_module_name, best_module_params, best_execution_time. |
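After a run, the node-line summary.csv can be read back to see which module won at each node. The following is a minimal sketch using only the standard library; `best_modules` is a hypothetical helper, and the row written in the demo contains placeholder values rather than real AutoRAG output.

```python
import csv
import os
import tempfile

# Column names as listed in the Outputs table.
SUMMARY_COLUMNS = [
    "node_type",
    "best_module_filename",
    "best_module_name",
    "best_module_params",
    "best_execution_time",
]


def best_modules(summary_csv_path: str) -> dict:
    """Map each node type to the name of its selected module."""
    with open(summary_csv_path, newline="") as f:
        reader = csv.DictReader(f)
        missing = set(SUMMARY_COLUMNS) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"summary.csv is missing columns: {sorted(missing)}")
        return {row["node_type"]: row["best_module_name"] for row in reader}


# Demo with a placeholder row (values are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "summary.csv")
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=SUMMARY_COLUMNS)
        writer.writeheader()
        writer.writerow({
            "node_type": "retrieval",
            "best_module_filename": "0.parquet",
            "best_module_name": "module_a",
            "best_module_params": "{}",
            "best_execution_time": "0.1",
        })
    selected = best_modules(path)
```

The column check fails early if the file comes from a different version or was edited by hand, rather than raising a KeyError partway through.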
Usage Examples
Basic Usage
    import pandas as pd
    from autorag.schema import Node
    from autorag.node_line import run_node_line

    # Construct nodes from a YAML configuration dictionary.
    # The [...] values are placeholders for the module candidate lists.
    node_dicts = [
        {"node_type": "retrieval", "strategy": {"metrics": ["retrieval_f1"]}, "modules": [...]},
        {"node_type": "generation", "strategy": {"metrics": ["bleu"]}, "modules": [...]},
    ]
    nodes = [Node.from_dict(d) for d in node_dicts]

    # Load initial QA data
    qa_data = pd.read_parquet("my_project/data/qa.parquet", engine="pyarrow")

    # Run the node line
    final_result = run_node_line(
        nodes=nodes,
        node_line_dir="my_project/0/pre_retrieve_node_line",
        previous_result=qa_data,
    )
    print(f"Final result columns: {list(final_result.columns)}")
Chaining Node Lines
    from autorag.node_line import run_node_line

    # retrieval_nodes, generation_nodes, and qa_data are assumed to be
    # built beforehand, in the same way as in Basic Usage.

    # Run first node line
    result_1 = run_node_line(
        nodes=retrieval_nodes,
        node_line_dir="my_project/0/retrieve_node_line",
        previous_result=qa_data,
    )

    # Pass the output to the second node line
    result_2 = run_node_line(
        nodes=generation_nodes,
        node_line_dir="my_project/0/post_retrieve_node_line",
        previous_result=result_1,
    )
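With more than two node lines, the same hand-off can be written as a loop. The helper below is a sketch, not part of AutoRAG: `chain_node_lines` is a hypothetical name, and the demo uses a stand-in `fake_run` so the snippet runs without the library installed. It relies only on the keyword arguments shown in the signature above.

```python
from typing import Any, Callable, Iterable, List, Tuple


def chain_node_lines(
    run_fn: Callable[..., Any],
    node_lines: Iterable[Tuple[List[Any], str]],
    initial: Any = None,
) -> Any:
    """Run each (nodes, node_line_dir) pair, feeding every result forward."""
    result = initial
    for nodes, node_line_dir in node_lines:
        result = run_fn(
            nodes=nodes,
            node_line_dir=node_line_dir,
            previous_result=result,
        )
    return result


# Stand-in for run_node_line that records what it was given.
calls = []


def fake_run(nodes, node_line_dir, previous_result=None):
    calls.append((node_line_dir, previous_result))
    return f"result of {node_line_dir}"


final = chain_node_lines(
    fake_run,
    [(["n1"], "line_a"), (["n2"], "line_b")],
    initial="qa",
)
```

In real use, `run_fn` would be `run_node_line` and each tuple would pair a list of Node objects with its node line directory, for example `(retrieval_nodes, "my_project/0/retrieve_node_line")`.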