Workflow:Spcl Graph of thoughts GoT Document Merging Pipeline

Knowledge Sources	Graph of Thoughts; Graph of Thoughts: Solving Elaborate Problems with Large Language Models
Domains	LLM_Reasoning, Graph_Based_Inference, Document_Processing, Benchmarking
Last Updated	2026-02-14 04:00 GMT

Overview

End-to-end process for merging multiple NDA (Non-Disclosure Agreement) documents into a single consolidated NDA using Graph of Thoughts, which generates multiple merge candidates, scores them for information retention and redundancy, and refines the best result.

Description

This workflow applies the Graph of Thoughts (GoT) framework to document merging, a generation task whose goal is to combine four NDA documents into one that maximizes information retention while minimizing redundancy. Unlike the sorting and keyword counting tasks, this task is scored by an LLM rather than by comparison with ground truth, because merge quality is subjective. The GoT approach generates multiple merge candidates, scores them on LLM-rated redundancy and retention, keeps the best candidates, aggregates them into improved versions, and refines the result iteratively. Five reasoning approaches are compared: IO (input-output), CoT (Chain-of-Thought), ToT (Tree of Thoughts), GoT (full document merging), and GoT2 (partial document merging with hierarchical aggregation).

Usage

Execute this workflow when you have multiple documents covering similar topics and need to merge them into a single comprehensive document. It is appropriate when document quality must be evaluated subjectively (information coverage vs. redundancy) rather than against a deterministic ground truth, and when exploring how different LLM reasoning topologies affect generation quality.

Execution Steps

Step 1: Document Corpus Loading

Load the benchmark dataset from a CSV file in which each sample is a set of four source NDA documents to be merged into one. Also load the pure document templates from a JSON corpus of legal NDA texts.

Key considerations:

Each sample provides exactly four NDA documents as input
Documents are stored in CSV columns (one document per column)
The corpus includes realistic legal language for meaningful evaluation
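
The loading step described above can be sketched as follows. This is a minimal illustration, not the workflow's actual loader: the column names (`document1` to `document4`) and the function name `load_samples` are assumptions, and an in-memory string stands in for the benchmark CSV file.

```python
import csv
import io

# Hypothetical column names; adjust them to match the real CSV header.
DOC_COLUMNS = ("document1", "document2", "document3", "document4")

def load_samples(csv_file, doc_columns=DOC_COLUMNS):
    """Yield one list of four document strings per CSV row."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        docs = [row[col] for col in doc_columns]
        # Fail early if a sample is missing one of its four documents.
        if any(not doc.strip() for doc in docs):
            raise ValueError(f"line {reader.line_num}: empty document column")
        yield docs

# In-memory stand-in for the benchmark file (placeholder content).
demo = io.StringIO(
    "id,document1,document2,document3,document4\n"
    '0,"NDA A","NDA B","NDA C","NDA D"\n'
)
samples = list(load_samples(demo))
```

Checking for empty columns at load time keeps a malformed sample from silently producing a three-document merge later in the pipeline.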

Step 2: Graph of Operations Construction

Build the Graph of Operations for the chosen approach. For GoT (full merge), the graph generates 5 merge candidates, scores each with LLM-based evaluation (3 scoring rounds per candidate), keeps the top 3, aggregates them into improved versions, scores again, keeps the best, then generates 10 improvement candidates and scores to select the final result. For GoT2 (partial merge), pairs of documents are merged first, then the partial merges are hierarchically aggregated.

What happens:

GoT: Generate(5) → Score(3) → KeepBestN(3) → Aggregate(5) → Score(3) → KeepBestN(1) → Generate(10) → Score(3) → KeepBestN(1)
GoT2: Pairwise Selector → Generate(5) → Score(3) → KeepBestN(1) per pair, then hierarchical Aggregate → Score → KeepBestN with improvement rounds
Scoring is LLM-based: the model rates redundancy (0-10) and retention (0-10), and the two ratings are combined into a single F1 score (their harmonic mean)
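
The scoring rule in the list above can be sketched as follows. This is a minimal sketch under two assumptions: both ratings are oriented so that higher is better, and the ratings from the scoring rounds are averaged per metric before they are combined. The function name `combined_score` is hypothetical and not taken from the GoT codebase.

```python
from statistics import fmean

def combined_score(rounds):
    """Average each rating over the scoring rounds, then take the harmonic mean.

    `rounds` is a list of (redundancy, retention) pairs, each on a 0-10 scale.
    """
    if not rounds:
        raise ValueError("at least one scoring round is required")
    redundancy = fmean([r for r, _ in rounds])
    retention = fmean([t for _, t in rounds])
    if redundancy + retention == 0:
        return 0.0  # avoid division by zero when both ratings are zero
    return 2 * redundancy * retention / (redundancy + retention)

# Three scoring rounds for one candidate (illustrative ratings only).
balanced = combined_score([(8, 9), (7, 9), (9, 9)])   # means 8 and 9 -> 144/17
lopsided = combined_score([(10, 2)])                  # 40/12
```

The harmonic mean penalizes lopsided candidates: a merge rated 10 on one metric and 2 on the other scores about 3.33, well below the arithmetic mean of 6.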

Step 3: Prompter and Parser Configuration

Instantiate the DocMergePrompter and DocMergeParser. The Prompter generates merge prompts that present all source documents and ask for a consolidated NDA, scoring prompts that ask the LLM to evaluate redundancy and information retention, and aggregate prompts that combine multiple merge candidates. The Parser extracts merged documents from tagged output (between Merged tags), parses numeric scores from tagged sections (Redundancy and Retained tags), and handles edge cases in LLM responses.

Key considerations:

Merge prompts include all source documents inline for context
Score prompts request structured output with XML-style tags
The F1 score combines redundancy and retention into a single metric
Partial merge prompts in GoT2 only include the relevant document subset
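
The tag-based parsing described in this step might look like the sketch below. It assumes the tags are written as `<Merged>`, `<Redundancy>`, and `<Retained>` with matching closing tags; the helper names `extract_tag` and `parse_rating` are hypothetical and do not reproduce the DocMergeParser implementation.

```python
import re
from typing import Optional

def extract_tag(text: str, tag: str) -> Optional[str]:
    """Return the stripped content of the first <tag>...</tag> pair, or None."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

def parse_rating(text: str, tag: str) -> Optional[float]:
    """Return the first number found inside <tag>...</tag>, or None."""
    body = extract_tag(text, tag)
    if body is None:
        return None
    number = re.search(r"\d+(?:\.\d+)?", body)
    return float(number.group()) if number else None

# Illustrative LLM responses (made up for this example).
score_response = "Reasoning...\n<Redundancy>7</Redundancy>\n<Retained>9/10</Retained>"
merge_response = "Here is the result:\n<Merged>\nMerged NDA text\n</Merged>"

redundancy = parse_rating(score_response, "Redundancy")
retained = parse_rating(score_response, "Retained")
merged = extract_tag(merge_response, "Merged")
```

Returning `None` instead of raising lets the caller decide how to treat a malformed response, for example by discarding that scoring round.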

Step 4: Controller Execution with LLM Scoring

Execute the Controller, which processes operations in breadth-first (BFS) order. What distinguishes this workflow is that scoring requires LLM calls rather than local functions, which makes each Score operation significantly more expensive. Each Score operation queries the LLM 3 times and averages the redundancy and retention ratings.

What happens:

Generate operations produce merge candidates by querying the LLM
Score operations query the LLM to evaluate each candidate (3 rounds averaged)
KeepBestN selects top candidates by F1 score
Aggregate operations combine the best candidates into refined versions
The improvement loop generates more candidates from the best aggregate
Each thought state tracks both the merged document and the source document set
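
To see how LLM-based scoring affects cost, the sketch below counts the LLM responses requested by the GoT stage sequence from Step 2. It is a simplified model, not the framework's accounting: it assumes every generated candidate and every scoring round counts as one response, that Aggregate merges all surviving thoughts into its candidates, and that each Score stage only sees the thoughts produced by the stage before it. The `Stage` class and `count_responses` function are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    kind: str  # "generate", "aggregate", "score", or "keep"
    n: int     # candidates produced, scoring rounds, or thoughts kept

# Stage sequence for the GoT (full merge) approach as listed in Step 2.
GOT_STAGES = [
    Stage("generate", 5), Stage("score", 3), Stage("keep", 3),
    Stage("aggregate", 5), Stage("score", 3), Stage("keep", 1),
    Stage("generate", 10), Stage("score", 3), Stage("keep", 1),
]

def count_responses(stages):
    """Return (generation_responses, scoring_responses) for one sample."""
    live = 1       # thoughts alive at the current stage, starting from the root
    generated = 0
    scored = 0
    for stage in stages:
        if stage.kind == "generate":
            live *= stage.n           # every live thought spawns n candidates
            generated += live
        elif stage.kind == "aggregate":
            live = stage.n            # all live thoughts merge into n new ones
            generated += live
        elif stage.kind == "score":
            scored += live * stage.n  # n scoring rounds per live thought
        elif stage.kind == "keep":
            live = min(live, stage.n)
        else:
            raise ValueError(f"unknown stage kind: {stage.kind}")
    return generated, scored

generation, scoring = count_responses(GOT_STAGES)
```

Under these assumptions the scoring stages request 60 responses against 20 for generation and aggregation, so scoring accounts for three quarters of the responses per sample.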

Step 5: Result Serialization

Serialize the complete execution graph to JSON. Because document content is large, the output files are substantially bigger than for the sorting or keyword counting tasks. Thought states store the full merged document text alongside scores and metadata.

Key considerations:

Parts sets in thought states must be converted from Python sets to lists for JSON serialization
Each result file captures the full document content at each operation stage
Cost tracking is critical since LLM-based scoring multiplies API usage
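
The set-to-list conversion mentioned above can be handled with the `default` hook of `json.dumps`. This is a minimal sketch; the thought-state keys (`current`, `parts`, `score`) and their values are illustrative assumptions, not the exact schema of the result files.

```python
import json

def to_jsonable(obj):
    """Fallback for json.dumps: turn sets into sorted lists, reject the rest."""
    if isinstance(obj, (set, frozenset)):
        return sorted(obj)
    raise TypeError(f"not JSON serializable: {type(obj).__name__}")

# Hypothetical thought state; "parts" records which source documents it covers.
thought_state = {
    "current": "merged NDA text",
    "parts": {2, 0, 1, 3},
    "score": 8.5,
}

serialized = json.dumps(thought_state, default=to_jsonable)
restored = json.loads(serialized)
```

Sorting the set gives a deterministic file, which makes result files easier to diff between runs. Note that the round trip yields a list, so code that reads the file back must rebuild the set if it needs one.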

Step 6: Quality Evaluation and Comparison

Run all five approaches across the sample set and compare the results using the plotting script. Quality is measured by the LLM-assigned F1 score (the harmonic mean of the redundancy and retention ratings). The GoT approaches typically produce higher-quality merges than the linear IO and CoT baselines.

Key considerations:

No ground truth exists; quality is entirely LLM-evaluated
The plot script generates comparative charts across all methods
GoT2 (partial merge) can outperform GoT (full merge) for complex document sets
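
Before plotting, the per-sample F1 scores can be reduced to summary statistics per method. The sketch below is an assumption about how such a comparison could be done, not the plotting script itself; the function name `summarize` and the input layout are hypothetical, and the numbers are placeholders rather than measured results.

```python
from statistics import fmean, median

def summarize(scores_by_method):
    """Map each method to (mean, median, count) of its per-sample F1 scores."""
    return {
        method: (fmean(scores), median(scores), len(scores))
        for method, scores in scores_by_method.items()
        if scores  # skip methods with no completed samples
    }

# Placeholder values for illustration only; these are not benchmark results.
toy_scores = {
    "method_a": [6.0, 7.0, 8.0],
    "method_b": [7.0, 8.0, 9.0],
    "method_c": [],
}
summary = summarize(toy_scores)
```

Because every score comes from an LLM judge, reporting the median alongside the mean helps show whether a few unusually high or low ratings drive the difference between methods.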

Execution Diagram

GitHub URL

Workflow Repository