Workflow:Spcl Graph of thoughts GoT Document Merging Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLM_Reasoning, Graph_Based_Inference, Document_Processing, Benchmarking |
| Last Updated | 2026-02-14 04:00 GMT |
Overview
End-to-end process for merging multiple NDA (Non-Disclosure Agreement) documents into a single consolidated NDA using Graph of Thoughts, which generates multiple merge candidates, scores them for information retention and redundancy, and refines the best result.
Description
This workflow applies the GoT framework to document merging, a creative generation task where the goal is to combine four NDA documents into one that maximizes information retention while minimizing redundancy. Unlike sorting or keyword counting, this task uses LLM-based scoring (not ground truth comparison) since quality is subjective. The GoT approach generates multiple merge candidates, scores them using LLM-evaluated redundancy and retention metrics, selects the best candidates, aggregates them into improved versions, and refines the result through iterative improvement. Five reasoning approaches are compared: IO (direct input-output prompting), CoT (Chain of Thought), ToT (Tree of Thoughts), GoT (full document merging), and GoT2 (partial document merging with hierarchical aggregation).
Usage
Execute this workflow when you have multiple documents covering similar topics and need to merge them into a single comprehensive document. It is appropriate when document quality must be evaluated subjectively (information coverage vs. redundancy) rather than against a deterministic ground truth, and when exploring how different LLM reasoning topologies affect generation quality.
Execution Steps
Step 1: Document Corpus Loading
Load the benchmark dataset from a CSV file in which each sample holds a set of four NDA source documents to be merged into one. Load the pure document templates from a JSON corpus of legal NDA texts.
Key considerations:
- Each sample provides exactly four NDA documents as input
- Documents are stored in CSV columns (one document per column)
- The corpus includes realistic legal language for meaningful evaluation
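The loading step can be sketched as follows. This is a minimal illustration, not the benchmark's actual loader: the column names (`document1` to `document4`) and the `load_samples` helper are assumptions, and the real CSV header may differ.

```python
import csv
import io

# Hypothetical column names; the real CSV header may differ.
DOC_COLUMNS = ["document1", "document2", "document3", "document4"]


def load_samples(csv_file):
    """Return one list of four document strings per CSV row."""
    samples = []
    for row in csv.DictReader(csv_file):
        docs = [row[col] for col in DOC_COLUMNS]
        # Every sample must provide exactly four non-empty documents.
        if any(not doc.strip() for doc in docs):
            raise ValueError("each sample must provide four non-empty documents")
        samples.append(docs)
    return samples


# In-memory stand-in for the benchmark file.
demo = io.StringIO(
    "id,document1,document2,document3,document4\n"
    "0,NDA text A,NDA text B,NDA text C,NDA text D\n"
)
samples = load_samples(demo)
```

Passing an open file handle instead of the `io.StringIO` stand-in would read a CSV from disk in the same way.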
Step 2: Graph of Operations Construction
Build the Graph of Operations for the chosen approach. For GoT (full merge), the graph generates 5 merge candidates, scores each with LLM-based evaluation (3 scoring rounds per candidate), keeps the top 3, aggregates them into improved versions, scores again, keeps the best, then generates 10 improvement candidates and scores to select the final result. For GoT2 (partial merge), pairs of documents are merged first, then the partial merges are hierarchically aggregated.
What happens:
- GoT: Generate(5) → Score(3) → KeepBestN(3) → Aggregate(5) → Score(3) → KeepBestN(1) → Generate(10) → Score(3) → KeepBestN(1)
- GoT2: Pairwise Selector → Generate(5) → Score(3) → KeepBestN(1) per pair, then hierarchical Aggregate → Score → KeepBestN with improvement rounds
- Scoring is LLM-based: the model rates redundancy (0-10) and retention (0-10), and the two ratings are combined into a single F1 score (their harmonic mean)
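The F1 combination of the two ratings can be written out as a short function. This sketch assumes both ratings are on the 0-10 scale with higher meaning better, and it returns 0 when either rating is 0 to avoid a division by zero. The name `f1_score` is illustrative and not taken from the framework.

```python
def f1_score(redundancy: float, retention: float) -> float:
    """Harmonic mean of two 0-10 ratings; 0.0 if either rating is 0."""
    if redundancy <= 0 or retention <= 0:
        return 0.0
    return 2 * redundancy * retention / (redundancy + retention)
```

Because the harmonic mean is pulled toward the smaller value, a merge that keeps all information but is highly redundant (or the reverse) scores lower than one that balances both ratings.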
Step 3: Prompter and Parser Configuration
Instantiate the DocMergePrompter and DocMergeParser. The Prompter generates merge prompts that present all source documents and ask for a consolidated NDA, scoring prompts that ask the LLM to evaluate redundancy and information retention, and aggregate prompts that combine multiple merge candidates. The Parser extracts merged documents from tagged output (between Merged tags), parses numeric scores from tagged sections (Redundancy and Retained tags), and handles edge cases in LLM responses.
Key considerations:
- Merge prompts include all source documents inline for context
- Score prompts request structured output with XML-style tags
- The F1 score combines redundancy and retention into a single metric
- Partial merge prompts in GoT2 only include the relevant document subset
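The tag-based parsing described above might look like the following sketch. It assumes the tags appear as `<Merged>`, `<Redundancy>`, and `<Retained>` with matching closing tags. The helper names `extract_tag` and `parse_rating` are hypothetical and do not reflect the actual DocMergeParser interface.

```python
import re
from typing import Optional


def extract_tag(text: str, tag: str) -> Optional[str]:
    """Return the text between <tag> and </tag>, or None if the tag is absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else None


def parse_rating(text: str, tag: str) -> Optional[float]:
    """Return the first number inside the tagged section, or None if missing."""
    body = extract_tag(text, tag)
    if body is None:
        return None
    number = re.search(r"\d+(?:\.\d+)?", body)
    return float(number.group()) if number else None


# Example LLM response (invented for illustration).
response = "Some reasoning.\n<Redundancy>7</Redundancy>\n<Retained>9.5</Retained>"
redundancy = parse_rating(response, "Redundancy")
retained = parse_rating(response, "Retained")
```

Returning `None` for missing or non-numeric sections lets the caller decide how to treat a malformed response, for example by discarding that scoring round.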
Step 4: Controller Execution with LLM Scoring
Execute the Controller, which processes operations in breadth-first (BFS) order. The distinguishing feature of this workflow is that scoring requires LLM calls (not local functions), making each Score operation significantly more expensive than in the sorting or keyword counting tasks. Each Score operation queries the LLM 3 times and averages the redundancy and retention ratings.
What happens:
- Generate operations produce merge candidates by querying the LLM
- Score operations query the LLM to evaluate each candidate (3 rounds averaged)
- KeepBestN selects top candidates by F1 score
- Aggregate operations combine the best candidates into refined versions
- The improvement loop generates more candidates from the best aggregate
- Each thought state tracks both the merged document and the source document set
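The Score and KeepBestN behaviour can be sketched without the framework. In this illustration the LLM is replaced by a stub function returning fixed `(redundancy, retention)` pairs. The names `score_candidate`, `keep_best_n`, and `fake_rate` are hypothetical and are not the Controller's real interface.

```python
from statistics import mean
from typing import Callable, List, Tuple

# A rater stands in for one LLM scoring call: (redundancy, retention).
Rater = Callable[[str], Tuple[float, float]]


def score_candidate(candidate: str, rate: Rater, rounds: int = 3) -> float:
    """Query the rater `rounds` times, average each rating, then combine via F1."""
    ratings = [rate(candidate) for _ in range(rounds)]
    redundancy = mean(r for r, _ in ratings)
    retention = mean(t for _, t in ratings)
    if redundancy + retention == 0:
        return 0.0
    return 2 * redundancy * retention / (redundancy + retention)


def keep_best_n(candidates: List[str], scores: List[float], n: int) -> List[str]:
    """Return the n candidates with the highest scores, best first."""
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in ranked[:n]]


# Stub ratings invented for illustration; a real run would call the LLM.
fake_ratings = {"draft A": (6.0, 8.0), "draft B": (9.0, 9.0), "draft C": (2.0, 4.0)}


def fake_rate(candidate: str) -> Tuple[float, float]:
    return fake_ratings[candidate]


drafts = list(fake_ratings)
scores = [score_candidate(d, fake_rate) for d in drafts]
best = keep_best_n(drafts, scores, n=2)
```

With three scoring rounds per candidate, the number of LLM calls for scoring alone is three times the number of candidates, which is why this step dominates cost.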
Step 5: Result Serialization
Serialize the complete execution graph to JSON. Because document content is large, the output files are substantially bigger than for the sorting or keyword counting tasks. Thought states store the full merged document text alongside scores and metadata.
Key considerations:
- The parts sets stored in thought states must be converted from Python sets to lists before JSON serialization, because the JSON encoder cannot serialize sets
- Each result file captures the full document content at each operation stage
- Cost tracking is critical since LLM-based scoring multiplies API usage
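The set-to-list conversion can be done with a small recursive helper before calling `json.dumps`. This is a sketch under assumptions: the state keys (`current`, `parts`, `score`) and the `to_jsonable` helper are illustrative, not the framework's actual schema.

```python
import json


def to_jsonable(value):
    """Recursively replace sets with sorted lists so json.dumps accepts the value."""
    if isinstance(value, (set, frozenset)):
        return sorted(value)
    if isinstance(value, dict):
        return {key: to_jsonable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(item) for item in value]
    return value


# Hypothetical thought state; the key names are assumptions.
state = {"current": "merged NDA text", "parts": {2, 0, 1, 3}, "score": 8.5}
serialized = json.dumps(to_jsonable(state))
```

Sorting the set gives a deterministic list order, which keeps result files stable across runs. It assumes the set elements are mutually comparable.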
Step 6: Quality Evaluation and Comparison
Run all five approaches across the sample set and compare them using the plotting script. Quality is measured by the LLM-assigned F1 score (harmonic mean of the redundancy and retention ratings). The GoT approaches typically produce higher-scoring merges than the linear methods (IO and CoT).
Key considerations:
- No ground truth exists; quality is entirely LLM-evaluated
- The plot script generates comparative charts across all methods
- GoT2 (partial merge) can outperform GoT (full merge) for complex document sets
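Before plotting, per-method results can be summarized by averaging the final F1 score over samples. The sketch below assumes results are available as `(method, score)` pairs. The function name, method labels, and numbers are placeholders for illustration only, not benchmark results.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_method(records):
    """Group (method, score) pairs by method and return the mean score of each."""
    grouped = defaultdict(list)
    for method, score in records:
        grouped[method].append(score)
    return {method: mean(scores) for method, scores in grouped.items()}


# Placeholder values for illustration only, not benchmark results.
records = [("method_a", 5.0), ("method_a", 7.0), ("method_b", 8.0), ("method_b", 6.0)]
summary = mean_score_by_method(records)
```

Since every score comes from the same LLM-based evaluator, such averages are comparable across methods but do not measure quality against an external ground truth.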