Workflow: SPCL Graph of Thoughts (GoT) Keyword Counting Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLM_Reasoning, Graph_Based_Inference, NLP, Benchmarking |
| Last Updated | 2026-02-14 04:00 GMT |
Overview
End-to-end process for counting keyword (country name) frequencies in text passages using Graph of Thoughts: the text is decomposed into paragraphs or sentences, keywords are counted in each part independently, and the partial results are aggregated with validation and improvement.
Description
This workflow applies the GoT framework to keyword frequency counting in natural language text. The task is to count how many times each country name appears in a passage. The GoT approach splits the input text into sub-passages (4 paragraphs, 8 paragraphs, or individual sentences depending on granularity), counts keywords in each sub-passage independently, and hierarchically aggregates the frequency dictionaries. A ValidateAndImprove operation checks aggregation correctness and retries if the merged counts are inconsistent. Seven reasoning approaches are benchmarked: IO, CoT, ToT, ToT2, GoT4, GoT8, and GoTx (sentence-level decomposition).
Usage
Execute this workflow when you have a text passage containing multiple occurrences of known keywords and need to count their frequencies accurately. It is particularly useful for benchmarking how different LLM reasoning topologies handle information extraction tasks where the input is too long for reliable single-pass processing.
Execution Steps
Step 1: Dataset Preparation
Load the benchmark dataset from a CSV file containing text passages and their ground truth keyword frequency lists. Extract the complete set of possible country names across all samples to provide context for scoring functions.
Key considerations:
- The dataset generator creates passages with known country name frequencies
- Ground truth is stored as a list of country names (converted to frequency dicts for comparison)
- The full list of possible countries is needed for the local scoring function
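The preparation step above can be sketched as follows. The CSV layout (an `id`, the passage text, and a semicolon-separated ground-truth country list) is an illustrative assumption, not the repository's exact schema; the key point is the conversion of the ground-truth name list into a frequency dict and the extraction of the full country set.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV layout for illustration only.
SAMPLE_CSV = (
    "id,text,countries\n"
    '0,"France borders Spain. France also borders Italy.",France;Spain;France;Italy\n'
)

def load_samples(csv_text):
    samples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        names = row["countries"].split(";")
        samples.append({
            "id": int(row["id"]),
            "text": row["text"],
            # ground truth is stored as a list of names; convert it to a
            # frequency dict for comparison against LLM outputs
            "truth": dict(Counter(names)),
        })
    return samples

samples = load_samples(SAMPLE_CSV)
# complete set of possible countries, needed by the local scoring function
all_countries = sorted({name for s in samples for name in s["truth"]})
```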
Step 2: Graph of Operations Construction
Build the Graph of Operations according to the chosen decomposition granularity. For GoT4, the graph splits text into 4 paragraphs; for GoT8, into 8 paragraphs; for GoTx, into individual sentences. Each sub-passage goes through a parallel branch of Generate (10 candidates) → Score (count errors against local sub-text) → KeepBestN (1). Sub-passage results are then aggregated in a binary tree pattern using Aggregate → ValidateAndImprove → Score → KeepBestN at each level.
What happens:
- Generate splits the text into N sub-passages via the Prompter
- Selector operations route each sub-passage to its own counting branch
- Each branch generates multiple frequency dict candidates and keeps the best
- Hierarchical pairwise aggregation merges frequency dicts bottom-up
- ValidateAndImprove checks that merged counts equal the sum of parts
- If validation fails, the Improve prompt asks the LLM to fix the aggregation
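The topology described above can be sketched structurally. This is not the library's actual operation API; it is a plain-Python outline showing how the per-sub-passage branches and the binary-tree aggregation levels compose for a given granularity.

```python
# Structural sketch of the GoT operation graph (illustrative names, not the
# graph-of-thoughts library API). Each sub-passage gets its own
# Generate -> Score -> KeepBestN branch; branch outputs are merged pairwise,
# with a ValidateAndImprove gate after every Aggregate.
def build_got_graph(num_parts):
    ops = [("Generate", "split passage into sub-passages")]
    for i in range(num_parts):
        ops += [
            (f"Selector[{i}]", "route sub-passage to its branch"),
            (f"Generate[{i}]", "10 frequency-dict candidates"),
            (f"Score[{i}]", "count errors against local sub-text"),
            (f"KeepBestN[{i}]", "keep 1"),
        ]
    level, width = 0, num_parts
    while width > 1:                     # binary-tree aggregation, bottom-up
        width //= 2
        for i in range(width):
            ops += [
                (f"Aggregate[{level}.{i}]", "merge a pair of dicts"),
                (f"ValidateAndImprove[{level}.{i}]", "check sums, retry on failure"),
                (f"Score[{level}.{i}]", "count errors"),
                (f"KeepBestN[{level}.{i}]", "keep 1"),
            ]
        level += 1
    return ops

got4 = build_got_graph(4)   # GoT4: two aggregation levels, three merges
```

For GoT8 the same builder is called with 8 parts; for GoTx, with the number of sentences in the passage.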
Step 3: Prompter and Parser Configuration
Instantiate the KeywordCountingPrompter and KeywordCountingParser. The Prompter generates text-splitting prompts (paragraph or sentence level), keyword counting prompts with few-shot examples, aggregation prompts for merging frequency dictionaries, and improvement prompts for fixing incorrect aggregations. The Parser extracts JSON frequency dictionaries from LLM responses and manages sub-text routing through thought state metadata.
Key considerations:
- Prompts output JSON frequency dictionaries for structured parsing
- The split prompt varies by granularity (4 paragraphs, 8 paragraphs, or sentences)
- The improve prompt for aggregation shows both partial results and the incorrect merge
- Phase tracking in thought state controls prompt selection
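The extraction side of the Parser can be sketched as below. This is a simplified stand-in for the KeywordCountingParser's logic, assuming the prompt instructs the model to emit a flat JSON object of keyword-to-count pairs; the repository's actual parsing may be more elaborate.

```python
import json
import re

def parse_frequency_dict(response):
    """Pull the first flat JSON object out of an LLM response.

    Returns {} when no well-formed dict is found, so a failed parse simply
    yields a zero-score candidate instead of crashing the pipeline.
    """
    match = re.search(r"\{[^{}]*\}", response, re.DOTALL)
    if not match:
        return {}
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    # keep only string keys mapped to non-negative integer counts
    return {k: v for k, v in parsed.items()
            if isinstance(k, str) and isinstance(v, int) and v >= 0}

reply = 'Here are the counts:\n{"France": 2, "Peru": 1}\nDone.'
```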
Step 4: Controller Execution with Validation
Execute the Controller, which processes the operation DAG in BFS order. The key differentiator in this workflow is the ValidateAndImprove operation: after each aggregation, a validation function checks whether the merged frequency dictionary is consistent with its inputs. If validation fails, the Controller triggers the Improve path, which asks the LLM to fix the aggregation, up to a configurable number of retries.
What happens:
- Operations execute when all predecessors are complete
- Scoring uses a local function that counts errors against ground truth or sub-text
- ValidateAndImprove checks: all keys from both inputs appear in output, and frequencies sum correctly
- Failed validations trigger the Improve prompt with the incorrect merge and both inputs
- The retry budget limits how many correction attempts are made
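The consistency check at the heart of ValidateAndImprove reduces to a simple invariant: the merged dict must be the element-wise sum of its two inputs. A minimal sketch (the repository's exact implementation may differ):

```python
from collections import Counter

def validate_aggregation(left, right, merged):
    """Return True iff `merged` is exactly the element-wise sum of the two
    input frequency dicts: every key from either input appears in the
    output, and each frequency equals the sum of the input frequencies."""
    expected = Counter(left)
    expected.update(right)          # element-wise sum of the two dicts
    return dict(expected) == merged
```

On a False result, the Controller would re-prompt the LLM with both inputs and the incorrect merge, up to the configured retry budget.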
Step 5: Result Serialization and Cost Tracking
Serialize the full execution graph to JSON, capturing all operations, thought states, scores, and validation results. Track cumulative token usage and API cost across all operations for the sample.
Key considerations:
- Each method-sample combination produces a separate result file
- Budget is shared across all methods; execution halts when depleted
- Cost tracking covers both prompt and completion tokens
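The shared-budget accounting can be sketched as a small tracker. The per-1k-token prices here are illustrative assumptions, not the benchmark's actual rates:

```python
class BudgetTracker:
    """Accumulate API cost across methods against one shared budget.

    Prices (per 1k tokens) are illustrative assumptions; the benchmark's
    actual rates depend on the model used.
    """
    def __init__(self, budget_usd, prompt_price=0.03, completion_price=0.06):
        self.budget = budget_usd
        self.spent = 0.0
        self.prompt_price = prompt_price
        self.completion_price = completion_price

    def charge(self, prompt_tokens, completion_tokens):
        # cost covers both prompt and completion tokens
        self.spent += (prompt_tokens * self.prompt_price
                       + completion_tokens * self.completion_price) / 1000.0

    @property
    def depleted(self):
        # execution halts for all remaining methods once this turns True
        return self.spent >= self.budget

tracker = BudgetTracker(budget_usd=1.0)
tracker.charge(prompt_tokens=1000, completion_tokens=500)
```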
Step 6: Multi-method Comparison
Run all seven approaches across the sample set and use the plotting script to generate comparative accuracy and cost charts. The GoTx (sentence-level) approach typically achieves the highest accuracy but at greater cost, while GoT4 offers a good accuracy-cost balance.
Key considerations:
- Seven approaches provide a comprehensive reasoning topology comparison
- The plotting script reads result directories and generates publication-ready figures
- Accuracy is measured as the number of correct frequency entries
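The per-sample accuracy measure described above can be sketched as an entry-wise comparison of predicted and ground-truth frequencies (the paper's exact scoring function may differ):

```python
def correct_entries(truth, predicted):
    """Number of keyword entries whose predicted frequency matches ground
    truth; a missing key counts as a predicted frequency of zero."""
    keys = set(truth) | set(predicted)
    return sum(1 for k in keys if truth.get(k, 0) == predicted.get(k, 0))
```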