Workflow: SPCL Graph of Thoughts (GoT) Keyword Counting Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLM_Reasoning, Graph_Based_Inference, NLP, Benchmarking |
| Last Updated | 2026-02-14 04:00 GMT |
Overview
End-to-end process for counting keyword (country name) frequencies in text passages using Graph of Thoughts: the text is decomposed into paragraphs or sentences, keywords are counted in each part independently, and the partial results are aggregated with validation and improvement.
Description
This workflow applies the GoT framework to keyword frequency counting in natural language text. The task is to count how many times each country name appears in a passage. The GoT approach splits the input text into sub-passages (4 paragraphs, 8 paragraphs, or individual sentences depending on granularity), counts keywords in each sub-passage independently, and hierarchically aggregates the frequency dictionaries. A ValidateAndImprove operation checks aggregation correctness and retries if the merged counts are inconsistent. Seven reasoning approaches are benchmarked: IO, CoT, ToT, ToT2, GoT4, GoT8, and GoTx (sentence-level decomposition).
Usage
Execute this workflow when you have a text passage containing multiple occurrences of known keywords and need to count their frequencies accurately. It is particularly useful for benchmarking how different LLM reasoning topologies handle information extraction tasks where the input is too long for reliable single-pass processing.
Execution Steps
Step 1: Dataset Preparation
Load the benchmark dataset from a CSV file containing text passages and their ground truth keyword frequency lists. Extract the complete set of possible country names across all samples to provide context for scoring functions.
Key considerations:
- The dataset generator creates passages with known country name frequencies
- Ground truth is stored as a list of country names (converted to frequency dicts for comparison)
- The full list of possible countries is needed for the local scoring function
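The preparation step above can be sketched as follows. The CSV layout (an `id`, the passage text, and a semicolon-separated ground-truth country list) is an illustrative assumption, not the repository's exact schema; the key point is the conversion of the ground-truth name list into a frequency dict and the extraction of the full country set.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV layout for illustration only.
SAMPLE_CSV = (
    "id,text,countries\n"
    '0,"France borders Spain. France also borders Italy.",France;Spain;France;Italy\n'
)

def load_samples(csv_text):
    samples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        names = row["countries"].split(";")
        samples.append({
            "id": int(row["id"]),
            "text": row["text"],
            # ground truth is stored as a list of names; convert it to a
            # frequency dict for comparison against LLM outputs
            "truth": dict(Counter(names)),
        })
    return samples

samples = load_samples(SAMPLE_CSV)
# complete set of possible countries, needed by the local scoring function
all_countries = sorted({name for s in samples for name in s["truth"]})
```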
Step 2: Graph of Operations Construction
Build the Graph of Operations according to the chosen decomposition granularity. For GoT4, the graph splits text into 4 paragraphs; for GoT8, into 8 paragraphs; for GoTx, into individual sentences. Each sub-passage goes through a parallel branch of Generate (10 candidates) → Score (count errors against local sub-text) → KeepBestN (1). Sub-passage results are then aggregated in a binary tree pattern using Aggregate → ValidateAndImprove → Score → KeepBestN at each level.
What happens:
- Generate splits the text into N sub-passages via the Prompter
- Selector operations route each sub-passage to its own counting branch
- Each branch generates multiple frequency dict candidates and keeps the best
- Hierarchical pairwise aggregation merges frequency dicts bottom-up
- ValidateAndImprove checks that merged counts equal the sum of parts
- If validation fails, the Improve prompt asks the LLM to fix the aggregation
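The topology described above can be sketched structurally. This is not the library's actual operation API; it is a plain-Python outline showing how the per-sub-passage branches and the binary-tree aggregation levels compose for a given granularity.

```python
# Structural sketch of the GoT operation graph (illustrative names, not the
# graph-of-thoughts library API). Each sub-passage gets its own
# Generate -> Score -> KeepBestN branch; branch outputs are merged pairwise,
# with a ValidateAndImprove gate after every Aggregate.
def build_got_graph(num_parts):
    ops = [("Generate", "split passage into sub-passages")]
    for i in range(num_parts):
        ops += [
            (f"Selector[{i}]", "route sub-passage to its branch"),
            (f"Generate[{i}]", "10 frequency-dict candidates"),
            (f"Score[{i}]", "count errors against local sub-text"),
            (f"KeepBestN[{i}]", "keep 1"),
        ]
    level, width = 0, num_parts
    while width > 1:                     # binary-tree aggregation, bottom-up
        width //= 2
        for i in range(width):
            ops += [
                (f"Aggregate[{level}.{i}]", "merge a pair of dicts"),
                (f"ValidateAndImprove[{level}.{i}]", "check sums, retry on failure"),
                (f"Score[{level}.{i}]", "count errors"),
                (f"KeepBestN[{level}.{i}]", "keep 1"),
            ]
        level += 1
    return ops

got4 = build_got_graph(4)   # GoT4: two aggregation levels, three merges
```

For GoT8 the same builder is called with 8 parts; for GoTx, with the number of sentences in the passage.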
Step 3: Prompter and Parser Configuration
Instantiate the KeywordCountingPrompter and KeywordCountingParser. The Prompter generates text-splitting prompts (paragraph or sentence level), keyword counting prompts with few-shot examples, aggregation prompts for merging frequency dictionaries, and improvement prompts for fixing incorrect aggregations. The Parser extracts JSON frequency dictionaries from LLM responses and manages sub-text routing through thought state metadata.
Key considerations:
- Prompts output JSON frequency dictionaries for structured parsing
- The split prompt varies by granularity (4 paragraphs, 8 paragraphs, or sentences)
- The improve prompt for aggregation shows both partial results and the incorrect merge
- Phase tracking in thought state controls prompt selection
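The extraction side of the Parser can be sketched as below. This is a simplified stand-in for the KeywordCountingParser's logic, assuming the prompt instructs the model to emit a flat JSON object of keyword-to-count pairs; the repository's actual parsing may be more elaborate.

```python
import json
import re

def parse_frequency_dict(response):
    """Pull the first flat JSON object out of an LLM response.

    Returns {} when no well-formed dict is found, so a failed parse simply
    yields a zero-score candidate instead of crashing the pipeline.
    """
    match = re.search(r"\{[^{}]*\}", response, re.DOTALL)
    if not match:
        return {}
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    # keep only string keys mapped to non-negative integer counts
    return {k: v for k, v in parsed.items()
            if isinstance(k, str) and isinstance(v, int) and v >= 0}

reply = 'Here are the counts:\n{"France": 2, "Peru": 1}\nDone.'
```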
Step 4: Controller Execution with Validation
Execute the Controller, which processes the operation DAG in BFS order. The key differentiator in this workflow is the ValidateAndImprove operation: after each aggregation, a validation function checks whether the merged frequency dictionary is consistent with its inputs. If validation fails, the Controller triggers the Improve path, which asks the LLM to fix the aggregation, up to a configurable number of retries.
What happens:
- Operations execute when all predecessors are complete
- Scoring uses a local function that counts errors against ground truth or sub-text
- ValidateAndImprove checks: all keys from both inputs appear in output, and frequencies sum correctly
- Failed validations trigger the Improve prompt with the incorrect merge and both inputs
- The retry budget limits how many correction attempts are made
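The consistency check at the heart of ValidateAndImprove reduces to a simple invariant: the merged dict must be the element-wise sum of its two inputs. A minimal sketch (the repository's exact implementation may differ):

```python
from collections import Counter

def validate_aggregation(left, right, merged):
    """Return True iff `merged` is exactly the element-wise sum of the two
    input frequency dicts: every key from either input appears in the
    output, and each frequency equals the sum of the input frequencies."""
    expected = Counter(left)
    expected.update(right)          # element-wise sum of the two dicts
    return dict(expected) == merged
```

On a False result, the Controller would re-prompt the LLM with both inputs and the incorrect merge, up to the configured retry budget.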
Step 5: Result Serialization and Cost Tracking
Serialize the full execution graph to JSON, capturing all operations, thought states, scores, and validation results. Track cumulative token usage and API cost across all operations for the sample.
Key considerations:
- Each method-sample combination produces a separate result file
- Budget is shared across all methods; execution halts when depleted
- Cost tracking covers both prompt and completion tokens
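The shared-budget accounting can be sketched as a small tracker. The per-1k-token prices here are illustrative assumptions, not the benchmark's actual rates:

```python
class BudgetTracker:
    """Accumulate API cost across methods against one shared budget.

    Prices (per 1k tokens) are illustrative assumptions; the benchmark's
    actual rates depend on the model used.
    """
    def __init__(self, budget_usd, prompt_price=0.03, completion_price=0.06):
        self.budget = budget_usd
        self.spent = 0.0
        self.prompt_price = prompt_price
        self.completion_price = completion_price

    def charge(self, prompt_tokens, completion_tokens):
        # cost covers both prompt and completion tokens
        self.spent += (prompt_tokens * self.prompt_price
                       + completion_tokens * self.completion_price) / 1000.0

    @property
    def depleted(self):
        # execution halts for all remaining methods once this turns True
        return self.spent >= self.budget

tracker = BudgetTracker(budget_usd=1.0)
tracker.charge(prompt_tokens=1000, completion_tokens=500)
```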
Step 6: Multi-method Comparison
Run all seven approaches across the sample set and use the plotting script to generate comparative accuracy and cost charts. The GoTx (sentence-level) approach typically achieves the highest accuracy but at greater cost, while GoT4 offers a good accuracy-cost balance.
Key considerations:
- Seven approaches provide a comprehensive reasoning topology comparison
- The plotting script reads result directories and generates publication-ready figures
- Accuracy is measured as the number of correct frequency entries
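The per-sample accuracy measure described above can be sketched as an entry-wise comparison of predicted and ground-truth frequencies (the paper's exact scoring function may differ):

```python
def correct_entries(truth, predicted):
    """Number of keyword entries whose predicted frequency matches ground
    truth; a missing key counts as a predicted frequency of zero."""
    keys = set(truth) | set(predicted)
    return sum(1 for k in keys if truth.get(k, 0) == predicted.get(k, 0))
```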