Workflow:Iamhankai Forest of Thought CGDM Post Processing

Knowledge Sources	Forest-of-Thought, Forest-of-Thought: Scaling Test-Time Compute
Domains	LLM_Reasoning, Answer_Aggregation, Post_Processing
Last Updated	2026-02-14 03:00 GMT

Overview

A two-stage post-processing pipeline that extracts final answers from Forest-of-Thought results using majority voting, followed by an LLM-as-judge fallback for tied or unresolved cases.

Description

This workflow operates on the JSON output files produced by the FoT Benchmark Evaluation workflow and implements the Consensus-Guided Decision Making (CGDM) strategy as a standalone post-processing step. In the first stage, it aggregates candidate answers across all trees and iterations, extracts normalized answer labels, and performs majority voting. When a single answer wins the vote, that answer is selected. When the vote is tied, or no valid answer can be extracted, the problem is flagged for the second stage, in which a separate (potentially stronger) LLM acts as an expert judge: it receives the question and all candidate answers and selects the best one. Separating the computationally expensive inference phase from answer selection allows re-evaluation with different models or strategies without re-running the full forest.

Usage

Execute this workflow after the FoT Benchmark Evaluation workflow has produced result JSON files. Use this when you want to re-aggregate answers using a different strategy, when you want to use a stronger model (e.g., QwQ-32B) as the expert judge, or when you want to analyze which problems required expert intervention. This workflow is particularly useful for large-scale evaluations where re-running inference is expensive but answer selection can be iterated cheaply.

Execution Steps

Step 1: Load FoT Result Files

Read the JSON output files produced by the FoT Benchmark Evaluation pipeline. Each file contains per-problem records with tree answers, exploration rewards, UCB scores, and raw answer strings. Load the corresponding ground truth dataset for correctness evaluation.

Key considerations:

Result files use MD5-hashed filenames based on the dataset content
Each record contains answers from multiple trees and multiple MCTS iterations per tree
Ground truth is loaded separately from the original dataset file for independent verification
The result format varies slightly between MCTS, CoT, and ToT base modes
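
The loading step can be sketched as follows. This is a minimal illustration, not the repository's code: the exact hashing scheme (an MD5 hex digest of the raw dataset bytes plus a `.json` suffix) and the helper names `result_filename` and `load_results` are assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def result_filename(dataset_bytes: bytes) -> str:
    """Hypothetical naming scheme: MD5 hex digest of the dataset content."""
    return hashlib.md5(dataset_bytes).hexdigest() + ".json"


def load_results(results_dir: Path, dataset_path: Path) -> list:
    """Load the per-problem records that correspond to one dataset file.

    Assumes the result file is a single JSON array of records.
    """
    name = result_filename(dataset_path.read_bytes())
    with open(results_dir / name, encoding="utf-8") as fh:
        return json.load(fh)
```

Deriving the filename from the dataset content means a changed dataset never silently picks up stale results; the lookup simply fails with a missing-file error.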

Step 2: Answer Extraction and Normalization

For each problem, extract answer labels from all raw answer strings across all trees. Apply dataset-specific extraction logic that handles multiple answer formats: boxed LaTeX answers, "####" delimited answers, "The answer is" prefixed answers, and plain numeric answers. Normalize extracted labels by removing LaTeX formatting, dollar signs, and whitespace.

Key considerations:

The extraction cascade tries multiple patterns in priority order (boxed > #### > "The answer is" > last number)
Labels are classified by type: digit, option (A/B/C/D), yes/no, or formula
Formula-type labels require symbolic equivalence checking rather than string matching
Null extractions (unparseable answers) are filtered out before voting
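
The cascade and label typing described above might look like the sketch below. The regular expressions, the normalization rules, and the function names (`extract_boxed`, `normalize`, `extract_label`, `classify`) are illustrative assumptions; the actual patterns used by the workflow may differ per dataset.

```python
import re
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...}, honoring nested braces."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    depth, out = 1, []
    for ch in text[start + len("\\boxed{"):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # unbalanced braces


def normalize(label: str) -> Optional[str]:
    """Strip dollar signs, \\text{} wrappers, LaTeX spacing, and whitespace."""
    label = label.replace("$", "")
    label = re.sub(r"\\text\{([^{}]*)\}", r"\1", label)
    label = re.sub(r"\\[,;!]", "", label)
    label = re.sub(r"\s+", "", label)
    return label or None


def extract_label(raw: str) -> Optional[str]:
    """Try patterns in priority order: boxed > #### > 'The answer is' > last number."""
    candidates = [extract_boxed(raw)]
    m = re.search(r"####\s*([^\n]+)", raw)
    candidates.append(m.group(1) if m else None)
    m = re.search(r"the answer is:?\s*([^\n]+)", raw, re.IGNORECASE)
    candidates.append(m.group(1).rstrip(". ") if m else None)
    numbers = re.findall(r"-?\d+(?:\.\d+)?", raw)
    candidates.append(numbers[-1] if numbers else None)
    for candidate in candidates:
        if candidate is not None:
            label = normalize(candidate)
            if label:
                return label
    return None  # null extraction; filtered out before voting


def classify(label: str) -> str:
    """Assign one of the four label types named above."""
    if re.fullmatch(r"-?\d+(?:\.\d+)?", label):
        return "digit"
    if re.fullmatch(r"[A-D]", label):
        return "option"
    if label.lower() in {"yes", "no"}:
        return "yes/no"
    return "formula"
```

Under these assumptions, `extract_label("Work...\n#### 42")` yields `"42"`, and a string with no recognizable answer yields `None`.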

Step 3: Majority Voting (First Pass)

Perform frequency-based voting across all extracted answer labels for each problem. If a single answer has the strictly highest count, select it as the final answer. If multiple answers tie for the highest count, or if every answer is a placeholder (e.g., "I Don't Know"), flag the problem for model-based judging in the next step.

Key considerations:

Voting operates on normalized labels, not raw answer strings
Placeholder answers ("I Don't Know" and variants) are filtered before the voting winner is returned
Ties are defined as multiple answers sharing the maximum vote count
Problems with no valid extracted answers are also routed to the judge
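
A minimal sketch of the first-pass vote, assuming labels have already been normalized. The function name `majority_vote`, the `(winner, needs_judge)` return shape, and the placeholder detection rule are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def is_placeholder(label: str) -> bool:
    """Treat "I Don't Know" and spacing/punctuation variants as placeholders."""
    key = "".join(ch for ch in label.lower() if ch.isalnum())
    return key == "idontknow"


def majority_vote(labels: Iterable[Optional[str]]) -> Tuple[Optional[str], bool]:
    """Return (winner, needs_judge) for one problem.

    needs_judge is True when no valid label remains or when several
    labels share the maximum vote count.
    """
    valid = [l for l in labels if l is not None and not is_placeholder(l)]
    if not valid:
        return None, True
    ranked = Counter(valid).most_common()
    top_label, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None, True  # tie for first place
    return top_label, False
```

For example, `majority_vote(["42", "42", "7"])` selects `"42"`, while `majority_vote(["42", "7"])` is a tie and is routed to the judge.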

Step 4: Model Judge Fallback (Second Pass)

For problems flagged in the previous step, invoke a separate LLM (the expert judge) to select the best answer. The judge receives the original question and all candidate answers in a numbered format. It is prompted as a mathematics expert and asked to identify the answer most likely to be correct. If the judge's response can be parsed as an answer index, the corresponding candidate is returned; otherwise, a fresh answer is generated.

Key considerations:

The judge model can be different (and stronger) than the inference model
The judge model is loaded as a separate Pipeline instance (e.g., QwQ-32B-Preview)
Judge responses are parsed by extracting the first number after "The best answer is"
If the judge fails to select an answer, a fresh chain-of-thought generation is used as the final fallback
Results are written to a separate output file for analysis
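
The judge step can be sketched with the model abstracted as a plain callable, so the example does not depend on any particular inference library. The prompt wording, the 1-based candidate numbering, and the names `build_judge_prompt`, `parse_judge_choice`, and `judge_select` are assumptions; only the "The best answer is" parsing rule comes from the description above.

```python
import re
from typing import Callable, List, Optional


def build_judge_prompt(question: str, candidates: List[str]) -> str:
    """Present the question and numbered candidates (1-based, by assumption)."""
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(candidates, start=1))
    return (
        "You are a mathematics expert. Choose the most likely correct answer.\n"
        f"Question: {question}\n"
        f"Candidate answers:\n{numbered}\n"
        'Reply in the form "The best answer is <number>".'
    )


def parse_judge_choice(response: str, n_candidates: int) -> Optional[int]:
    """Extract the first number after 'The best answer is', if it is in range."""
    m = re.search(r"The best answer is\D*?(\d+)", response)
    if not m:
        return None
    idx = int(m.group(1))
    return idx if 1 <= idx <= n_candidates else None


def judge_select(
    question: str,
    candidates: List[str],
    judge: Callable[[str], str],
    fallback: Callable[[str], str],
) -> str:
    """Return the judge's pick, or a freshly generated answer if parsing fails."""
    response = judge(build_judge_prompt(question, candidates))
    idx = parse_judge_choice(response, len(candidates))
    if idx is not None:
        return candidates[idx - 1]
    return fallback(question)
```

Here `judge` and `fallback` stand in for calls to the judge model and to a chain-of-thought generation, respectively; rejecting out-of-range indices keeps a malformed judge response from selecting a nonexistent candidate.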

Step 5: Accuracy Evaluation and Reporting

Compare all final answers (from both majority voting and judge selection) against ground truth labels. Apply dataset-appropriate equivalence checking: numeric comparison for GSM8K, symbolic equivalence via sympy for MATH, and multi-layer normalization for complex expressions. Report total accuracy, compile a list of incorrect predictions (bad cases), and write detailed output files.

Key considerations:

Symbolic equivalence handles equivalent forms like "1/2" vs "0.5" vs "\frac{1}{2}"
Bad case lists enable targeted analysis of failure modes
Accuracy is computed separately for majority-vote and judge-resolved problems
Output includes both the final answer and the ground truth label for each problem
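
The evaluation step might be sketched as below. To keep the example dependency-free, it swaps in exact rational comparison with the standard-library `fractions.Fraction` in place of the sympy-based symbolic check described above, so it covers only numeric forms such as "1/2", "0.5", and "\frac{1}{2}". The record field names (`final_answer`, `ground_truth`, `route`) are hypothetical.

```python
import re
from fractions import Fraction
from typing import Dict, List, Optional


def to_fraction(label: str) -> Optional[Fraction]:
    """Parse decimals, a/b, and \\frac{a}{b} into an exact rational, else None."""
    s = re.sub(r"\s+", "", label.replace("$", "").replace(",", ""))
    m = re.fullmatch(r"\\d?frac\{(-?\d+)\}\{(-?\d+)\}", s)
    if m:
        s = f"{m.group(1)}/{m.group(2)}"
    try:
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return None


def is_equivalent(pred: Optional[str], truth: str) -> bool:
    """Numeric comparison when both sides parse; otherwise whitespace-free string match."""
    if pred is None:
        return False
    a, b = to_fraction(pred), to_fraction(truth)
    if a is not None and b is not None:
        return a == b
    return re.sub(r"\s+", "", pred) == re.sub(r"\s+", "", truth)


def evaluate(records: List[Dict]) -> Dict:
    """Report overall accuracy, accuracy per route (vote or judge), and bad cases."""
    tallies: Dict[str, tuple] = {}
    bad_cases = []
    for rec in records:
        ok = is_equivalent(rec["final_answer"], rec["ground_truth"])
        correct, total = tallies.get(rec["route"], (0, 0))
        tallies[rec["route"]] = (correct + int(ok), total + 1)
        if not ok:
            bad_cases.append(rec)
    n = sum(total for _, total in tallies.values())
    n_correct = sum(correct for correct, _ in tallies.values())
    return {
        "accuracy": n_correct / n if n else 0.0,
        "by_route": {r: c / t for r, (c, t) in tallies.items()},
        "bad_cases": bad_cases,
    }
```

Keeping the per-route tallies separate makes it easy to see whether the judge-resolved problems are systematically harder than those settled by the vote.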

GitHub URL

Workflow Repository