Overview
KeywordCountingParser is a domain-specific parser that extracts JSON frequency dictionaries from LLM text responses and updates thought state dictionaries for the keyword counting task. It subclasses the abstract Parser base class and defines all five of its abstract methods: three contain parsing logic, while parse_validation_answer and parse_score_answer are left as stubs because validation and scoring are done programmatically. The class is defined in the keyword counting example file.
Description
KeywordCountingParser initializes a response cache (self.cache) and provides the helper method strip_answer_json, on which all JSON extraction relies. It handles two different response formats: paragraph/sentence split JSON (for GoT decomposition) and frequency dictionary JSON (for counting and aggregation results).
Code Reference
Helper Method
import json

def strip_answer_json(self, text: str) -> str:
    """
    Extracts JSON from LLM response text.
    1. Strip whitespace
    2. If "Output:" present, take everything after it
    3. Find last '{' and last '}' positions
    4. Extract substring between them (inclusive)
    5. Validate with json.loads(); return '{}' on failure
    """
    text = text.strip()
    if "Output:" in text:
        text = text[text.index("Output:") + len("Output:"):].strip()
    # Only the last '{' ... last '}' span is kept, so nested objects do not survive.
    start = text.rfind("{")
    end = text.rfind("}")
    if start == -1 or end == -1:
        return "{}"
    text = text[start : end + 1]
    try:
        json.loads(text)
        return text
    except json.JSONDecodeError:
        # Invalid JSON falls back to an empty dictionary string.
        return "{}"
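The extraction steps listed in the docstring can be exercised in isolation. The following is a minimal standalone sketch, assuming a free function named extract_last_json_object (a hypothetical name, not part of the class) that mirrors the same steps. It illustrates two consequences of searching for the last '{': when several objects appear, only the final one is returned, and a nested object fails validation and falls back to '{}'.

```python
import json


def extract_last_json_object(text: str) -> str:
    """Standalone mirror of the strip_answer_json steps (hypothetical name)."""
    text = text.strip()
    if "Output:" in text:
        # Keep only what follows the first "Output:" marker.
        text = text[text.index("Output:") + len("Output:"):].strip()
    start = text.rfind("{")
    end = text.rfind("}")
    if start == -1 or end == -1:
        return "{}"
    candidate = text[start : end + 1]
    try:
        json.loads(candidate)
    except json.JSONDecodeError:
        return "{}"
    return candidate


print(extract_last_json_object('Sure! Output: {"Canada": 1, "Peru": 2}'))  # {"Canada": 1, "Peru": 2}
print(extract_last_json_object('{"a": 1} then {"b": 2}'))  # {"b": 2}
print(extract_last_json_object('{"outer": {"inner": 1}}'))  # {}
print(extract_last_json_object("no braces here"))  # {}
```

Flat frequency dictionaries and flat paragraph/sentence maps are unaffected by the nesting limitation, which is presumably why the simple last-brace search is sufficient for this task.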
Key Methods
class KeywordCountingParser(parser.Parser):
    def __init__(self) -> None:
        self.cache = {}

    def parse_generate_answer(self, state: Dict, texts: List[str]) -> List[Dict]:
        """
        Two code paths:
        1. GoT phase 0 (split): Extract JSON with 'Paragraph'/'Sentence' keys,
           create one state per key with phase=1, sub_text, part, current="".
        2. All other: Extract frequency dict via strip_answer_json,
           set as current with phase=2.
        """

    def parse_aggregation_answer(self, states: List[Dict], texts: List[str]) -> Union[Dict, List[Dict]]:
        """
        Extracts merged frequency dictionary from response.
        - Concatenates sub_text from both input states
        - Stores pre-aggregation dicts in aggr1 and aggr2 fields
        - Handles 0 or 1 input states by substituting empty dicts
        - Asserts at most 2 input states
        """

    def parse_improve_answer(self, state: Dict, texts: List[str]) -> Dict:
        """
        Extracts corrected frequency dictionary.
        Asserts exactly 1 text. Returns updated state with new current.
        """

    def parse_validation_answer(self, state: Dict, texts: List[str]) -> bool:
        """Not implemented (returns None). Validation uses programmatic valid_aggregation."""

    def parse_score_answer(self, states: List[Dict], texts: List[str]) -> List[float]:
        """Not implemented (returns None). Scoring uses programmatic num_errors."""
Detailed parse_aggregation_answer Logic
def parse_aggregation_answer(self, states, texts):
    assert len(states) <= 2
    # Pad missing inputs with empty placeholder states.
    if len(states) == 0:
        states = [{"current": "{}", "sub_text": ""}, {"current": "{}", "sub_text": ""}]
    elif len(states) == 1:
        states.append({"current": "{}", "sub_text": ""})
    new_states = []
    for text in texts:
        answer = self.strip_answer_json(text)
        new_state = states[0].copy()
        new_state["sub_text"] = (
            states[0].get("sub_text", "") + states[1].get("sub_text", "")
        )
        new_state["current"] = answer
        new_state["aggr1"] = states[0]["current"]
        new_state["aggr2"] = states[1]["current"]
        new_states.append(new_state)
    return new_states
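Note that in the single-state case the listing above calls states.append, which also changes the list object held by the caller. The following sketch shows one possible non-mutating variant of the padding step. The names EMPTY_STATE, pad_to_two, and aggregate are hypothetical, and the function takes an already-extracted JSON string instead of raw LLM text to keep the example short.

```python
from typing import Dict, List

# Placeholder used when fewer than two states are supplied (mirrors the listing above).
EMPTY_STATE = {"current": "{}", "sub_text": ""}


def pad_to_two(states: List[Dict]) -> List[Dict]:
    """Return a new two-element list, filling gaps with empty placeholder states."""
    assert len(states) <= 2, "aggregation expects at most 2 states"
    return list(states) + [dict(EMPTY_STATE) for _ in range(2 - len(states))]


def aggregate(states: List[Dict], answer: str) -> Dict:
    """Build the aggregated state from an already-extracted JSON answer string."""
    first, second = pad_to_two(states)
    new_state = first.copy()
    new_state["sub_text"] = first.get("sub_text", "") + second.get("sub_text", "")
    new_state["current"] = answer
    new_state["aggr1"] = first["current"]
    new_state["aggr2"] = second["current"]
    return new_state


single = [{"current": '{"Canada": 1}', "sub_text": "First paragraph..."}]
merged = aggregate(single, '{"Canada": 1}')
print(merged["aggr2"])  # {}
print(len(single))      # 1
```

Whether the in-place append matters depends on whether callers reuse the states list afterwards; the variant simply avoids relying on that.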
I/O Contract
Input
| Parameter | Type | Description |
|---|---|---|
| state | Dict | Current thought state with keys: original, current, method, phase |
| states | List[Dict] | For aggregation: at most 2 states with frequency dictionaries |
| texts | List[str] | Raw LLM response strings containing JSON dictionaries |
Output
| Method | Return Type | Description |
|---|---|---|
| parse_generate_answer | List[Dict] | Split: one state per paragraph/sentence. Count: state with frequency dict as current |
| parse_aggregation_answer | List[Dict] | State with merged frequency dict, aggr1, aggr2, combined sub_text |
| parse_improve_answer | Dict | State with corrected frequency dict as current |
| parse_validation_answer | bool | Not implemented (returns None); validation is programmatic |
| parse_score_answer | List[float] | Not implemented (returns None); scoring is programmatic |
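The state keys mentioned across this contract can be collected into a single type for reference. This is only an illustrative sketch: the parser itself works with plain Dict objects, the name ThoughtState and the helper missing_base_keys are hypothetical, and the field types are inferred from the examples on this page.

```python
from typing import List, TypedDict


class ThoughtState(TypedDict, total=False):
    """Keys seen on thought states for this task (types inferred, not confirmed)."""
    original: str   # full input text
    current: str    # JSON string of the current frequency dictionary, or ""
    method: str     # e.g. "io" or "got4"
    phase: int      # 0 before splitting, 1 after splitting, 2 after counting
    sub_text: str   # paragraph/sentence text assigned to this state
    part: str       # split key such as "Paragraph 1"
    aggr1: str      # first pre-aggregation dictionary
    aggr2: str      # second pre-aggregation dictionary


BASE_KEYS = ("original", "current", "method", "phase")


def missing_base_keys(state: dict) -> List[str]:
    """List which of the four base keys are absent from a state."""
    return [key for key in BASE_KEYS if key not in state]


state: ThoughtState = {"original": "...", "current": "", "method": "io", "phase": 0}
print(missing_base_keys(state))              # []
print(missing_base_keys({"current": "{}"}))  # ['original', 'method', 'phase']
```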
Usage Examples
Parsing a Split Response
parser = KeywordCountingParser()
state = {
    "original": "Alexandra boarded the first flight...",
    "current": "",
    "method": "got4",
    "phase": 0,
}
texts = [
    '{"Paragraph 1": "Alexandra boarded...", "Paragraph 2": "Her first stop...", '
    '"Paragraph 3": "The adventure...", "Paragraph 4": "Journeying westward..."}'
]
new_states = parser.parse_generate_answer(state, texts)
# Returns 4 states, each with sub_text, part ("Paragraph 1" etc.), phase=1, current=""
Parsing a Frequency Count Response
parser = KeywordCountingParser()
state = {"original": "...", "current": "", "method": "io", "phase": 0}
texts = ['Output: {"Canada": 1, "Mexico": 1, "Brazil": 1}']
new_states = parser.parse_generate_answer(state, texts)
# Returns [{"current": '{"Canada": 1, "Mexico": 1, "Brazil": 1}', "phase": 2, ...}]
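The two parse_generate_answer examples above take different branches. The following standalone sketch shows how such a dispatch could look without the class. The function names are hypothetical, the branch condition (method starts with "got" and phase is 0) is an assumption based on the docstring and the example states, and skipping keys that are not paragraph or sentence labels is likewise assumed.

```python
import json
from typing import Dict, List


def extract_json(text: str) -> str:
    """Stand-in for strip_answer_json (same steps, free function)."""
    text = text.strip()
    if "Output:" in text:
        text = text[text.index("Output:") + len("Output:"):].strip()
    start, end = text.rfind("{"), text.rfind("}")
    if start == -1 or end == -1:
        return "{}"
    candidate = text[start : end + 1]
    try:
        json.loads(candidate)
    except json.JSONDecodeError:
        return "{}"
    return candidate


def sketch_parse_generate(state: Dict, texts: List[str]) -> List[Dict]:
    new_states = []
    # Assumed branch condition: a GoT method whose input has not been split yet.
    is_split = state["method"].startswith("got") and state["phase"] == 0
    for text in texts:
        answer = extract_json(text)
        if is_split:
            for part, sub_text in json.loads(answer).items():
                # Assumption: keys that are not paragraph/sentence labels are skipped.
                if "Paragraph" not in part and "Sentence" not in part:
                    continue
                new_states.append(
                    {**state, "current": "", "sub_text": sub_text, "part": part, "phase": 1}
                )
        else:
            # Counting path: the extracted JSON string becomes the new current value.
            new_states.append({**state, "current": answer, "phase": 2})
    return new_states


split_state = {"original": "...", "current": "", "method": "got4", "phase": 0}
parts = sketch_parse_generate(split_state, ['{"Paragraph 1": "A", "Paragraph 2": "B"}'])
print([s["part"] for s in parts])  # ['Paragraph 1', 'Paragraph 2']

count_state = {"original": "...", "current": "", "method": "io", "phase": 0}
counted = sketch_parse_generate(count_state, ['Output: {"Canada": 1}'])
print(counted[0]["current"], counted[0]["phase"])  # {"Canada": 1} 2
```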
Parsing an Aggregation Response
parser = KeywordCountingParser()
states = [
    {"current": '{"Canada": 1}', "sub_text": "First paragraph..."},
    {"current": '{"Mexico": 1}', "sub_text": "Second paragraph..."},
]
texts = ['{"Canada": 1, "Mexico": 1}']
new_states = parser.parse_aggregation_answer(states, texts)
# Returns [{"current": '{"Canada": 1, "Mexico": 1}', "aggr1": '{"Canada": 1}',
#           "aggr2": '{"Mexico": 1}', "sub_text": "First paragraph...Second paragraph..."}]
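The examples above do not cover parse_improve_answer. The following is a rough standalone sketch of the behavior its docstring describes (exactly one response text, state copied, current replaced). The function name sketch_parse_improve is hypothetical, and the "Output:" handling of strip_answer_json is omitted for brevity.

```python
import json
from typing import Dict, List


def sketch_parse_improve(state: Dict, texts: List[str]) -> Dict:
    """Return a copy of the state whose current holds the corrected dictionary."""
    assert len(texts) == 1, "expected exactly one response"
    text = texts[0].strip()
    start, end = text.rfind("{"), text.rfind("}")
    answer = text[start : end + 1] if start != -1 and end != -1 else "{}"
    try:
        json.loads(answer)
    except json.JSONDecodeError:
        # Unparseable responses fall back to an empty dictionary string.
        answer = "{}"
    new_state = state.copy()
    new_state["current"] = answer
    return new_state


state = {"original": "...", "current": '{"Canada": 2}', "method": "got4", "phase": 2}
improved = sketch_parse_improve(state, ['Corrected: {"Canada": 1}'])
print(improved["current"])  # {"Canada": 1}
print(state["current"])     # {"Canada": 2}
```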
Related Pages