Principle:FMInference FlexLLMGen Prediction Evaluation Metrics
Metadata
| Field | Value |
|---|---|
| Sources | [FlexLLMGen](https://github.com/FMInference/FlexLLMGen) |
| Domains | Evaluation, Metrics |
| Last updated | 2026-02-09 00:00 GMT |
Overview
A classification evaluation methodology that computes precision, recall, accuracy, and F1 score for LLM-generated predictions on structured data wrangling tasks.
Description
After the LLM generates predictions for data wrangling tasks, evaluation compares predicted labels against ground truth. Different tasks use different matching criteria: entity matching and data imputation use exact string match; schema matching and spelling-based error detection (error_detection_spelling) use prefix match; general error detection uses suffix match on the final segment after splitting the prediction on double newlines. The metrics follow standard binary classification: true positives (ground truth "yes", predicted "yes"), false positives (ground truth "no", predicted "yes"), false negatives (ground truth "yes", predicted "no"), and true negatives (ground truth "no", predicted "no"). Precision, recall, accuracy, and F1 are then computed from these confusion matrix counts.
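The matching criteria above can be sketched as a single dispatch function. This is a hedged illustration: the task-name strings and the exact casing/split behavior are assumptions for clarity, not FlexLLMGen's verbatim implementation.

```python
def match(pred: str, truth: str, task: str) -> bool:
    """Task-specific matching of a model prediction against ground truth.

    Hypothetical helper illustrating the criteria described above;
    task identifiers are assumed, not FlexLLMGen's exact names.
    """
    pred, truth = pred.strip().lower(), truth.strip().lower()
    if task in ("entity_matching", "data_imputation"):
        # exact string match
        return pred == truth
    if task in ("schema_matching", "error_detection_spelling"):
        # prefix match: prediction may continue past the label
        return pred.startswith(truth)
    if task == "error_detection":
        # suffix match on the last double-newline-separated segment,
        # so trailing "answer: yes" style outputs are captured
        return pred.split("\n\n")[-1].endswith(truth)
    raise ValueError(f"unknown task: {task}")
```

For example, `match("reasoning...\n\nanswer: yes", "yes", "error_detection")` accepts a prediction whose final segment ends with the label.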
Usage
Use compute_metrics after collecting model predictions to evaluate performance on data wrangling benchmarks.
Theoretical Basis
Standard classification metrics: Precision = TP/(TP+FP), Recall = TP/(TP+FN), Accuracy = correct/total, F1 = 2*P*R/(P+R). Task-specific matching handles the fact that LLM outputs may contain extra text beyond the label.