Principle:FMInference FlexLLMGen Prediction Evaluation Metrics
Metadata
| Field | Value |
|---|---|
| Sources | [FlexLLMGen](https://github.com/FMInference/FlexLLMGen) |
| Domains | Evaluation, Metrics |
| Last updated | 2026-02-09 00:00 GMT |
Overview
A classification evaluation methodology that computes precision, recall, accuracy, and F1 score for LLM-generated predictions on structured data wrangling tasks.
Description
After the LLM generates predictions for data wrangling tasks, evaluation compares predicted labels against ground truth. Different tasks use different matching criteria: entity matching and data imputation use exact string match; schema matching and spelling-based error detection (error_detection_spelling) use prefix match; general error detection uses suffix match on the final segment after splitting the prediction on double newlines. The metrics follow standard binary classification: true positives (ground truth "yes", predicted "yes"), false positives (ground truth "no", predicted "yes"), false negatives (ground truth "yes", predicted "no"), and true negatives (ground truth "no", predicted "no"). Precision, recall, accuracy, and F1 are then computed from these confusion matrix counts.
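The matching criteria above can be sketched as a single dispatch function. This is a hedged illustration: the task-name strings and the exact casing/split behavior are assumptions for clarity, not FlexLLMGen's verbatim implementation.

```python
def match(pred: str, truth: str, task: str) -> bool:
    """Task-specific matching of a model prediction against ground truth.

    Hypothetical helper illustrating the criteria described above;
    task identifiers are assumed, not FlexLLMGen's exact names.
    """
    pred, truth = pred.strip().lower(), truth.strip().lower()
    if task in ("entity_matching", "data_imputation"):
        # exact string match
        return pred == truth
    if task in ("schema_matching", "error_detection_spelling"):
        # prefix match: prediction may continue past the label
        return pred.startswith(truth)
    if task == "error_detection":
        # suffix match on the last double-newline-separated segment,
        # so trailing "answer: yes" style outputs are captured
        return pred.split("\n\n")[-1].endswith(truth)
    raise ValueError(f"unknown task: {task}")
```

For example, `match("reasoning...\n\nanswer: yes", "yes", "error_detection")` accepts a prediction whose final segment ends with the label.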
Usage
Use compute_metrics after collecting model predictions to evaluate performance on data wrangling benchmarks.
Theoretical Basis
Standard classification metrics: Precision = TP/(TP+FP), Recall = TP/(TP+FN), Accuracy = correct/total, F1 = 2*P*R/(P+R). Task-specific matching handles the fact that LLM outputs may contain extra text beyond the label.