Metric Baseline Correlation
Metric Baseline Correlation is a principle in the Ragas evaluation toolkit for measuring the statistical agreement between metric predictions and human judgments. It establishes a quantitative baseline against which prompt optimization improvements can be compared.
Motivation
Before optimizing a metric's prompts, it is important to understand how well the metric already performs against human judgment. Correlation metrics provide this understanding:
- Baseline establishment -- The pre-optimization correlation sets a benchmark for improvement.
- Optimization validation -- Post-optimization correlation should exceed the baseline, confirming that prompt changes improved alignment.
- Metric comparison -- Different metrics or prompt variants can be compared by their correlation with human labels.
Without a baseline measurement, it is impossible to know whether optimization is producing meaningful improvement or simply overfitting to the training data.
Theoretical Foundation
Correlation for Discrete Metrics
For metrics that produce categorical outputs (e.g., "pass"/"fail", or custom discrete categories), Cohen's Kappa is used to measure agreement:

κ = (p_o − p_e) / (1 − p_e)

Where:
- p_o is the observed agreement (proportion of samples where the metric prediction matches the human label).
- p_e is the expected agreement by chance (calculated from the marginal distributions of both raters).
Cohen's Kappa has several desirable properties for evaluation metrics:
- Chance correction -- Unlike raw accuracy, Kappa accounts for agreement that would occur by random guessing. A metric that produces random outputs will have κ ≈ 0 regardless of class distribution.
- Interpretable scale -- Values range from -1 (complete disagreement) through 0 (chance agreement) to 1 (perfect agreement). Values above roughly 0.6 are generally taken to indicate substantial agreement.
- Class imbalance robustness -- By adjusting for expected chance agreement, Kappa provides a fair measure even when classes are imbalanced.
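The formula above can be computed directly from two label sequences. The following is a minimal, self-contained sketch (not the Ragas implementation) of Cohen's Kappa for discrete "pass"/"fail" labels:

```python
from collections import Counter

def cohen_kappa(gold, pred):
    """Chance-corrected agreement between human labels and metric predictions."""
    n = len(gold)
    # p_o: observed agreement -- fraction of samples where prediction matches gold.
    p_o = sum(g == p for g, p in zip(gold, pred)) / n
    # p_e: expected chance agreement from the marginal label distributions.
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    p_e = sum(gold_counts[c] * pred_counts.get(c, 0) for c in gold_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)

gold = ["pass", "pass", "fail", "pass", "fail", "fail"]
pred = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohen_kappa(gold, pred)  # 4/6 observed agreement, 0.5 by chance -> kappa = 1/3
```

Note that the raw accuracy here is 0.67, but Kappa reports only 0.33 once the 50% chance agreement implied by the balanced marginals is subtracted out.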
Correlation for Numeric Metrics
For metrics that produce continuous numeric scores, the Pearson correlation coefficient is used:

r = Σ (x_i − x̄)(y_i − ȳ) / √( Σ (x_i − x̄)² · Σ (y_i − ȳ)² )

Where x_i represents the gold (human) labels and y_i represents the metric predictions.
Pearson correlation properties:
- Linear relationship -- Measures the strength and direction of the linear relationship between metric outputs and human scores.
- Bounded scale -- Values range from -1 (perfect negative correlation) through 0 (no linear relationship) to 1 (perfect positive correlation).
- Normalized -- Unlike MSE, correlation is independent of the absolute scale of the metric outputs.
Correlation vs. Loss
Correlation and loss functions serve complementary roles:
| Aspect | Correlation | Loss |
|---|---|---|
| Purpose | Diagnostic measure of agreement | Optimization objective |
| Used during | Before/after optimization | During optimization |
| Direction | Higher is better | Depends on loss type (higher accuracy is better; lower MSE is better) |
| Chance correction | Yes (for Cohen's Kappa) | No |
The correlation provides a human-interpretable measure of metric quality, while loss functions provide the gradient signal that drives the optimizer.
Workflow
A typical baseline correlation workflow:
- Collect annotations -- Gather human annotations for the metric.
- Run baseline metric -- Evaluate the metric with its default prompts on the annotated data.
- Compute baseline correlation -- Calculate Cohen's Kappa (discrete) or Pearson r (numeric) between predictions and human labels.
- Optimize prompts -- Run prompt optimization.
- Compute post-optimization correlation -- Repeat the correlation measurement with optimized prompts.
- Compare -- The improvement in correlation quantifies the optimization's value.
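The steps above can be sketched end to end. Everything in this snippet is illustrative: `run_metric` stands in for a real Ragas metric, the "optimization" is just swapping a prompt, and raw agreement is used as a simple stand-in for Cohen's Kappa.

```python
def run_metric(samples, prompt_keyword):
    # Stub metric: "pass" if the (hypothetical) prompt keyword appears in the sample.
    return ["pass" if prompt_keyword in s else "fail" for s in samples]

def agreement(gold, pred):
    # Raw agreement; a real workflow would use Cohen's Kappa or Pearson r here.
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Step 1: human annotations for a small sample set.
samples = ["good answer", "bad answer", "good reply", "bad reply"]
human_labels = ["pass", "fail", "pass", "fail"]

# Steps 2-3: baseline metric run and baseline correlation.
baseline = agreement(human_labels, run_metric(samples, "answer"))

# Steps 4-5: "optimized" prompt and post-optimization correlation.
optimized = agreement(human_labels, run_metric(samples, "good"))

# Step 6: the improvement quantifies the optimization's value.
improvement = optimized - baseline
```

The comparison in the final step only means something because the baseline was measured first on the same annotated data.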
Implemented By
- Implementation:Explodinggradients_Ragas_Metric_Get_Correlation
See Also
- Optimization Loss Functions -- Loss functions used during optimization (complementary to correlation).
- Human Annotation Collection -- Provides the human labels for correlation measurement.
- Genetic Prompt Optimization -- Optimizer whose improvements are measured by correlation.
- DSPy Prompt Optimization -- Alternative optimizer similarly benchmarked.