Metric Baseline Correlation
Metric Baseline Correlation is a principle in the Ragas evaluation toolkit for measuring the statistical agreement between metric predictions and human judgments. It establishes a quantitative baseline against which prompt optimization improvements can be compared.
Motivation
Before optimizing a metric's prompts, it is important to understand how well the metric already performs against human judgment. Correlation metrics provide this understanding:
- Baseline establishment -- The pre-optimization correlation sets a benchmark for improvement.
- Optimization validation -- Post-optimization correlation should exceed the baseline, confirming that prompt changes improved alignment.
- Metric comparison -- Different metrics or prompt variants can be compared by their correlation with human labels.
Without a baseline measurement, it is impossible to know whether optimization is producing meaningful improvement or simply overfitting to the training data.
Theoretical Foundation
Correlation for Discrete Metrics
For metrics that produce categorical outputs (e.g., "pass"/"fail", or custom discrete categories), Cohen's Kappa is used to measure agreement:

κ = (p_o − p_e) / (1 − p_e)

Where:
- p_o is the observed agreement (proportion of samples where the metric prediction matches the human label).
- p_e is the expected agreement by chance (calculated from the marginal distributions of both raters).
Cohen's Kappa has several desirable properties for evaluation metrics:
- Chance correction -- Unlike raw accuracy, Kappa accounts for agreement that would occur by random guessing. A metric that produces random outputs will have κ ≈ 0 regardless of class distribution.
- Interpretable scale -- Values range from -1 (complete disagreement) through 0 (chance agreement) to 1 (perfect agreement). Values above roughly 0.6 are generally taken to indicate substantial agreement.
- Class imbalance robustness -- By adjusting for expected chance agreement, Kappa provides a fair measure even when classes are imbalanced.
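The formula above can be computed directly from two label sequences. The following is a minimal, self-contained sketch (not the Ragas implementation) of Cohen's Kappa for discrete "pass"/"fail" labels:

```python
from collections import Counter

def cohen_kappa(gold, pred):
    """Chance-corrected agreement between human labels and metric predictions."""
    n = len(gold)
    # p_o: observed agreement -- fraction of samples where prediction matches gold.
    p_o = sum(g == p for g, p in zip(gold, pred)) / n
    # p_e: expected chance agreement from the marginal label distributions.
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    p_e = sum(gold_counts[c] * pred_counts.get(c, 0) for c in gold_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)

gold = ["pass", "pass", "fail", "pass", "fail", "fail"]
pred = ["pass", "fail", "fail", "pass", "fail", "pass"]
kappa = cohen_kappa(gold, pred)  # 4/6 observed agreement, 0.5 by chance -> kappa = 1/3
```

Note that the raw accuracy here is 0.67, but Kappa reports only 0.33 once the 50% chance agreement implied by the balanced marginals is subtracted out.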
Correlation for Numeric Metrics
For metrics that produce continuous numeric scores, the Pearson correlation coefficient is used:

r = Σ (x_i − x̄)(y_i − ȳ) / √( Σ (x_i − x̄)² · Σ (y_i − ȳ)² )

Where x_i represents the gold (human) labels and y_i represents the metric predictions.
Pearson correlation properties:
- Linear relationship -- Measures the strength and direction of the linear relationship between metric outputs and human scores.
- Bounded scale -- Values range from -1 (perfect negative correlation) through 0 (no linear relationship) to 1 (perfect positive correlation).
- Normalized -- Unlike MSE, correlation is independent of the absolute scale of the metric outputs.
Correlation vs. Loss
Correlation and loss functions serve complementary roles:
| Aspect | Correlation | Loss |
|---|---|---|
| Purpose | Diagnostic measure of agreement | Optimization objective |
| Used during | Before/after optimization | During optimization |
| Direction | Higher is better | Depends on loss type (higher accuracy is better; lower MSE is better) |
| Chance correction | Yes (for Cohen's Kappa) | No |
The correlation provides a human-interpretable measure of metric quality, while loss functions provide the gradient signal that drives the optimizer.
Workflow
A typical baseline correlation workflow:
- Collect annotations -- Gather human annotations for the metric.
- Run baseline metric -- Evaluate the metric with its default prompts on the annotated data.
- Compute baseline correlation -- Calculate Cohen's Kappa (discrete) or Pearson r (numeric) between predictions and human labels.
- Optimize prompts -- Run prompt optimization.
- Compute post-optimization correlation -- Repeat the correlation measurement with optimized prompts.
- Compare -- The improvement in correlation quantifies the optimization's value.
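The steps above can be sketched end to end. Everything in this snippet is illustrative: `run_metric` stands in for a real Ragas metric, the "optimization" is just swapping a prompt, and raw agreement is used as a simple stand-in for Cohen's Kappa.

```python
def run_metric(samples, prompt_keyword):
    # Stub metric: "pass" if the (hypothetical) prompt keyword appears in the sample.
    return ["pass" if prompt_keyword in s else "fail" for s in samples]

def agreement(gold, pred):
    # Raw agreement; a real workflow would use Cohen's Kappa or Pearson r here.
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Step 1: human annotations for a small sample set.
samples = ["good answer", "bad answer", "good reply", "bad reply"]
human_labels = ["pass", "fail", "pass", "fail"]

# Steps 2-3: baseline metric run and baseline correlation.
baseline = agreement(human_labels, run_metric(samples, "answer"))

# Steps 4-5: "optimized" prompt and post-optimization correlation.
optimized = agreement(human_labels, run_metric(samples, "good"))

# Step 6: the improvement quantifies the optimization's value.
improvement = optimized - baseline
```

The comparison in the final step only means something because the baseline was measured first on the same annotated data.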
Implemented By
- Implementation:Explodinggradients_Ragas_Metric_Get_Correlation
See Also
- Optimization Loss Functions -- Loss functions used during optimization (complementary to correlation).
- Human Annotation Collection -- Provides the human labels for correlation measurement.
- Genetic Prompt Optimization -- Optimizer whose improvements are measured by correlation.
- DSPy Prompt Optimization -- Alternative optimizer similarly benchmarked.