Workflow:Vibrantlabsai Ragas Custom Metric Creation

Knowledge Sources	Ragas, Ragas Docs, Metrics Concepts
Domains	LLM_Ops, Evaluation, Metrics
Last Updated	2026-02-12 10:00 GMT

Overview

End-to-end process for creating custom evaluation metrics in Ragas, using either decorator-based APIs or class inheritance, to assess LLM applications against domain-specific criteria.

Description

This workflow covers creating custom metrics for evaluating LLM applications when the built-in metrics do not match the evaluation criteria. Ragas provides three metric types, each with a matching decorator factory: DiscreteMetric for categorical evaluations (pass/fail, good/bad), NumericMetric for continuous scores (0.0 to 1.0), and RankingMetric for ordered list evaluations. Metrics can be purely programmatic (plain Python functions) or LLM-based (structured prompts whose output is parsed with the instructor library). The decorator system automatically handles validation, async support, and batch processing.

Key outputs:

A reusable metric instance callable via score() or ascore()
Metrics compatible with evaluate() and @experiment() frameworks
Serializable metrics that can be saved and loaded from JSON

Usage

Execute this workflow when the built-in Ragas metrics do not cover your evaluation criteria and you need domain-specific scoring logic. This is appropriate when evaluating custom properties like code quality, task completion accuracy, format compliance, or any domain-specific quality dimension that requires either programmatic logic or LLM-based judgment.

Execution Steps

Step 1: Choose_Metric_Type

Determine the appropriate metric type based on the evaluation output format. DiscreteMetric is for categorical outputs with a fixed set of allowed values (e.g., pass/fail, correct/incorrect, low/medium/high). NumericMetric is for continuous float scores within a range (e.g., 0.0 to 1.0). RankingMetric is for producing ordered lists of items.

Selection criteria:

Use DiscreteMetric for binary or multi-class judgments
Use NumericMetric for continuous quality scores or similarity ratings
Use RankingMetric for ordering or prioritization tasks
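The selection criteria above can be sketched as a small helper that looks at one example of the output you want the metric to produce. This is only an illustration: choose_metric_type is a hypothetical function, not part of Ragas, and it returns the metric type names used in this workflow as plain strings.

```python
from typing import Any


def choose_metric_type(sample_output: Any) -> str:
    """Suggest a metric type name from one example of the desired output.

    Hypothetical helper for illustration; Ragas does not ship this function.
    """
    # Labels such as "pass" or True/False are categorical. bool is checked
    # here, before the numeric branch, because bool is a subclass of int.
    if isinstance(sample_output, (bool, str)):
        return "DiscreteMetric"
    # Plain numbers indicate a continuous score.
    if isinstance(sample_output, (int, float)):
        return "NumericMetric"
    # An ordered collection indicates a ranking task.
    if isinstance(sample_output, (list, tuple)):
        return "RankingMetric"
    raise TypeError(
        f"no metric type fits output of type {type(sample_output).__name__}"
    )


print(choose_metric_type("pass"))
print(choose_metric_type(0.82))
print(choose_metric_type(["doc_3", "doc_1", "doc_2"]))
```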

Step 2: Define_Metric_Function

Write the scoring function that implements the evaluation logic. For programmatic metrics, this is a Python function that computes the score directly. For LLM-based metrics, instantiate a DiscreteMetric or NumericMetric class with a prompt template that the LLM uses to judge quality. The function receives input fields and returns a MetricResult with a value and optional reason.
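For the LLM-based path, the link between a prompt template and the metric's input fields can be illustrated with the standard library alone. The sketch below assumes brace-style placeholders such as {response}; the prompt text and the helpers required_fields and render are hypothetical and only show how placeholder names translate into required inputs. It does not call an LLM or Ragas.

```python
import string
from typing import List

# Hypothetical judging prompt with two placeholder variables.
PROMPT = (
    "Judge whether the response answers the question.\n"
    "Question: {question}\n"
    "Response: {response}\n"
    "Answer with 'pass' or 'fail'."
)


def required_fields(template: str) -> List[str]:
    """Return placeholder names in order of first appearance."""
    seen: List[str] = []
    for _, field, _, _ in string.Formatter().parse(template):
        if field and field not in seen:
            seen.append(field)
    return seen


def render(template: str, **fields: str) -> str:
    """Fill the template, failing early if any placeholder has no value."""
    missing = [name for name in required_fields(template) if name not in fields]
    if missing:
        raise KeyError(f"missing prompt fields: {missing}")
    return template.format(**fields)


print(required_fields(PROMPT))
print(render(PROMPT, question="What is 2 + 2?", response="4"))
```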

Key considerations:

Function parameters define the metric's required input fields
Return type must be MetricResult with value matching the metric type
LLM-based metrics use a prompt string with placeholder variables
Programmatic metrics should handle edge cases gracefully
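A programmatic scoring function that follows these considerations might look like the sketch below, which checks format compliance. To keep the example runnable without Ragas installed, MetricResult here is a local stand-in dataclass with only the value and reason fields described above; the function name json_format_compliance is an assumption made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricResult:
    """Local stand-in for the Ragas result type: a value plus an optional reason."""
    value: str
    reason: Optional[str] = None


def json_format_compliance(response: str) -> MetricResult:
    """Return 'pass' when the response parses as a JSON object, else 'fail'."""
    # Edge case: empty or whitespace-only output never complies.
    if not response or not response.strip():
        return MetricResult("fail", "empty response")
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError as exc:
        # Report the parser's message so a failure is easy to diagnose.
        return MetricResult("fail", f"invalid JSON: {exc.msg}")
    if not isinstance(parsed, dict):
        return MetricResult("fail", "top-level value is not a JSON object")
    return MetricResult("pass", "valid JSON object")


print(json_format_compliance('{"answer": 42}'))
print(json_format_compliance("not json"))
```

The single parameter, response, is the only input field this metric would require, and every branch returns one of two values, which suits a discrete metric with pass and fail as its allowed values.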

Step 3: Apply_Decorator

Apply the appropriate decorator to the scoring function: @discrete_metric for categorical, @numeric_metric for continuous, or @ranking_metric for ordered outputs. The decorator parameters include the metric name and allowed_values (list of strings for discrete, tuple of floats for numeric, integer for ranking). The decorator wraps the function into a full metric instance with validation and async support.

What happens:

The decorator creates a CustomMetric dataclass instance
Input parameters are validated using Pydantic type checking
Output values are validated against allowed_values constraints
The metric gains score(), ascore(), batch_score(), and abatch_score() methods
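To make the wrapping behaviour concrete, here is a toy decorator written with the standard library only. It is not the Ragas implementation: toy_discrete_metric, ToyDiscreteMetric, and the local MetricResult are hypothetical names, input validation with Pydantic is left out, and only output validation against allowed_values plus the score, ascore, and batch_score entry points are imitated.

```python
import asyncio
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class MetricResult:
    """Local stand-in for the Ragas result type."""
    value: str
    reason: Optional[str] = None


@dataclass
class ToyDiscreteMetric:
    """Toy metric object produced by the decorator below."""
    name: str
    allowed_values: List[str]
    func: Callable[..., MetricResult]

    def score(self, **kwargs) -> MetricResult:
        result = self.func(**kwargs)
        # Output validation: reject anything outside the declared categories.
        if result.value not in self.allowed_values:
            raise ValueError(
                f"{self.name}: {result.value!r} not in {self.allowed_values}"
            )
        return result

    async def ascore(self, **kwargs) -> MetricResult:
        # The toy version simply reuses the synchronous path.
        return self.score(**kwargs)

    def batch_score(self, inputs: List[Dict[str, str]]) -> List[MetricResult]:
        return [self.score(**item) for item in inputs]


def toy_discrete_metric(name: str, allowed_values: List[str]):
    """Decorator factory: turn a scoring function into a metric instance."""
    def wrap(func: Callable[..., MetricResult]) -> ToyDiscreteMetric:
        return ToyDiscreteMetric(name=name, allowed_values=allowed_values, func=func)
    return wrap


@toy_discrete_metric(name="length_check", allowed_values=["pass", "fail"])
def length_check(response: str) -> MetricResult:
    words = len(response.split())
    return MetricResult("pass" if words <= 5 else "fail", f"{words} words")


print(length_check.score(response="short answer"))
print(asyncio.run(length_check.ascore(response="one two three four five six")))
```

After decoration, length_check is no longer a plain function but a metric object, which is why it is called through score() rather than directly.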

Step 4: Test_Metric

Validate the metric by scoring individual samples to ensure correct behavior. Call metric.score() with sample inputs to verify the output values and reasons are appropriate. For LLM-based metrics, pass an LLM instance and verify the structured output matches expectations. Test edge cases and boundary conditions.

Key considerations:

Test with representative samples from the evaluation domain
Verify MetricResult values fall within allowed ranges
For LLM-based metrics, test with different LLM providers
Check that batch_score produces consistent results
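These checks can be automated with a small harness. The sketch below is independent of Ragas: token_recall is a toy numeric score, batch_token_recall stands in for a batch scoring entry point, and check_metric is a hypothetical helper that reports out-of-range values and any disagreement between batch and per-sample results.

```python
from typing import Callable, Dict, List, Tuple


def token_recall(response: str, reference: str) -> float:
    """Toy numeric score: fraction of reference tokens found in the response."""
    ref_tokens = reference.lower().split()
    if not ref_tokens:
        return 0.0  # edge case: nothing to recall
    resp_tokens = set(response.lower().split())
    return sum(tok in resp_tokens for tok in ref_tokens) / len(ref_tokens)


def batch_token_recall(samples: List[Dict[str, str]]) -> List[float]:
    """Stand-in for a batch scoring entry point."""
    return [token_recall(**sample) for sample in samples]


def check_metric(
    score_fn: Callable[..., float],
    batch_fn: Callable[[List[Dict[str, str]]], List[float]],
    samples: List[Dict[str, str]],
    allowed_range: Tuple[float, float] = (0.0, 1.0),
) -> List[str]:
    """Return human-readable failures; an empty list means every check passed."""
    low, high = allowed_range
    failures: List[str] = []
    single = [score_fn(**sample) for sample in samples]
    for sample, value in zip(samples, single):
        if not low <= value <= high:
            failures.append(f"{sample!r}: {value} outside [{low}, {high}]")
    if batch_fn(samples) != single:
        failures.append("batch results differ from per-sample results")
    return failures


# Representative sample plus two edge cases (empty response, empty reference).
SAMPLES = [
    {"response": "Paris is the capital of France", "reference": "Paris France"},
    {"response": "", "reference": "Paris"},
    {"response": "anything", "reference": ""},
]

print(check_metric(token_recall, batch_token_recall, SAMPLES))
```

For an LLM-based metric the exact-equality comparison between batch and single results would likely be too strict, since judgments can vary between calls; a tolerance or an agreement rate would be a more realistic check there.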

Step 5: Integrate_With_Evaluation

Use the custom metric with the evaluate() function or @experiment() decorator. Custom metrics created via decorators are fully compatible with both evaluation frameworks. They can be combined with built-in metrics in the same evaluation run and benefit from the same concurrency, caching, and error handling infrastructure.

Key considerations:

Ensure dataset columns match the metric's required input fields
Custom metrics can be saved to JSON and loaded later via metric.load()
Metrics can be shared across teams by distributing the JSON definition
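The first consideration, matching dataset columns to the metric's input fields, can be checked before an evaluation run. The sketch below assumes the dataset is available as a list of dictionaries and that the scoring function's parameters name the required fields, as described in Step 2; missing_columns and answer_matches are hypothetical names used only for illustration.

```python
import inspect
from typing import Callable, Dict, List


def missing_columns(metric_fn: Callable, rows: List[Dict[str, object]]) -> List[str]:
    """List required parameters of metric_fn that are absent from any row."""
    required = [
        name
        for name, param in inspect.signature(metric_fn).parameters.items()
        # Parameters with defaults are optional, so only default-free ones count.
        if param.default is inspect.Parameter.empty
        and param.kind
        in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
    ]
    return [name for name in required if any(name not in row for row in rows)]


def answer_matches(response: str, reference: str) -> bool:
    """Toy scoring function whose parameters define the required fields."""
    return response.strip().lower() == reference.strip().lower()


rows = [
    {"response": "Paris", "reference": "paris"},
    {"response": "Rome"},  # this row lacks the 'reference' column
]

print(missing_columns(answer_matches, rows))
```

Running a check like this first turns a mid-run failure on a missing field into an immediate, readable report of which columns need to be added or renamed.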
