Workflow:Vibrantlabsai Ragas Custom Metric Creation

From Leeroopedia
Knowledge Sources
Domains LLM_Ops, Evaluation, Metrics
Last Updated 2026-02-12 10:00 GMT

Overview

An end-to-end process for creating custom evaluation metrics in Ragas, using either decorator-based APIs or class inheritance, to assess domain-specific LLM applications.

Description

This workflow covers creating custom metrics for evaluating LLM applications when the built-in metrics do not match your evaluation criteria. Ragas provides three metric types, each with a matching decorator factory: DiscreteMetric for categorical evaluations (pass/fail, good/bad), NumericMetric for continuous scores (0.0 to 1.0), and RankingMetric for ordered-list evaluations. Metrics can be purely programmatic (plain Python functions) or LLM-based (structured prompts handled through the instructor library). The decorator system automatically handles validation, async support, and batch processing.

Key outputs:

  • A reusable metric instance callable via score() or ascore()
  • Metrics compatible with evaluate() and @experiment() frameworks
  • Serializable metrics that can be saved and loaded from JSON

Usage

Execute this workflow when the built-in Ragas metrics do not cover your evaluation criteria and you need domain-specific scoring logic. This is appropriate when evaluating custom properties like code quality, task completion accuracy, format compliance, or any domain-specific quality dimension that requires either programmatic logic or LLM-based judgment.

Execution Steps

Step 1: Choose_Metric_Type

Determine the appropriate metric type based on the evaluation output format. DiscreteMetric is for categorical outputs with a fixed set of allowed values (e.g., pass/fail, correct/incorrect, low/medium/high). NumericMetric is for continuous float scores within a range (e.g., 0.0 to 1.0). RankingMetric is for producing ordered lists of items.

Selection criteria:

  • Use DiscreteMetric for binary or multi-class judgments
  • Use NumericMetric for continuous quality scores or similarity ratings
  • Use RankingMetric for ordering or prioritization tasks
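The distinction between the three output shapes can be sketched in plain Python. This is a minimal illustration, not Ragas code: `matches_metric_type` is a hypothetical helper, and reading the ranking integer as the expected list length is an assumption.

```python
from typing import Any


def matches_metric_type(value: Any, kind: str, allowed: Any) -> bool:
    """Return True if `value` has the output shape expected for `kind`.

    Hypothetical helper for illustration only; not part of Ragas.
    """
    if kind == "discrete":
        # `allowed` is a list of category labels, e.g. ["pass", "fail"].
        return isinstance(value, str) and value in allowed
    if kind == "numeric":
        # `allowed` is a (low, high) tuple, e.g. (0.0, 1.0).
        low, high = allowed
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        return is_number and low <= value <= high
    if kind == "ranking":
        # `allowed` is an integer, assumed here to be the expected list length.
        return isinstance(value, (list, tuple)) and len(value) == allowed
    raise ValueError(f"unknown metric kind: {kind}")
```

If the scores you plan to emit do not fit one of these shapes cleanly, that is usually a sign the evaluation should be split into more than one metric.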

Step 2: Define_Metric_Function

Write the scoring function that implements the evaluation logic. For programmatic metrics, this is a Python function that computes the score directly. For LLM-based metrics, there is no scoring function to write: instead, instantiate the DiscreteMetric or NumericMetric class with a prompt template that the LLM uses to judge quality. In both cases the metric receives the input fields and returns a MetricResult with a value and an optional reason.

Key considerations:

  • Function parameters define the metric's required input fields
  • Return type must be MetricResult with value matching the metric type
  • LLM-based metrics use a prompt string with placeholder variables
  • Programmatic metrics should handle edge cases gracefully
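The considerations above can be sketched as follows. To keep the example runnable without Ragas installed, `MetricResult` here is a local stand-in dataclass with the two fields the text describes (value and optional reason), not the library's own class; `exact_match`, `PROMPT`, and `render_prompt` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class MetricResult:
    """Local stand-in for the result type described above (not the Ragas class)."""
    value: Any
    reason: Optional[str] = None


def exact_match(predicted: str, expected: str) -> MetricResult:
    """Programmatic metric: case-insensitive exact match.

    The parameter names double as the metric's required input fields.
    """
    # Edge case: empty or missing inputs fail with an explicit reason.
    if not predicted or not expected:
        return MetricResult("fail", "missing input")
    ok = predicted.strip().lower() == expected.strip().lower()
    return MetricResult("pass" if ok else "fail", "match" if ok else "mismatch")


# LLM-based metrics use a prompt string with placeholder variables instead.
PROMPT = (
    "Does the response answer the question?\n"
    "Question: {question}\n"
    "Response: {response}\n"
    "Answer with 'pass' or 'fail'."
)


def render_prompt(template: str, **fields: str) -> str:
    """Fill the placeholders; raises KeyError if a field is missing."""
    return template.format(**fields)
```

Returning a reason alongside the value costs little and makes failed samples much easier to triage later.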

Step 3: Apply_Decorator

Apply the appropriate decorator to the scoring function: @discrete_metric for categorical, @numeric_metric for continuous, or @ranking_metric for ordered outputs. The decorator parameters include the metric name and allowed_values (list of strings for discrete, tuple of floats for numeric, integer for ranking). The decorator wraps the function into a full metric instance with validation and async support.

What happens:

  • The decorator creates a CustomMetric dataclass instance
  • Input parameters are validated using Pydantic type checking
  • Output values are validated against allowed_values constraints
  • The metric gains score(), ascore(), batch_score(), and abatch_score() methods
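To make the wrapping behaviour concrete, here is a simplified stand-in for what such a decorator does. It is a sketch under stated assumptions, not the Ragas implementation: `SketchMetric`, `Result`, and `discrete_metric_sketch` are hypothetical names, and only output validation is shown (no Pydantic input checking).

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Result:
    """Stand-in result type: a value plus an optional reason."""
    value: Any
    reason: Optional[str] = None


@dataclass
class SketchMetric:
    """Hypothetical metric object produced by the decorator below."""
    name: str
    allowed_values: List[str]
    func: Callable[..., Result]

    def score(self, **kwargs: Any) -> Result:
        result = self.func(**kwargs)
        # Validate the output against the allowed_values constraint.
        if result.value not in self.allowed_values:
            raise ValueError(f"{self.name}: {result.value!r} is not an allowed value")
        return result

    async def ascore(self, **kwargs: Any) -> Result:
        # Simplification: delegates to the synchronous path.
        return self.score(**kwargs)

    def batch_score(self, rows: List[Dict[str, Any]]) -> List[Result]:
        return [self.score(**row) for row in rows]


def discrete_metric_sketch(name: str, allowed_values: List[str]):
    """Decorator factory: replaces the function with a metric instance."""
    def wrap(func: Callable[..., Result]) -> SketchMetric:
        return SketchMetric(name=name, allowed_values=list(allowed_values), func=func)
    return wrap


@discrete_metric_sketch(name="length_check", allowed_values=["pass", "fail"])
def length_check(response: str) -> Result:
    ok = len(response) <= 20
    return Result("pass" if ok else "fail", f"{len(response)} characters")
```

After decoration, `length_check` is no longer a bare function but a metric object, so it is called as `length_check.score(response="...")`.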

Step 4: Test_Metric

Validate the metric by scoring individual samples to ensure correct behavior. Call metric.score() with sample inputs to verify the output values and reasons are appropriate. For LLM-based metrics, pass an LLM instance and verify the structured output matches expectations. Test edge cases and boundary conditions.

Key considerations:

  • Test with representative samples from the evaluation domain
  • Verify MetricResult values fall within allowed ranges
  • For LLM-based metrics, test with different LLM providers
  • Check that batch_score produces consistent results
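A minimal sketch of such checks, using a plain function in place of a real metric so it runs without Ragas or an LLM. `keyword_coverage`, `batch_score`, and `SAMPLES` are hypothetical names chosen for this example; with a real metric the same assertions would target `metric.score()` and `metric.batch_score()`.

```python
from typing import Any, Dict, List, Tuple


def keyword_coverage(response: str, keywords: List[str]) -> Tuple[float, str]:
    """Fraction of keywords found in the response, as (value, reason)."""
    if not keywords:
        # Edge case: nothing to look for.
        return 0.0, "no keywords supplied"
    text = response.lower()
    hits = [k for k in keywords if k.lower() in text]
    return len(hits) / len(keywords), f"matched {len(hits)} of {len(keywords)}"


def batch_score(rows: List[Dict[str, Any]]) -> List[Tuple[float, str]]:
    """Score many rows at once."""
    return [keyword_coverage(**row) for row in rows]


# Representative samples plus boundary cases (empty response, empty keywords).
SAMPLES = [
    {"response": "Paris is the capital of France", "keywords": ["Paris", "France"]},
    {"response": "", "keywords": ["Paris"]},
    {"response": "anything", "keywords": []},
]


def run_checks() -> List[float]:
    single = [keyword_coverage(**row) for row in SAMPLES]
    # Every value must fall inside the allowed range.
    assert all(0.0 <= value <= 1.0 for value, _ in single)
    # Batch scoring must agree with one-at-a-time scoring.
    assert batch_score(SAMPLES) == single
    return [value for value, _ in single]
```

For LLM-based metrics, exact equality between runs may not hold, so a looser check (for example, agreement on most samples) is likely more appropriate there.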

Step 5: Integrate_With_Evaluation

Use the custom metric with the evaluate() function or @experiment() decorator. Custom metrics created via decorators are fully compatible with both evaluation frameworks. They can be combined with built-in metrics in the same evaluation run and benefit from the same concurrency, caching, and error handling infrastructure.

Key considerations:

  • Ensure dataset columns match the metric's required input fields
  • Custom metrics can be saved to JSON and loaded later via metric.load()
  • Metrics can be shared across teams by distributing the JSON definition
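Two of these points lend themselves to a quick pre-flight sketch: checking that a dataset row supplies every field the scoring function requires, and round-tripping a metric definition through JSON. Both parts are illustrative assumptions. `missing_columns` and `tone_check` are hypothetical, and the JSON layout shown is invented for this example rather than the format Ragas actually writes.

```python
import inspect
import json
from typing import Any, Callable, Dict, List


def missing_columns(func: Callable[..., Any], row: Dict[str, Any]) -> List[str]:
    """List required parameters of `func` that `row` does not provide."""
    params = inspect.signature(func).parameters
    required = [
        name for name, p in params.items()
        if p.default is inspect.Parameter.empty
    ]
    return [name for name in required if name not in row]


def tone_check(response: str, audience: str) -> str:
    """Placeholder scoring function with two required input fields."""
    return "pass"


# Hypothetical definition layout; the real serialized schema may differ.
definition = {
    "name": "tone_check",
    "prompt": "Is this response suitable for {audience}? Response: {response}",
    "allowed_values": ["pass", "fail"],
}

serialized = json.dumps(definition, indent=2)  # shareable text form
restored = json.loads(serialized)              # what a teammate would load
```

Running a column check like this before a full evaluation run surfaces schema mismatches early, before any LLM calls are spent.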
