Workflow:Vibrantlabsai Ragas Custom Metric Creation
| Knowledge Sources | |
|---|---|
| Domains | LLM_Ops, Evaluation, Metrics |
| Last Updated | 2026-02-12 10:00 GMT |
Overview
End-to-end process for creating custom evaluation metrics in Ragas using decorator-based APIs or class inheritance for domain-specific LLM application assessment.
Description
This workflow covers creating custom metrics for evaluating LLM applications when the built-in metrics do not match the evaluation criteria. Ragas provides three metric types, each with a matching decorator factory: DiscreteMetric for categorical evaluations (pass/fail, good/bad), NumericMetric for continuous scores (e.g., 0.0 to 1.0), and RankingMetric for ordered-list evaluations. Metrics can be purely programmatic (plain Python functions) or LLM-based (structured prompts with the instructor library). The decorator system handles validation, async support, and batch processing automatically.
Key outputs:
- A reusable metric instance callable via score() or ascore()
- Metrics compatible with evaluate() and @experiment() frameworks
- Serializable metrics that can be saved and loaded from JSON
Usage
Use this workflow when the built-in Ragas metrics do not cover your evaluation criteria and you need domain-specific scoring logic. Typical cases include code quality, task completion accuracy, format compliance, or any other domain-specific quality dimension that requires either programmatic logic or LLM-based judgment.
Execution Steps
Step 1: Choose_Metric_Type
Determine the appropriate metric type based on the evaluation output format. DiscreteMetric is for categorical outputs with a fixed set of allowed values (e.g., pass/fail, correct/incorrect, low/medium/high). NumericMetric is for continuous float scores within a range (e.g., 0.0 to 1.0). RankingMetric is for producing ordered lists of items.
Selection criteria:
- Use DiscreteMetric for binary or multi-class judgments
- Use NumericMetric for continuous quality scores or similarity ratings
- Use RankingMetric for ordering or prioritization tasks
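The selection criteria above can be sketched as a small lookup on the shape of the output you want. The helper `choose_metric_type` is hypothetical and not part of Ragas; it only illustrates the decision rule, assuming one representative example of the desired output is available.

```python
from typing import Any


def choose_metric_type(example_output: Any) -> str:
    """Suggest a metric type name from one example of the desired output.

    Hypothetical helper for illustration; it is not part of Ragas.
    """
    # bool is checked before int/float because bool is a subclass of int.
    if isinstance(example_output, (bool, str)):
        return "DiscreteMetric"   # binary or multi-class judgment
    if isinstance(example_output, (int, float)):
        return "NumericMetric"    # continuous score such as 0.0 to 1.0
    if isinstance(example_output, (list, tuple)):
        return "RankingMetric"    # ordered list of items
    raise TypeError(
        f"No metric type fits output of type {type(example_output).__name__}"
    )


print(choose_metric_type("pass"))
print(choose_metric_type(0.75))
print(choose_metric_type(["doc3", "doc1"]))
```

In practice the choice is made once, by hand, when designing the metric; the function only makes the rule explicit.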
Step 2: Define_Metric_Function
Write the scoring function that implements the evaluation logic. For programmatic metrics, this is a Python function that computes the score directly. For LLM-based metrics, instantiate DiscreteMetric or NumericMetric with a prompt template that the LLM uses to judge quality. A programmatic scoring function receives the input fields as arguments and returns a MetricResult with a value and an optional reason.
Key considerations:
- Function parameters define the metric's required input fields
- Return type must be MetricResult with value matching the metric type
- LLM-based metrics use a prompt string with placeholder variables
- Programmatic metrics should handle edge cases gracefully
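As a minimal sketch of these considerations, the example below defines a programmatic scoring function and a prompt template. The `MetricResult` dataclass here is a stand-in written for this example so that it runs without Ragas installed; the function name, prompt wording, and placeholder names are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricResult:
    """Stand-in for the Ragas MetricResult: a value plus an optional reason."""
    value: str
    reason: Optional[str] = None


def exact_match(response: str, expected: str) -> MetricResult:
    """Programmatic scorer; its parameters are the metric's required input fields."""
    # Edge case: an empty or whitespace-only response is a failure, not an error.
    if not response or not response.strip():
        return MetricResult(value="fail", reason="Response is empty.")
    if response.strip().lower() == expected.strip().lower():
        return MetricResult(value="pass", reason="Response matches the expected answer.")
    return MetricResult(value="fail", reason="Response differs from the expected answer.")


# An LLM-based metric would instead carry a prompt with placeholder variables.
# The wording and the placeholder names below are hypothetical.
JUDGE_PROMPT = (
    "Does the response answer the question correctly?\n"
    "Response: {response}\nExpected: {expected}\n"
    "Answer with 'pass' or 'fail'."
)

print(exact_match("Paris", " paris "))
print(JUDGE_PROMPT.format(response="Paris", expected="Paris"))
```

Returning a failing result for empty input, rather than raising, keeps a single bad row from aborting a whole evaluation run.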
Step 3: Apply_Decorator
Apply the appropriate decorator to the scoring function: @discrete_metric for categorical, @numeric_metric for continuous, or @ranking_metric for ordered outputs. The decorator parameters include the metric name and allowed_values, whose type depends on the metric: a list of strings for discrete metrics, a tuple of floats for numeric metrics, and an integer for ranking metrics. The decorator wraps the function into a full metric instance with validation and async support.
What happens:
- The decorator creates a CustomMetric dataclass instance
- Input parameters are validated using Pydantic type checking
- Output values are validated against allowed_values constraints
- The metric gains score(), ascore(), batch_score(), and abatch_score() methods
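To make the wrapping behaviour concrete, here is a simplified stand-in for a decorator factory. It is not the Ragas implementation: `SketchMetric` and `discrete_metric_sketch` are hypothetical names, input checking uses `inspect` rather than Pydantic, and only `score()`, `ascore()`, and `batch_score()` are modelled.

```python
import asyncio
import inspect
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class MetricResult:
    """Stand-in for the Ragas MetricResult."""
    value: str
    reason: Optional[str] = None


@dataclass
class SketchMetric:
    """Hypothetical wrapper; the real decorator builds a CustomMetric instance."""
    name: str
    allowed_values: List[str]
    func: Callable[..., MetricResult]

    def score(self, **kwargs) -> MetricResult:
        # Input validation: every parameter of the wrapped function must be supplied.
        required = set(inspect.signature(self.func).parameters)
        missing = required - set(kwargs)
        if missing:
            raise TypeError(f"{self.name}: missing inputs {sorted(missing)}")
        result = self.func(**kwargs)
        # Output validation against the allowed_values constraint.
        if result.value not in self.allowed_values:
            raise ValueError(f"{self.name}: {result.value!r} not in {self.allowed_values}")
        return result

    async def ascore(self, **kwargs) -> MetricResult:
        # Run the synchronous scorer in a worker thread so the event loop stays free.
        return await asyncio.to_thread(self.score, **kwargs)

    def batch_score(self, inputs: List[dict]) -> List[MetricResult]:
        return [self.score(**row) for row in inputs]


def discrete_metric_sketch(name: str, allowed_values: List[str]):
    """Decorator factory that turns a scoring function into a SketchMetric."""
    def decorator(func: Callable[..., MetricResult]) -> SketchMetric:
        return SketchMetric(name=name, allowed_values=allowed_values, func=func)
    return decorator


@discrete_metric_sketch(name="length_check", allowed_values=["pass", "fail"])
def length_check(response: str) -> MetricResult:
    ok = len(response) <= 20
    return MetricResult("pass" if ok else "fail", f"{len(response)} characters")


print(length_check.score(response="short answer"))
print(asyncio.run(length_check.ascore(response="x" * 30)))
```

After decoration, `length_check` is no longer a plain function but a metric object, which is why it is called through `score()` rather than directly.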
Step 4: Test_Metric
Validate the metric by scoring individual samples to ensure correct behavior. Call metric.score() with sample inputs to verify the output values and reasons are appropriate. For LLM-based metrics, pass an LLM instance and verify the structured output matches expectations. Test edge cases and boundary conditions.
Key considerations:
- Test with representative samples from the evaluation domain
- Verify MetricResult values fall within allowed ranges
- For LLM-based metrics, test with different LLM providers
- Check that batch_score produces consistent results
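The checks above can be written as plain assertions. The sketch below assumes a hypothetical numeric scorer, `token_overlap`, and a local `batch_score` helper; with a real Ragas metric the same assertions would be applied to the values returned by its `score()` and `batch_score()` methods.

```python
from typing import List


def token_overlap(response: str, reference: str) -> float:
    """Hypothetical numeric scorer: Jaccard overlap of lowercase tokens, in [0.0, 1.0]."""
    a, b = set(response.lower().split()), set(reference.lower().split())
    if not a and not b:
        return 1.0  # two empty texts are treated as identical
    return len(a & b) / len(a | b)


def batch_score(rows: List[dict]) -> List[float]:
    """Score each row in order using the same function as single scoring."""
    return [token_overlap(**row) for row in rows]


samples = [
    {"response": "the cat sat", "reference": "the cat sat"},  # identical texts
    {"response": "the cat sat", "reference": "a dog ran"},    # no shared tokens
    {"response": "", "reference": "the cat sat"},             # empty response
    {"response": "", "reference": ""},                        # both empty
]

single = [token_overlap(**row) for row in samples]
batched = batch_score(samples)

# Every value must fall inside the allowed range.
assert all(0.0 <= s <= 1.0 for s in single)
# Boundary conditions: identical, disjoint, and empty inputs.
assert single[0] == 1.0 and single[1] == 0.0 and single[2] == 0.0
# Batch scoring must agree with one-at-a-time scoring.
assert batched == single
print(single)
```

For LLM-based metrics, exact equality between runs may not hold, so the batch-consistency check is better expressed as agreement on most samples than as strict equality.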
Step 5: Integrate_With_Evaluation
Use the custom metric with the evaluate() function or @experiment() decorator. Custom metrics created via decorators are fully compatible with both evaluation frameworks. They can be combined with built-in metrics in the same evaluation run and benefit from the same concurrency, caching, and error handling infrastructure.
Key considerations:
- Ensure dataset columns match the metric's required input fields
- Custom metrics can be saved to JSON and loaded later via metric.load()
- Metrics can be shared across teams by distributing the JSON definition
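Two of these considerations are easy to check in code: whether a dataset row supplies every input the scoring function needs, and whether a metric definition survives a JSON round trip. The sketch below is an assumption-laden illustration: `missing_columns` and `MetricDefinition` are hypothetical, and the JSON layout shown is not the format Ragas itself writes.

```python
import inspect
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import List


def missing_columns(func, row: dict) -> List[str]:
    """Return the scoring-function parameters that the dataset row does not provide."""
    return sorted(set(inspect.signature(func).parameters) - set(row))


@dataclass
class MetricDefinition:
    """Hypothetical JSON schema; the file format Ragas writes may differ."""
    name: str
    prompt: str
    allowed_values: List[str]

    def save(self, path: str) -> None:
        with open(path, "w", encoding="utf-8") as f:
            json.dump(asdict(self), f, indent=2)

    @classmethod
    def load(cls, path: str) -> "MetricDefinition":
        with open(path, encoding="utf-8") as f:
            return cls(**json.load(f))


def format_check(response: str, expected_format: str) -> str:
    """Toy scorer used only to supply a function signature for the column check."""
    return "pass" if response.startswith(expected_format) else "fail"


# The row lacks the 'expected_format' column that the scorer requires.
row = {"response": '{"ok": true}', "question": "Return JSON"}
print(missing_columns(format_check, row))

definition = MetricDefinition(
    name="format_check",
    prompt="Is the response valid {expected_format}? Response: {response}",
    allowed_values=["pass", "fail"],
)
path = os.path.join(tempfile.mkdtemp(), "format_check.json")
definition.save(path)
restored = MetricDefinition.load(path)
print(restored == definition)
```

Running the column check before a full evaluation surfaces schema mismatches early, before any LLM calls are made.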