Implementation:Deepset ai Haystack EvaluationRunResult
Overview
EvaluationRunResult is a Haystack class that contains the inputs and outputs of an evaluation pipeline run and provides methods to generate aggregated, detailed, and comparative reports in multiple formats.
Implements Principle
Principle:Deepset_ai_Haystack_Evaluation_Result_Reporting
Source Location
haystack/evaluation/eval_run_result.py (Lines 18-229)
Import
from haystack.evaluation import EvaluationRunResult
Dependencies
- pandas (optional) -- Required only for DataFrame output format. Install via: pip install pandas
- csv (standard library) -- Used for CSV output.
API
Constructor
def __init__(
    self,
    run_name: str,
    inputs: dict[str, list[Any]],
    results: dict[str, dict[str, Any]]
):
Parameters:
- run_name (str) -- Name of the evaluation run (used as identifier in comparative reports).
- inputs (dict[str, list[Any]]) -- Dictionary of inputs used for the run. Each key is an input name and its value is a list of input values. All lists must have the same length.
- results (dict[str, dict[str, Any]]) -- Dictionary of evaluator results. Each key is a metric name and its value is a dictionary with:
  - score (float) -- The aggregated score for the metric.
  - individual_scores (list) -- A list of scores for each input sample. Must match the length of input lists.
Raises:
- ValueError -- If no inputs are provided, input list lengths differ, aggregate score is missing, individual scores are missing, or individual score length does not match input length.
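The checks listed under Raises can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's own code: the helper name validate_run and the error messages are hypothetical.

```python
from typing import Any


def validate_run(inputs: dict[str, list[Any]], results: dict[str, dict[str, Any]]) -> int:
    """Hypothetical helper mirroring the documented constructor checks.

    Returns the common number of samples when everything is consistent.
    """
    if not inputs:
        raise ValueError("No inputs provided.")

    # Every input column must hold the same number of samples.
    lengths = {len(values) for values in inputs.values()}
    if len(lengths) != 1:
        raise ValueError("Input lists must all have the same length.")
    expected_len = lengths.pop()

    for metric, outputs in results.items():
        if "score" not in outputs:
            raise ValueError(f"Aggregate score missing for {metric}.")
        if "individual_scores" not in outputs:
            raise ValueError(f"Individual scores missing for {metric}.")
        if len(outputs["individual_scores"]) != expected_len:
            raise ValueError(f"Individual scores for {metric} must have length {expected_len}.")

    return expected_len


# One individual score for two inputs is rejected.
try:
    validate_run({"questions": ["Q1", "Q2"]}, {"mrr": {"score": 0.5, "individual_scores": [1.0]}})
except ValueError as err:
    print(f"rejected: {err}")
```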
Attributes
- run_name (str) -- Name of the evaluation run.
- inputs (dict) -- Deep copy of the provided inputs.
- results (dict) -- Deep copy of the provided results.
aggregated_report()
def aggregated_report(
    self,
    output_format: Literal["json", "csv", "df"] = "json",
    csv_file: str | None = None
) -> dict[str, list[Any]] | DataFrame | str:
Generates a report with aggregated scores for each metric.
Parameters:
- output_format (str, default: "json") -- Output format: "json", "csv", or "df".
- csv_file (str | None) -- File path for CSV output. Required when output_format="csv".
Returns:
- JSON format: {"metrics": [...], "score": [...]}
- DataFrame format: A pandas DataFrame with metrics and scores.
- CSV format: A string message confirming successful write.
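The JSON shape above is two parallel lists, one of metric names and one of aggregate scores. A minimal sketch of how such a dictionary could be assembled from a results mapping follows; the helper name aggregate_scores is hypothetical and this is not presented as the library's implementation.

```python
from typing import Any


def aggregate_scores(results: dict[str, dict[str, Any]]) -> dict[str, list[Any]]:
    """Hypothetical helper: collect each metric's aggregate score into two parallel lists."""
    return {
        "metrics": list(results.keys()),
        "score": [outputs["score"] for outputs in results.values()],
    }


results = {
    "mrr": {"score": 0.833, "individual_scores": [1.0, 0.5, 1.0]},
    "recall": {"score": 0.667, "individual_scores": [1.0, 0.0, 1.0]},
}
report = aggregate_scores(results)
print(report)
# {'metrics': ['mrr', 'recall'], 'score': [0.833, 0.667]}
```

Keeping the two lists aligned by position means the same dictionary can be passed to a DataFrame constructor or written out as CSV columns without reshaping.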
detailed_report()
def detailed_report(
    self,
    output_format: Literal["json", "csv", "df"] = "json",
    csv_file: str | None = None
) -> dict[str, list[Any]] | DataFrame | str:
Generates a report with per-query scores for each metric, alongside the input data.
Parameters: Same as aggregated_report().
Returns: A dictionary or DataFrame containing the input columns plus one individual-score column per metric. With output_format="csv", the same columns are written to csv_file and a string is returned.
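To make the column layout concrete, here is a small sketch that merges input columns with per-sample score columns. The helper name build_detailed is an assumption made for illustration; it only mirrors the shape described above.

```python
from typing import Any


def build_detailed(
    inputs: dict[str, list[Any]], results: dict[str, dict[str, Any]]
) -> dict[str, list[Any]]:
    """Hypothetical helper: input columns first, then one per-sample score column per metric."""
    combined: dict[str, list[Any]] = {name: list(values) for name, values in inputs.items()}
    for metric, outputs in results.items():
        combined[metric] = list(outputs["individual_scores"])
    return combined


detailed = build_detailed(
    {"questions": ["What is Python?", "Who created Java?"]},
    {"mrr": {"score": 0.75, "individual_scores": [1.0, 0.5]}},
)

# Each position across the columns describes one query.
for row in zip(*detailed.values()):
    print(row)
```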
comparative_detailed_report()
def comparative_detailed_report(
    self,
    other: EvaluationRunResult,
    keep_columns: list[str] | None = None,
    output_format: Literal["json", "csv", "df"] = "json",
    csv_file: str | None = None
) -> dict | DataFrame | str:
Generates a side-by-side comparison of two evaluation runs.
Parameters:
- other (EvaluationRunResult) -- Results of another evaluation run to compare with.
- keep_columns (list[str] | None) -- List of common input column names to keep from both runs. If None, all input columns from the first run are included (but not duplicated from the second).
- output_format (str, default: "json") -- Output format.
- csv_file (str | None) -- File path for CSV output.
Returns: Combined data with columns prefixed by run names (e.g., run_a_mrr, run_b_mrr).
Raises:
- ValueError -- If other is not an EvaluationRunResult or is missing required attributes.
Warnings:
- Logs a warning if the two run names are identical.
- Logs a warning if the input columns differ between runs.
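The column naming for the default case (keep_columns left as None) can be sketched as follows. This is an illustration only: the function compare_detailed and its parameters are hypothetical, and it covers just the behavior described above, where input columns come from the first run and metric columns are prefixed with each run name.

```python
from typing import Any


def compare_detailed(
    name_a: str,
    detailed_a: dict[str, list[Any]],
    name_b: str,
    detailed_b: dict[str, list[Any]],
    input_columns: list[str],
) -> dict[str, list[Any]]:
    """Hypothetical helper: merge two detailed reports side by side."""
    combined: dict[str, list[Any]] = {}

    # First run: input columns keep their names, metric columns get the run prefix.
    for column, values in detailed_a.items():
        key = column if column in input_columns else f"{name_a}_{column}"
        combined[key] = values

    # Second run: input columns are skipped so they are not duplicated.
    for column, values in detailed_b.items():
        if column not in input_columns:
            combined[f"{name_b}_{column}"] = values

    return combined


comparison = compare_detailed(
    "bm25_retriever",
    {"questions": ["Q1", "Q2"], "mrr": [1.0, 0.5]},
    "embedding_retriever",
    {"questions": ["Q1", "Q2"], "mrr": [1.0, 0.8]},
    input_columns=["questions"],
)
print(list(comparison))
# ['questions', 'bm25_retriever_mrr', 'embedding_retriever_mrr']
```

Prefixing by run name is also why identical run names are worth a warning: two runs sharing a name would map their metric columns to the same keys.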
Usage Example
from haystack.evaluation import EvaluationRunResult
# Create from evaluation pipeline outputs
run_result = EvaluationRunResult(
    run_name="rag_pipeline_v1",
    inputs={
        "questions": ["What is Python?", "Who created Java?", "What is C++?"],
    },
    results={
        "mrr": {"score": 0.833, "individual_scores": [1.0, 0.5, 1.0]},
        "recall": {"score": 0.667, "individual_scores": [1.0, 0.0, 1.0]},
    },
)
# Aggregated view
agg = run_result.aggregated_report()
print(agg)
# {"metrics": ["mrr", "recall"], "score": [0.833, 0.667]}
# Detailed view
detail = run_result.detailed_report()
print(detail)
# {"questions": [...], "mrr": [1.0, 0.5, 1.0], "recall": [1.0, 0.0, 1.0]}
# Export to CSV
run_result.detailed_report(output_format="csv", csv_file="results.csv")
# Export to DataFrame
df = run_result.detailed_report(output_format="df")
Comparative Example
from haystack.evaluation import EvaluationRunResult
run_a = EvaluationRunResult(
    run_name="bm25_retriever",
    inputs={"questions": ["Q1", "Q2"]},
    results={"mrr": {"score": 0.75, "individual_scores": [1.0, 0.5]}},
)
run_b = EvaluationRunResult(
    run_name="embedding_retriever",
    inputs={"questions": ["Q1", "Q2"]},
    results={"mrr": {"score": 0.90, "individual_scores": [1.0, 0.8]}},
)
# keep_columns is left at None, so input columns come from the first run only
comparison = run_a.comparative_detailed_report(run_b)
print(comparison)
# {"questions": ["Q1", "Q2"],
#  "bm25_retriever_mrr": [1.0, 0.5],
#  "embedding_retriever_mrr": [1.0, 0.8]}
Internal Design
Data Immutability
Inputs and results are deep-copied at initialization time to prevent external mutation from affecting the stored data.
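The effect of copying at construction time can be shown with a stand-in class. RunSnapshot below is a hypothetical container used only for this illustration; it is not part of Haystack.

```python
from copy import deepcopy
from typing import Any


class RunSnapshot:
    """Hypothetical stand-in that stores a deep copy of what it is given."""

    def __init__(self, inputs: dict[str, list[Any]]):
        # deepcopy also duplicates the nested lists, not just the outer dict.
        self.inputs = deepcopy(inputs)


original = {"questions": ["What is Python?"]}
snapshot = RunSnapshot(original)

# Mutating the caller's data afterwards does not reach the stored copy.
original["questions"].append("Who created Java?")
print(snapshot.inputs)
# {'questions': ['What is Python?']}
```

A shallow copy would not be enough here, because the outer dictionary would be new while the inner lists would still be shared with the caller.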
Type Consistency
When generating detailed reports, if any value in an individual scores column is a float, all values in that column are cast to float for consistency.
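The casting rule can be expressed in a few lines. The function name normalize_score_column is hypothetical; the sketch just applies the rule as stated above to a single column.

```python
from typing import Any


def normalize_score_column(values: list[Any]) -> list[Any]:
    """Hypothetical helper: promote a mixed int/float column to all floats."""
    if any(isinstance(value, float) for value in values):
        return [float(value) for value in values]
    return list(values)


print(normalize_score_column([1, 0.5, 0]))  # [1.0, 0.5, 0.0]
print(normalize_score_column([1, 0, 1]))    # [1, 0, 1]
```

A uniform column type matters most for DataFrame and CSV output, where a mix of 1 and 1.0 in one column would otherwise produce an inconsistent dtype or inconsistent text.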
Output Handling
The internal _handle_output() method routes formatted data to the appropriate output:
- json: Returns the dictionary directly.
- df: Converts to a pandas DataFrame (requires pandas).
- csv: Writes to a file using the _write_to_csv() static method.
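The routing described above can be sketched as a small dispatcher. The functions handle_output and write_csv here are hypothetical stand-ins, and the messages and error handling are assumptions rather than the library's actual behavior; pandas is imported inside the "df" branch so the other formats work without it.

```python
import csv
from typing import Any, Optional


def write_csv(csv_file: str, data: dict[str, list[Any]]) -> str:
    """Hypothetical helper: write column-oriented data as CSV rows."""
    headers = list(data.keys())
    rows = zip(*data.values())  # transpose columns into rows
    with open(csv_file, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(headers)
        writer.writerows(rows)
    return f"Data written to {csv_file}"


def handle_output(
    data: dict[str, list[Any]],
    output_format: str = "json",
    csv_file: Optional[str] = None,
) -> Any:
    """Hypothetical dispatcher over the three documented formats."""
    if output_format == "json":
        return data
    if output_format == "df":
        import pandas as pd  # imported lazily, only when a DataFrame is requested

        return pd.DataFrame(data)
    if output_format == "csv":
        if not csv_file:
            raise ValueError("csv_file is required when output_format is 'csv'.")
        return write_csv(csv_file, data)
    raise ValueError(f"Unknown output format: {output_format!r}")


print(handle_output({"metrics": ["mrr"], "score": [0.75]}))
```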
Important Notes
- Validation at init: All validation (matching lengths, required keys) happens at construction time. This fail-fast approach prevents runtime surprises.
- pandas is optional: The pandas library is only required when using output_format="df". It is lazy-imported.
- Column naming in comparisons: In comparative reports, metric columns are prefixed with the run name (e.g., run_name_metric).
- No pipeline coupling: EvaluationRunResult is a standalone data container. It does not depend on Pipeline or any evaluator component.