Implementation:Microsoft Semantic kernel QualityCheck NLP Server
| Knowledge Sources | |
|---|---|
| Domains | Python, FastAPI, NLP_Evaluation, Quality_Metrics |
| Last Updated | 2026-02-11 00:00 GMT |
Overview
FastAPI server providing NLP evaluation endpoints for summarization and translation quality metrics (BERTScore, METEOR, BLEU, COMET), used as part of the QualityCheck demo in the Semantic Kernel repository.
Description
This Python file implements a FastAPI web server that exposes four HTTP POST endpoints for computing NLP evaluation metrics. The server uses the HuggingFace evaluate library for the BERTScore, METEOR, and BLEU metrics, and the comet library (Unbabel) for COMET translation quality scores. It is part of the QualityCheck demo sample, which shows how to integrate NLP quality metrics into Semantic Kernel workflows.
The server defines two Pydantic request models:
- SummarizationEvaluationRequest - Accepts sources and summaries (lists of strings) for evaluating summarization quality
- TranslationEvaluationRequest - Accepts sources and translations (lists of strings) for evaluating translation quality
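The two request shapes can be mirrored on the client side to build JSON bodies. The following is a minimal sketch using only the standard library. The class names SummarizationPayload and TranslationPayload and the helper to_json_body are hypothetical, and the equal-length check is an assumption added for illustration; the server models shown in this page do not enforce it.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class SummarizationPayload:
    """Hypothetical client-side mirror of SummarizationEvaluationRequest."""
    sources: List[str]
    summaries: List[str]


@dataclass
class TranslationPayload:
    """Hypothetical client-side mirror of TranslationEvaluationRequest."""
    sources: List[str]
    translations: List[str]


def to_json_body(payload) -> str:
    """Serialize a payload after checking that both lists pair up one-to-one.

    The length check is an assumption of this sketch, not server behavior.
    """
    fields = asdict(payload)
    lengths = {len(values) for values in fields.values()}
    if len(lengths) != 1:
        raise ValueError("sources and candidate texts must have the same length")
    return json.dumps(fields)


body = to_json_body(
    SummarizationPayload(
        sources=["The cat sat on the mat."],
        summaries=["A cat was sitting on a mat."],
    )
)
print(body)
```

Checking lengths before sending keeps a mismatched batch from reaching the metric libraries, which pair each candidate text with the source at the same index.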
Four endpoints are provided:
- POST /bert-score/ - Computes BERTScore (precision, recall, F1) comparing summaries to source references
- POST /meteor-score/ - Computes METEOR score for summarization evaluation
- POST /bleu-score/ - Computes BLEU score for summarization evaluation
- POST /comet-score/ - Computes COMET score using the Unbabel wmt22-cometkiwi-da model for translation quality evaluation
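Because all four endpoints accept a JSON POST body, a client can route to them through a small lookup table. The sketch below builds a request without sending it, using only the standard library. The dictionary keys, the build_request helper, and the localhost address are assumptions for illustration; only the endpoint paths come from the list above.

```python
import json
import urllib.request

# Endpoint paths as listed above; the short keys are illustrative names.
ENDPOINTS = {
    "bert": "/bert-score/",
    "meteor": "/meteor-score/",
    "bleu": "/bleu-score/",
    "comet": "/comet-score/",
}


def build_request(base_url: str, metric: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for one metric endpoint."""
    url = base_url.rstrip("/") + ENDPOINTS[metric]
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(
    "http://localhost:8000",  # assumed local address of the running server
    "bleu",
    {"sources": ["The cat sat on the mat."], "summaries": ["A cat sat on a mat."]},
)
print(request.get_method(), request.full_url)
```

Once the server is running, passing the request to urllib.request.urlopen would send it; the sketch stops short of that so it can be run without a server.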
Usage
This server runs as a standalone FastAPI application alongside the QualityCheck demo. A .NET Semantic Kernel application calls its endpoints to evaluate the quality of AI-generated summaries and translations, so developers start the server locally before running the demo. It requires the Python dependencies fastapi, evaluate, and comet, an ASGI server such as uvicorn to host the app, and the underlying metric models.
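Before starting the server, it can help to confirm that the required modules are importable in the active environment. The following is a minimal sketch under the assumption that the import names match those in the server's import statements and run command; the missing_modules helper is hypothetical.

```python
import importlib.util
from typing import Iterable, List

# Import names taken from the server's imports, plus uvicorn from the run command.
REQUIRED_MODULES = ("fastapi", "pydantic", "evaluate", "comet", "uvicorn")


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All server dependencies are importable.")
```

This only checks that the modules can be located. It does not verify versions or that the metric models have been downloaded.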
Code Reference
Source Location
- Repository: Microsoft_Semantic_kernel
- File: dotnet/samples/Demos/QualityCheck/python-server/app/main.py
- Lines: 1-40
Signature
# Copyright (c) Microsoft. All rights reserved.
from typing import List

from pydantic import BaseModel
from fastapi import FastAPI
from evaluate import load
from comet import download_model, load_from_checkpoint

app = FastAPI()


class SummarizationEvaluationRequest(BaseModel):
    sources: List[str]
    summaries: List[str]


class TranslationEvaluationRequest(BaseModel):
    sources: List[str]
    translations: List[str]


@app.post("/bert-score/")
def bert_score(request: SummarizationEvaluationRequest):
    bertscore = load("bertscore")
    return bertscore.compute(predictions=request.summaries, references=request.sources, lang="en")
Import
# Run the FastAPI server using uvicorn
cd dotnet/samples/Demos/QualityCheck/python-server
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| sources | List[str] | yes | List of source/reference texts to compare against |
| summaries | List[str] | yes (summarization) | List of generated summaries to evaluate (used by /bert-score/, /meteor-score/, /bleu-score/) |
| translations | List[str] | yes (translation) | List of generated translations to evaluate (used by /comet-score/) |
Outputs
| Name | Type | Description |
|---|---|---|
| /bert-score/ response | object | BERTScore results with precision, recall, and F1 arrays for each prediction |
| /meteor-score/ response | object | METEOR score result with a single meteor float value |
| /bleu-score/ response | object | BLEU score result with a bleu float value, a precisions array (one entry per n-gram order), and a brevity_penalty float |
| /comet-score/ response | object | COMET model prediction results with scores for each source-translation pair |
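Since the /bert-score/ response holds one value per prediction, a caller will often want to reduce it to batch-level averages. The sketch below shows one way to do that; the summarize_bertscore helper is hypothetical and the numbers in the example dictionary are made up to match the documented shape, not real model output.

```python
from statistics import mean
from typing import Dict


def summarize_bertscore(result: dict) -> Dict[str, float]:
    """Average the per-prediction precision, recall, and f1 lists."""
    return {key: mean(result[key]) for key in ("precision", "recall", "f1")}


# Made-up values with the documented response shape; not real model output.
example = {
    "precision": [0.5, 1.0],
    "recall": [0.25, 0.75],
    "f1": [0.5, 0.5],
    "hashcode": "...",
}
print(summarize_bertscore(example))
```

The hashcode field is ignored here because it identifies the scoring configuration rather than carrying a score.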
Usage Examples
Calling the BERT Score Endpoint
import requests

response = requests.post("http://localhost:8000/bert-score/", json={
    "sources": ["The cat sat on the mat."],
    "summaries": ["A cat was sitting on a mat."]
})
result = response.json()
# result contains: {"precision": [...], "recall": [...], "f1": [...], "hashcode": "..."}
print(f"BERT F1 Score: {result['f1'][0]:.4f}")
Calling the COMET Translation Score Endpoint
import requests

response = requests.post("http://localhost:8000/comet-score/", json={
    "sources": ["The weather is nice today."],
    "translations": ["El clima es agradable hoy."]
})
result = response.json()
# COMET score for translation quality evaluation
print(f"COMET Score: {result}")