Implementation:Evidentlyai Evidently Legacy BERTScore Feature
| Knowledge Sources | |
|---|---|
| Domains | ML Monitoring, NLP, Text Similarity |
| Last Updated | 2026-02-14 12:00 GMT |
Overview
Computes BERTScore (precision, recall, and F1) between two text columns using BERT embeddings, with optional TF-IDF weighting.
Description
The BERTScoreFeature class extends GeneratedFeature to compute a BERTScore-based similarity metric between two text columns. BERTScore measures the semantic similarity between text pairs by comparing their contextual BERT embeddings at the token level.
The implementation follows these steps:
- Tokenization and Embedding: Both columns are tokenized using a BertTokenizer and embedded via a BertModel (defaulting to "bert-base-uncased"). The last hidden state is extracted as token-level embeddings.
- IDF Score Computation: The compute_idf_scores() method computes inverse document frequency (IDF) scores across all sentences in both columns combined. It counts document frequency for each unique token (including special [CLS] and [SEP] tokens) and applies
-log(count/M)where M is the total number of sentences. IDF arrays are padded to uniform length for batch processing.
- Similarity Calculation: The max_similarity() method computes a cosine similarity matrix between token embeddings of two sentences and returns the maximum similarity for each token in the first set.
- Score Aggregation: The calculate_scores() method computes recall and precision. When tfidf_weighted is True, the max similarity scores are weighted by IDF values and normalized by IDF sums. When False, simple mean of max similarities is used. The final score is the F1 harmonic mean of precision and recall.
The feature produces a single numerical column named by joining the input column names with a pipe character.
Usage
Use this feature when you need to measure semantic similarity between two text columns, such as comparing generated responses against reference answers, evaluating translation quality, or monitoring text generation drift. The TF-IDF weighting option down-weights common tokens and up-weights rare, informative tokens in the similarity computation.
Code Reference
Source Location
- Repository: Evidentlyai_Evidently
- File:
src/evidently/legacy/features/BERTScore_feature.py
Signature
class BERTScoreFeature(GeneratedFeature):
class Config:
type_alias = "evidently:feature:BERTScoreFeature"
__feature_type__: ClassVar = ColumnType.Numerical
columns: List[str]
model: str = "bert-base-uncased"
tfidf_weighted: bool = False
def generate_feature(self, data: pd.DataFrame, data_definition: DataDefinition) -> pd.DataFrame: ...
def compute_idf_scores(self, col1: pd.Series, col2: pd.Series, tokenizer) -> tuple: ...
def max_similarity(self, embeddings1, embeddings2): ...
def calculate_scores(self, emb1, emb2, idf_scores, index): ...
def _feature_name(self): ...
def _as_column(self) -> ColumnName: ...
Import
from evidently.legacy.features.BERTScore_feature import BERTScoreFeature
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| columns | List[str] | Yes | A list of exactly two column names: the candidate text column and the reference text column. |
| model | str | No | The pretrained BERT model identifier. Defaults to "bert-base-uncased". |
| tfidf_weighted | bool | No | Whether to weight token similarities by IDF scores. Defaults to False. |
| data | pd.DataFrame | Yes (at generation time) | The input DataFrame containing the text columns specified in columns. |
| data_definition | DataDefinition | Yes (at generation time) | The data definition describing the dataset schema. |
Outputs
| Name | Type | Description |
|---|---|---|
| return | pd.DataFrame | col2"). |
Usage Examples
from evidently.legacy.features.BERTScore_feature import BERTScoreFeature
# Basic BERTScore between two text columns
bert_feature = BERTScoreFeature(
columns=["generated_answer", "reference_answer"],
)
# With TF-IDF weighting and a specific model
bert_feature_weighted = BERTScoreFeature(
columns=["prediction", "ground_truth"],
model="bert-base-uncased",
tfidf_weighted=True,
)
# Generate features from a DataFrame
# result_df = bert_feature.generate_feature(data=my_dataframe, data_definition=my_data_def)
Related Pages
- Environment:Evidentlyai_Evidently_Python_Core_Environment
- Evidentlyai_Evidently_Legacy_Generated_Features - Base class GeneratedFeature that BERTScoreFeature extends.
- Evidentlyai_Evidently_Legacy_Exact_Match_Feature - A simpler text comparison feature for exact string matching.