Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Evidentlyai Evidently Legacy BERTScore Feature

From Leeroopedia
Knowledge Sources
Domains ML Monitoring, NLP, Text Similarity
Last Updated 2026-02-14 12:00 GMT

Overview

Computes BERTScore (precision, recall, and F1) between two text columns using BERT embeddings, with optional TF-IDF weighting.

Description

The BERTScoreFeature class extends GeneratedFeature to compute a BERTScore-based similarity metric between two text columns. BERTScore measures the semantic similarity between text pairs by comparing their contextual BERT embeddings at the token level.

The implementation follows these steps:

  1. Tokenization and Embedding: Both columns are tokenized using a BertTokenizer and embedded via a BertModel (defaulting to "bert-base-uncased"). The last hidden state is extracted as token-level embeddings.
  1. IDF Score Computation: The compute_idf_scores() method computes inverse document frequency (IDF) scores across all sentences in both columns combined. It counts document frequency for each unique token (including special [CLS] and [SEP] tokens) and applies -log(count/M) where M is the total number of sentences. IDF arrays are padded to uniform length for batch processing.
  1. Similarity Calculation: The max_similarity() method computes a cosine similarity matrix between token embeddings of two sentences and returns the maximum similarity for each token in the first set.
  1. Score Aggregation: The calculate_scores() method computes recall and precision. When tfidf_weighted is True, the max similarity scores are weighted by IDF values and normalized by IDF sums. When False, simple mean of max similarities is used. The final score is the F1 harmonic mean of precision and recall.

The feature produces a single numerical column named by joining the input column names with a pipe character.

Usage

Use this feature when you need to measure semantic similarity between two text columns, such as comparing generated responses against reference answers, evaluating translation quality, or monitoring text generation drift. The TF-IDF weighting option down-weights common tokens and up-weights rare, informative tokens in the similarity computation.

Code Reference

Source Location

Signature

class BERTScoreFeature(GeneratedFeature):
    class Config:
        type_alias = "evidently:feature:BERTScoreFeature"

    __feature_type__: ClassVar = ColumnType.Numerical
    columns: List[str]
    model: str = "bert-base-uncased"
    tfidf_weighted: bool = False

    def generate_feature(self, data: pd.DataFrame, data_definition: DataDefinition) -> pd.DataFrame: ...
    def compute_idf_scores(self, col1: pd.Series, col2: pd.Series, tokenizer) -> tuple: ...
    def max_similarity(self, embeddings1, embeddings2): ...
    def calculate_scores(self, emb1, emb2, idf_scores, index): ...
    def _feature_name(self): ...
    def _as_column(self) -> ColumnName: ...

Import

from evidently.legacy.features.BERTScore_feature import BERTScoreFeature

I/O Contract

Inputs

Name Type Required Description
columns List[str] Yes A list of exactly two column names: the candidate text column and the reference text column.
model str No The pretrained BERT model identifier. Defaults to "bert-base-uncased".
tfidf_weighted bool No Whether to weight token similarities by IDF scores. Defaults to False.
data pd.DataFrame Yes (at generation time) The input DataFrame containing the text columns specified in columns.
data_definition DataDefinition Yes (at generation time) The data definition describing the dataset schema.

Outputs

Name Type Description
return pd.DataFrame col2").

Usage Examples

from evidently.legacy.features.BERTScore_feature import BERTScoreFeature

# Basic BERTScore between two text columns
bert_feature = BERTScoreFeature(
    columns=["generated_answer", "reference_answer"],
)

# With TF-IDF weighting and a specific model
bert_feature_weighted = BERTScoreFeature(
    columns=["prediction", "ground_truth"],
    model="bert-base-uncased",
    tfidf_weighted=True,
)

# Generate features from a DataFrame
# result_df = bert_feature.generate_feature(data=my_dataframe, data_definition=my_data_def)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment