Embedding-Based Evaluation Metrics: Strengths and Weaknesses

We explore existing embedding-based evaluation metrics and review their benefits and pitfalls.

As we discussed in a previous blog post, evaluating the output of LLMs is challenging. Since there are so many ways to express one thought in a given language, it can be difficult to assess whether generated text is “correct” even when we have an answer to use as a reference.

Consider the following question and answer pairs:

If we had “Ref” available as a reference answer and asked some human graders to assess whether the other responses were correct, it shouldn’t be too hard for them to identify that “Subset” and “Reword” are correct, while “Wrong,” “Mislead” and “Irrelevant” are not. But human graders are slow and expensive! How might we automate the grading of these responses?

One general strategy might be to use string manipulation to quantify which words and phrases from the reference answer show up in the answer being graded, and to consider the candidate answer correct if there is enough overlap. For example, if the phrases “the ancient Greeks” and “attributed to the poet Homer” show up in the candidate answer, that’s a good sign, considering that the reference answer has those phrases too. This is essentially the premise of many popular supervised (reference-based) metrics, such as BLEU, ROUGE, and METEOR. Let’s look at how one of these metrics (BLEU) fares when grading these responses:

This approach works well for identifying “Subset” as correct. As its name suggests, “Subset” is mainly just a shortened version of “Ref.” You might hope this would be an effective way to identify “Reword” as correct as well, but a close examination will show that “Reword” shares almost no overlapping phrases, or even overlapping keywords with “Ref.” Nonetheless, it is clearly just a slight paraphrase of “Ref.” It appears that what we require is a mechanism that can not only capture exact matches between words/phrases but also capture some of the meaning associated with those words and phrases. Luckily, embeddings provide us with some great tools to do just that.
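To make the overlap idea concrete, here is a minimal sketch of a clipped n-gram precision. It is not BLEU itself (BLEU combines several n-gram orders and adds a brevity penalty), and the example sentences are invented stand-ins rather than the answers graded in this article.

```python
from collections import Counter


def ngrams(text, n):
    """Count the word n-grams in `text` (lowercased, split on whitespace)."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def ngram_precision(reference, candidate, n=1):
    """Fraction of the candidate's n-grams that also appear in the reference.

    Counts are clipped: a repeated candidate n-gram only gets credit for as
    many occurrences as the reference contains.
    """
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical sentences, not the article's actual answers.
reference = "the odyssey is an epic poem attributed to the poet homer"
subset = "an epic poem attributed to the poet homer"
reword = "a long verse tale said to be written by a greek bard"

print(ngram_precision(reference, subset, n=2))  # 1.0: every bigram is in the reference
print(ngram_precision(reference, reword, n=2))  # 0.0: no shared bigrams
```

The paraphrase scores zero even though it says roughly the same thing, which is the weakness that motivates embeddings.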

Background on embeddings

An embedding is a way to map portions of text into a vector space.

It is somewhat like hashing a string: for a given hashing (embedding) algorithm, a given string will always hash (embed) to the same number (vector).

There is one important respect in which hashing and embedding differ dramatically. Cryptographic hashing algorithms are designed to hide information about the input string: two strings that are only slightly different should hash to totally different values.

With embeddings, two strings that represent similar meaning should embed to vectors that are close to one another. So the vector for “war” and the vector for “battlefield” would likely be closer to one another than the vector for “home” is to either.
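The hashing half of this contrast is easy to try with Python's standard `hashlib`. The sketch below hashes two nearly identical strings and counts how many hex characters of the digests differ. The word pair is an arbitrary choice for illustration.

```python
import hashlib


def sha256_hex(text):
    """Return the SHA-256 digest of `text` as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def differing_positions(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


h1 = sha256_hex("war")
h2 = sha256_hex("wars")

print(h1)
print(h2)
print(differing_positions(h1, h2), "of", len(h1), "hex characters differ")
```

An embedding model is expected to do the opposite: a one-letter change like this should barely move the vector.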

Many embedding mechanisms preserve additional details in the geometry of the vectors they produce, beyond just proximity in meaning or geometric distance. For example, Word2Vec was shown to be capable of solving analogy problems by simple math on its vectors: the vector for “king,” minus the vector for “man”, plus the vector for “woman” would land you at a point where the closest vector was for the word “queen.”
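The arithmetic behind the analogy can be sketched with hand-made toy vectors. These are not real Word2Vec outputs: the four dimensions and their values are assumptions chosen so the example works out cleanly, and the input words are excluded from the search, as is common when evaluating analogies.

```python
import numpy as np

# Hand-made toy vectors (NOT real Word2Vec output).
# Assumed dimensions: [royalty, male, female, fruit]
vectors = {
    "king":  np.array([1.0, 1.0, 0.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0, 0.0]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}


def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman"))  # queen
```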

Just as hashing has many algorithms (SHA256, MD5, etc.) with different properties, there are many different algorithms you can use to produce an embedding. Some work by first breaking the text into words or tokens, embedding the words/tokens individually, then aggregating the individual word embeddings in some way (e.g., averaging word embeddings to produce the overall textual embedding is sometimes referred to as “Continuous Bag of Words” or “CBOW”). Such algorithms can be fast and efficient, but are prone to certain failure modes. Other techniques rely on embedding larger segments of text (such as sentences, paragraphs, or entire documents) as a unit to produce their embeddings. Even within these general approaches, there is room for much variation:

  • There are different algorithms for training an embedding model
  • Different datasets can be used to train the embedding model (e.g., to perform better when embedding text within a specific domain)
  • Different algorithms can be used for aggregating embeddings of portions of the entire text to result in the embedding for the text as a whole
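As a minimal sketch of the aggregation step, the snippet below builds a text embedding from per-word vectors by either averaging or summing them. The three-dimensional word vectors are invented for illustration; a real model would supply learned vectors with hundreds of dimensions.

```python
import numpy as np

# Toy 3-dimensional word vectors, invented for illustration only.
WORD_VECTORS = {
    "greek": np.array([0.9, 0.1, 0.0]),
    "epic":  np.array([0.2, 0.8, 0.1]),
    "poem":  np.array([0.1, 0.9, 0.2]),
}


def embed_text(text, aggregate="mean"):
    """Embed `text` by aggregating the vectors of its known words.

    aggregate="mean" averages the word vectors; aggregate="sum" adds them.
    Words missing from WORD_VECTORS are skipped.
    """
    vecs = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vecs:
        return np.zeros(3)
    stacked = np.stack(vecs)
    return stacked.mean(axis=0) if aggregate == "mean" else stacked.sum(axis=0)


print(embed_text("Greek epic poem"))
print(embed_text("poem epic Greek"))  # same vector: word order is ignored
```

One failure mode is visible right away: because the aggregation ignores word order, any reordering of the same words produces the same embedding.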

The simplest metrics: proximity

Once you have a reference, some text to grade against it, and an embedding scheme, you still need to find some way to translate it all to some form of “score”. One of the simplest ways is to measure how “close” the reference and candidate vectors are. But what does it mean for one vector to be “close” to another?

The simplest way to measure proximity is the Euclidean distance.

The Euclidean distance between a reference vector R and a candidate vector C in an n-dimensional embedding space: $d(R, C) = \sqrt{\sum_{i=1}^{n} (R_i - C_i)^2}$.
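As a minimal sketch, the Euclidean distance can be computed with NumPy as the square root of the summed squared differences between components. The example vectors are arbitrary.

```python
import numpy as np


def euclidean_distance(r, c):
    """Square root of the summed squared differences between components."""
    r = np.asarray(r, dtype=float)
    c = np.asarray(c, dtype=float)
    return float(np.sqrt(np.sum((r - c) ** 2)))


R = [1.0, 2.0, 2.0]
C = [4.0, 6.0, 2.0]
print(euclidean_distance(R, C))  # 5.0 (differences are -3, -4, 0)
```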

How do we interpret this kind of measure in practice? Given three vectors, we can determine which of two candidates is closer to the reference. But if I just tell you that the distance between the candidate and the reference is 7, does that count as close? It’s hard to say.

In fact, it might be close or far depending on the algorithm used and the data it was trained with. Most algorithms only learn relative distances, to ensure that we can solve problems like “which candidate is closer.”

Additionally, many aggregation mechanisms are sensitive to the length of the text being compared. For example, if we summed the word embeddings for a selection of text and compared the result to the reference word “spam”, then “airplane” might be closer in Euclidean distance than “spam spam spam spam spam” (which seems wrong). Consider the task of determining whether a summary is “close” to the source text it is derived from: even if the summary covers the same topics as the article, a length-sensitive aggregation might leave us with a large Euclidean distance between the summary and the document.
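The “spam” example can be reproduced with toy numbers. The sketch below assumes embeddings are built by summing word vectors, and the two-dimensional vectors are made up so that the effect is easy to see.

```python
import numpy as np

# Made-up 2-dimensional word vectors, for illustration only.
WORD_VECTORS = {
    "spam": np.array([1.0, 1.0]),
    "airplane": np.array([3.0, 1.0]),
}


def embed_by_sum(text):
    """Embed text by adding together the vectors of its words."""
    return np.sum([WORD_VECTORS[w] for w in text.split()], axis=0)


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(a - b))


reference = embed_by_sum("spam")
d_airplane = euclidean(reference, embed_by_sum("airplane"))
d_repeated = euclidean(reference, embed_by_sum("spam spam spam spam spam"))

print(d_airplane)  # 2.0
print(d_repeated)  # about 5.66, farther away despite being the same word repeated
```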

If raw Euclidean distance doesn’t work, a natural next step might be to measure the angle between the reference and candidate: if the angle is small then the two vectors are likely representing text that’s “talking about the same thing.” And since the angle has to be between 0 and π radians (aka between 0° and 180°), it’s a little more interpretable than the Euclidean distance. Unfortunately, measuring the angle between two vectors is somewhat computationally expensive. Luckily, it turns out that computing the cosine of the angle between two vectors is much cheaper, and has essentially the same benefits. The cosine of the angle will be between -1 and 1. The closer the cosine is to 1, the closer the vectors are to pointing in the same direction. The formula for this so-called cosine similarity is as follows:

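The cosine similarity between a reference vector R and a candidate vector C in an n-dimensional embedding space: $\cos\theta = \frac{R \cdot C}{\|R\|\,\|C\|} = \frac{\sum_{i=1}^{n} R_i C_i}{\sqrt{\sum_{i=1}^{n} R_i^2}\,\sqrt{\sum_{i=1}^{n} C_i^2}}$.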
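A minimal NumPy sketch of cosine similarity follows: the dot product of the two vectors divided by the product of their lengths. The example vectors are arbitrary. Note that a vector and a scaled copy of it have a cosine similarity of 1 regardless of how far apart they are in Euclidean terms.

```python
import numpy as np


def cosine_similarity(r, c):
    """Dot product of r and c divided by the product of their lengths."""
    r = np.asarray(r, dtype=float)
    c = np.asarray(c, dtype=float)
    return float(np.dot(r, c) / (np.linalg.norm(r) * np.linalg.norm(c)))


base = [1.0, 1.0]
scaled = [5.0, 5.0]       # same direction, five times longer
orthogonal = [1.0, -1.0]  # at a right angle to base
opposite = [-1.0, -1.0]   # pointing the other way

print(cosine_similarity(base, scaled))      # approximately 1
print(cosine_similarity(base, orthogonal))  # 0
print(cosine_similarity(base, opposite))    # approximately -1
```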

To visualize how the cosine similarity and Euclidean distance relate to one another, consider the following diagram.

Illustration of a two dimensional embedding space, where the embedding for a selection of text is produced by adding together the embeddings for the individual words in the text. Note that the x and y axes correspond to features learned by the embedding algorithm, and may not be immediately interpretable to humans.

Let’s try these measures out on our “Odyssey” example:

Euclidean distance vs cosine similarity for the example answers, with embeddings performed by the sentence-transformers/average_word_embeddings_glove.840B.300d embedding model.

Both proximity measures roughly follow the trend that the correct answers tend to be closer to the reference than the incorrect ones, though neither is a perfect discriminator (“Wrong” scores closer than “Reword” on both cosine similarity and Euclidean distance). To get a sense of the differences between the Euclidean and cosine measures, it’s instructive to use this same embedding algorithm on another “answer” composed simply of the words “Greek epic poem.” This answer scores a cosine similarity of 0.76 (higher than that of “Reword”) and a Euclidean distance of 3.52 (worse than ALL the Euclidean distances of the other answers).

Comparing across embedding algorithms

Unfortunately, in the example above, “Wrong” and “Mislead” were both scored as closer to the reference than “Reword” was. There’s no threshold value we could pick where all the correct answers are closer than the threshold and all the incorrect ones are farther. Let’s see what happens when we use a different embedding algorithm: one based on the BERT model, which is commonly used for embeddings. We’ll also use OpenAI’s recommended embedding model, ada-002.

Cosine similarities between the embedded vector for the reference answer and the embedded vectors for the other answers. Three embedding algorithms are used: a GloVe-based averaged word embedding model, a DistilBERT model, and OpenAI’s ada-002.

Across all algorithms, the average of the correct answers is higher than the average of the incorrect answers. However, ALL these algorithms score “Wrong” more highly than “Reword,” and all but ada-002 score “Mislead” more highly than “Reword.” Why is this? In the case of “Wrong”, it (unlike “Reword”) uses some of the exact same words to talk about the same general subjects: the ancient Greeks, Homer, Troy, gods… In the case of “Mislead,” a lot of the same words are used, but with very different meanings (“Journey” as a band rather than a trip, “Homer” as a Simpson rather than as a poet, etc.).
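The two observations above (a higher average for correct answers, yet no clean cutoff) can be checked mechanically. The sketch below uses made-up similarity values chosen only to mirror the ordering described in the text. They are not the scores produced by any of the three models.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def separating_threshold(scores, correct):
    """Return a threshold that splits correct from incorrect answers, or None.

    `scores` maps answer name -> similarity (higher means closer).
    `correct` is the set of names a human graded as correct.
    """
    lowest_correct = min(s for name, s in scores.items() if name in correct)
    highest_incorrect = max(s for name, s in scores.items() if name not in correct)
    if lowest_correct > highest_incorrect:
        return (lowest_correct + highest_incorrect) / 2
    return None


# Hypothetical values, not measured results.
scores = {"Subset": 0.95, "Reword": 0.70, "Wrong": 0.85, "Mislead": 0.60, "Irrelevant": 0.30}
correct = {"Subset", "Reword"}

mean_correct = mean([s for name, s in scores.items() if name in correct])
mean_incorrect = mean([s for name, s in scores.items() if name not in correct])

print(mean_correct > mean_incorrect)          # True: the averages look fine
print(separating_threshold(scores, correct))  # None: "Wrong" outscores "Reword"
```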

It seems that these embedding algorithms don’t do a great job at capturing the precise semantics of the text, but are still able to correlate somewhat with the relevance of the candidates to the reference. For this reason, you will more often see cosine similarity used in cases like recommendation algorithms or semantic search, where it’s more important to find documents talking about similar things than it is to distinguish correctness. Can we do better?

Formal Embedding Metrics

Despite the failure of the somewhat simplistic proximity measures to accurately classify correctness, more complex metrics might be able to get better results.

One example of an embedding-based metric designed to score document similarity is BertScore. One way that it improves upon direct proximity is by comparing contextualized embeddings of the tokens (roughly corresponding to words/word fractions). This means that the vector for each token contains some information not only about that word, but also about the words around it. This might help distinguish “Homer Simpson” from “the poet Homer,” for example. Using these contextual embeddings, it greedily matches each token from the reference with the best matching token from the candidate, then computes the average cosine similarity across those matches (technically this is one of three main variations on the BertScore: BertScore-R).
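The greedy matching step can be sketched in a few lines of NumPy. This is a simplified illustration of the recall-style calculation only: the token vectors below are toy values rather than contextual BERT embeddings, and options such as IDF weighting are left out.

```python
import numpy as np


def greedy_recall(ref_tokens, cand_tokens):
    """Recall-style greedy matching score.

    For each reference token vector, take the cosine similarity of its
    best-matching candidate token vector, then average over reference tokens.
    """
    ref = np.asarray(ref_tokens, dtype=float)
    cand = np.asarray(cand_tokens, dtype=float)
    # Normalize rows so that a dot product equals cosine similarity.
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    sim = ref @ cand.T  # sim[i, j] = cosine(reference token i, candidate token j)
    return float(sim.max(axis=1).mean())


# Toy token vectors (one row per token), invented for illustration.
ref_tokens = [[1.0, 0.0], [0.0, 1.0]]
cand_tokens = [[1.0, 0.0], [1.0, 1.0]]

print(greedy_recall(ref_tokens, cand_tokens))  # about 0.854
```

Here the first reference token finds an exact match (similarity 1) and the second finds only a partial one (about 0.707), so the score is their average.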

BertScore-R, as calculated by the default settings of the BertScore metric in the summ_eval library.

Here, we see that despite using a BERT-based embedding like the one in our cosine similarity comparison, we are now able to correctly classify that “Reword” is closer to the correct answer than “Mislead.” This is likely because the extra contextual information leveraged by BertScore is able to distinguish the different uses of words in the two cases. Sadly, “Wrong” is still scored more highly than “Reword.” This is likely because while our improved metric can pick up on the difference between “Homer the Simpson” and “Homer the Poet”, it still isn’t able to pick up on more subtle semantic differences, like the fact that Homer is being referred to as the author in “Reword,” but as the protagonist in “Wrong.” Are we stuck yet? Fortunately we have one more trick up our sleeves.

AI Metrics

Most people who have been paying attention to the recent LLM craze might be surprised that these mechanisms aren’t capable of reliably detecting the difference between a simple paraphrased description of the Odyssey and a fake song about a sitcom character. With the recent advances in LLMs, it is possible to define a metric where we simply ask an LLM whether a candidate answer is consistent with a reference. I supplied the following prompt to GPT-3.5-turbo:

You are an impartial judge whose role is to determine
the consistency of two descriptions. If the descriptions are 
consistent with one another, give them a rating of 1. If they are
inconsistent, give them a rating of 0. Output
your rating in the format 'RATING: 0' or 'RATING: 1'.

{reference answer}
{candidate answer}

I then extracted a score of either 0 or 1 depending on whether ‘RATING: 0’ or ‘RATING: 1’ showed up in the response. Here are the results:

This is exactly what a human would score! The power of LLMs to understand the semantics of text in general seems to surpass the ability of even complex embedding-based mechanisms.
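The score extraction step described above can be sketched with a regular expression. The function name is hypothetical, the call to the LLM itself is omitted, and returning `None` when no rating is found is an assumption of this sketch rather than something the original experiment specified.

```python
import re

# Matches 'RATING: 0' or 'RATING: 1'; the trailing \b rejects e.g. 'RATING: 10'.
RATING_PATTERN = re.compile(r"RATING:\s*([01])\b")


def extract_rating(response_text):
    """Return 0 or 1 if the response contains a rating, else None."""
    match = RATING_PATTERN.search(response_text)
    return int(match.group(1)) if match else None


print(extract_rating("The two descriptions agree. RATING: 1"))  # 1
print(extract_rating("RATING: 0"))                              # 0
print(extract_rating("I am not able to decide."))               # None
```

In practice it is worth deciding up front how to treat responses with no parseable rating, since a model will not always follow the requested output format.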


So…why would one ever use embedding-based metrics for scoring text against a reference? First of all, you shouldn’t take one anecdotal example as proof that embedding-based metrics perform poorly. The intent of this article is to illustrate, not to prove. Embedding-based metrics have been shown in real-world use cases to correlate well with human judgements (the paper on BertScore, for example, cites a Pearson correlation with human judgement of >0.99 on the WMT18 machine translation scoring task).
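For readers unfamiliar with how such a correlation is computed, here is a minimal sketch of the Pearson coefficient between a metric's scores and human ratings. The numbers are invented for illustration and have nothing to do with the WMT18 data.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))


# Hypothetical scores for four systems, not real evaluation data.
metric_scores = [0.2, 0.4, 0.6, 0.8]
human_scores = [1.0, 2.0, 3.0, 4.0]

print(pearson(metric_scores, human_scores))    # approximately 1: perfectly linear
print(pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))  # approximately -1: inverse relationship
```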

But setting aside that caveat, why not always use AI-based metrics? LLMs are improving at a rapid pace and already appear to have a greater capacity for semantic understanding than the other metrics discussed here. Ultimately, it comes down to a tradeoff of time and compute cost against fidelity. In general, the more reliable you want your metrics to be, the more expensive they will be to compute. As a reference, here is a table showing the time it took to compute each of the metrics discussed in this article across all candidate answers:

Sample time in seconds to compute select metrics on the author’s 2022 M1 MacBook Pro

Also worth noting is that while the others ran comfortably on my laptop, GPT-3.5 required an API call to a service that almost certainly could not run there. There is a clear correlation here between metrics that capture more semantics and metrics that take more time to calculate. Thus, embedding-based metrics likely still have a place when the tradeoff between compute time and fidelity becomes important.
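If you want to run this kind of timing comparison on your own metrics, a minimal sketch with `time.perf_counter` looks like the following. The helper names and the toy word-overlap metric are hypothetical stand-ins, and this is not the harness used to produce the timings above.

```python
import time


def time_metric(metric_fn, reference, candidates):
    """Return (scores, elapsed_seconds) for scoring every candidate against the reference."""
    start = time.perf_counter()
    scores = [metric_fn(reference, c) for c in candidates]
    elapsed = time.perf_counter() - start
    return scores, elapsed


def word_overlap(reference, candidate):
    """Toy stand-in metric: share of candidate words that appear in the reference."""
    ref_words = set(reference.lower().split())
    cand_words = candidate.lower().split()
    if not cand_words:
        return 0.0
    return sum(w in ref_words for w in cand_words) / len(cand_words)


scores, elapsed = time_metric(word_overlap, "a b c", ["a b", "x"])
print(scores)   # [1.0, 0.0]
print(elapsed)  # wall-clock seconds; varies by machine
```

For metrics backed by an API call, the elapsed time will include network latency, so it is worth repeating the measurement a few times.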

As usual though, the most expensive time is that of the humans controlling the computers. If you want to quickly assess your use case across a variety of OSS and closed LLMs, using AI metrics and more, check out the free evaluation tool we’re working on at Airtrain.ai.
