How Do You Evaluate Large Language Model Apps — When 99% is just not good enough?
LLMs are fundamentally changing the way practitioners evaluate performance. Let’s look at the recent progress towards evaluating LLMs in production.
Prior to the release of Large Language Models (LLMs) like ChatGPT, which do extremely well out of the box on custom use cases, model evaluations were fairly standard. You split your data into training/test/dev sets, trained your model on the training set, and evaluated performance on the test/dev set. I’m also including transfer learning and fine-tuning of models in this category. The slight difference is that a model trained from scratch starts at an accuracy no better than random guessing, whereas a pre-trained model chosen for fine-tuning might already perform decently on unseen data thanks to a general understanding of semantics, and fine-tuning on custom data improves it considerably.
We are seeing LLMs like ChatGPT, GPT-4, Claude, Bard, etc. perform extremely well in most cases, but also poorly in some cases. These models are known to “hallucinate”, which makes it extremely hard to judge performance. Let’s step back and understand why LLMs hallucinate, and why this was seemingly not a problem for earlier models.
Until a couple of years ago, ML models were mainly built for specific tasks and performed well at these narrow tasks. Some examples include predicting sentiment, or recommendation systems that predicted which movie or product a user was likely to click on. But these models were almost never used out of the box. You would either train them on your own data, or fine-tune language models like BERT, which have been trained on a broad corpus of data, for your specific scenario. It was not uncommon to see large performance boosts of 20% or more when fine-tuning models on custom data. (I’ve written an article on fine-tuning transformers, linked at the end, that might be useful.)
However, I would argue that these models were not meant to see data outside their training set. E.g. if I fine-tuned a BERT model on extracting drugs prescribed by doctors from medical documents, I would not expect the same model to perform well on extracting dog breeds from adoption documents. If the model extracted useless information, this wouldn’t be called “hallucination”.
However, LLMs promise to be applicable to a wide variety of tasks because of their model size, the large amounts of data they have seen, and the training regimens they went through, which is a game changer. But when they are wrong, the results can be quite deceiving, giving the impression that these sophisticated models are purposely giving the wrong answer even though they “know” the right one, or that whoever trained the model purposely chose to make it spew nonsense that is untrue yet looks plausible.
When an LLM fails, it does not fail as obviously as a traditional ML model, but the impression it leaves is spectacular, like a child that has been caught lying even though they knew better.
On the bright side, the community is slowly finding ways to bound the space of answers LLMs can give, or to add guardrails so that the answers are either true or the model admits that it does not know the right answer. This can be done through prompt engineering, model parameter tuning, connecting to external data stores, etc. (See the article on Retrieval Augmented Generation, RAG, linked at the end, which might be useful.)
It becomes considerably more important that we focus on evaluating this larger LLM paradigm as a whole, instead of focusing solely on the ML/LLM model itself, as we are used to doing.
Let’s have a look at some recent innovations in evaluating LLMs, but first — let’s talk about traditional model evaluation.
Traditional Evals
Depending on the language task, there are many standard evaluation metrics. For extractive tasks, like extracting the right answer to a question from a given context, exact match and F1 are used to compare the string output by the model against the “gold standard” answer.
For example, the gold standard answer for the question “To whom did the Virgin Mary allegedly appear in 1858 in Lourdes, France?”
with context:
“Architecturally, the school has a Catholic character. Atop the Main Building’s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend “Venite Ad Me Omnes”. Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.”
is: “Saint Bernadette Soubirous.”
Let’s say the answer given by the model is “to Saint Bernadette Soubirous”.
With the exact match metric, the score of this answer is either 1 or 0; in this case it is 0 because the model output does not exactly match the gold standard answer. The F1 score is more forgiving because it compares the words in the prediction with the words in the gold standard reference. Here the prediction shares three of its four words with the gold answer (precision of 0.75) and covers all three gold-answer words (recall of 1.0), giving an F1 of roughly 0.86.
F1 = 2 * precision * recall / (precision + recall), where precision is the ratio of the number of shared words to the total number of words in the prediction, and recall is the ratio of the number of shared words to the total number of words in the gold standard.
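To make this concrete, here is a minimal sketch of exact match and token-level F1 for the example above. It uses a simplified SQuAD-style normalization (lowercasing and stripping punctuation); the exact normalization any given benchmark applies may differ.

```python
import re
import string
from collections import Counter

def normalize(text):
    # Lowercase, strip punctuation, and collapse whitespace (simplified normalization).
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"\s+", " ", text).strip()

def exact_match(prediction, gold):
    return int(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_shared = sum(common.values())
    if num_shared == 0:
        return 0.0
    precision = num_shared / len(pred_tokens)
    recall = num_shared / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

gold = "Saint Bernadette Soubirous."
prediction = "to Saint Bernadette Soubirous"
print(exact_match(prediction, gold))            # 0
print(round(token_f1(prediction, gold), 2))     # 0.86 (precision 0.75, recall 1.0)
```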
For generative tasks like summarization, more advanced evaluation methods are needed to compute similarity scores between longer, abstractive pieces of text. The ROUGE score measures the overlap between n-grams. ROUGE-1 measures the overlap of unigrams, similar to the individual word matching discussed for Q&A above, whereas ROUGE-2 measures the overlap between bigrams in the reference and predicted summaries. The ROUGE-L metric is slightly different, in that it finds the longest common subsequence between reference and prediction. ROUGE-L naturally takes sentence-level structural similarity into account and automatically identifies the longest in-sequence co-occurring n-grams.
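As a quick illustration of how ROUGE is computed in practice, here is a small sketch using the open-source rouge-score package; the reference and prediction strings are made-up examples.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

reference = "The cat sat on the mat near the window."
prediction = "A cat was sitting on the mat by the window."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, prediction)
for name, score in scores.items():
    # Each score exposes precision, recall, and F-measure components.
    print(name, round(score.fmeasure, 3))
```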
The BLEU score is another metric for evaluating longer outputs, typically used for tasks like machine translation. BLEU scores have been shown to correlate well with human assessments of translation quality.
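Similarly, here is a minimal BLEU sketch using NLTK’s sentence_bleu; the token lists are toy examples, and the smoothing choice affects the score.

```python
# pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]  # list of tokenized references
candidate = ["the", "cat", "sat", "on", "the", "mat"]   # tokenized model output

# Smoothing avoids zero scores when higher-order n-grams have no overlap.
smooth = SmoothingFunction().method1
print(round(sentence_bleu(reference, candidate, smoothing_function=smooth), 3))
```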
Benchmark datasets like SuperGLUE evaluate a broad range of linguistic skills and semantic capabilities, rather than performance on a single narrow task. Here is an example from the SuperGLUE paper, evaluating performance over various tasks:
However, since 2019, most newer language models have reached human-like performance on SuperGLUE. And as mentioned earlier, recent LLMs have no problems with many simple cases but perform extremely poorly on edge cases, and it is hard to know a priori what those cases are.
Evaluating language understanding
Since LLMs are particularly useful in scenarios where the user can interact with them freely and seamlessly, it is important to benchmark their performance across a wide range of scenarios that capture the universe of possible user inputs. The Measuring Massive Multitask Language Understanding (MMLU) paper introduced a test covering 57 tasks, including mathematics, history, computer science, and law. The authors uncovered cases where models perform well in some categories but poorly in others.
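To make the setup concrete, here is a rough sketch of an MMLU-style multiple-choice accuracy loop. The ask_llm() helper and the item format are assumptions for illustration; the official benchmark uses its own prompt templates and few-shot examples.

```python
# A minimal multiple-choice accuracy loop. ask_llm(prompt) is a hypothetical
# helper that wraps whichever model you are testing and returns its text reply.
CHOICES = ["A", "B", "C", "D"]

def format_question(item):
    options = "\n".join(f"{c}. {o}" for c, o in zip(CHOICES, item["options"]))
    return (
        "Answer the following multiple-choice question with a single letter.\n\n"
        f"{item['question']}\n{options}\nAnswer:"
    )

def evaluate(dataset, ask_llm):
    correct = 0
    for item in dataset:
        reply = ask_llm(format_question(item)).strip()
        predicted = reply[:1].upper()  # take the first letter as the model's choice
        correct += int(predicted == item["answer"])
    return correct / len(dataset)

# Example item in the assumed format:
# {"question": "...", "options": ["...", "...", "...", "..."], "answer": "C"}
```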
Evaluating on “hard” tasks
In 2019, the HellaSwag paper introduced a clever method for generating data that is hard for language models to get right in sentence-completion tasks. Given an event description such as “A woman sits at a piano,” a model must select the most likely follow-up: “She sets her fingers on the keys.”
HellaSwag built on the SWAG dataset, in which a model is given a context from a video caption and four possible choices for what happens next. Only one choice is right: the actual next caption of the video.
On the SWAG dataset, models like BERT achieved near-human accuracy (~88%). HellaSwag, however, introduced adversarial filtering (AF) in combination with a generator of negative candidates: wrong endings that violate human assumptions about the world. Through multiple iterations, AF produces a final dataset that is easy for humans to classify but difficult for models. Humans achieved over 95% accuracy, whereas models like BERT could only reach about 48%.
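Completion benchmarks like SWAG and HellaSwag are often scored for language models by picking the ending to which the model assigns the highest (length-normalized) likelihood. Below is a rough sketch of that idea using GPT-2 via Hugging Face transformers; it illustrates the scoring principle, not the official evaluation harness.

```python
# pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def ending_logprob(context, ending):
    # Average log-probability of the ending tokens, conditioned on the context.
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + " " + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Shift so each position predicts the next token.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, ctx_len - 1:].mean().item()  # keep only the ending tokens

context = "A woman sits at a piano."
endings = [
    "She sets her fingers on the keys.",
    "She eats the piano for breakfast.",
]
best = max(endings, key=lambda e: ending_logprob(context, e))
print(best)
```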
LLM as a judge??
Creating annotations to evaluate LLMs can be challenging, as annotating data at scale is expensive. This matters especially for LLMs that are right in many cases but, when wrong, are wrong in a highly convincing way. An interesting idea in the MT-Bench paper was to use LLMs themselves as judges of other models’ answers, in a kind of peer-reviewed assessment. As you can see below, model performance varies considerably depending on the task.
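Here is a rough sketch of the pairwise-comparison flavor of LLM-as-judge, again assuming a hypothetical ask_llm() wrapper around a strong judge model; the actual MT-Bench prompts and scoring rubric are more elaborate.

```python
JUDGE_PROMPT = """You are an impartial judge. Given a user question and two
assistant answers, decide which answer is better. Reply with exactly "A", "B",
or "tie".

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Verdict:"""

def judge_pair(question, answer_a, answer_b, ask_llm):
    # Query the judge twice with the answers swapped to reduce position bias,
    # an issue discussed in the MT-Bench paper.
    v1 = ask_llm(JUDGE_PROMPT.format(question=question, answer_a=answer_a, answer_b=answer_b)).strip()
    v2 = ask_llm(JUDGE_PROMPT.format(question=question, answer_a=answer_b, answer_b=answer_a)).strip()
    if v1 == "A" and v2 == "B":
        return "A"
    if v1 == "B" and v2 == "A":
        return "B"
    return "tie"
```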
Ethical Evaluations
One problem LLMs face is the apparent ease with which they concoct data and false answers. As an example, a judge recently sanctioned lawyers for submitting a brief written by ChatGPT that contained fake citations. TruthfulQA is a benchmark containing questions designed to elicit false answers from models.
Interestingly, the authors of the paper found the larger versions of GPT-3 to be more likely to generate falsehoods, as shown below:
Ok Great — But how do I use these evals in production?
Traditionally, these evals have been used to decide which ML model to put into production. But once you make that choice, say GPT-4, the undisputed leader (for now), what more do you need? Would you just lie down and accept hallucinations?
No, you still have a lot of scope to engineer whichever LLM you choose into your application. Key aspects of model performance that require engineering include:
Prompt Tuning — how you frame the prompt can lead to considerably different results. A simple example is summarization: if you want to summarize a news article, you need to focus on very different information than when summarizing a conversation with a healthcare provider.
Embeddings Retrieval — in many cases you want to connect custom data to LLMs so that you give them the right context. Choosing the right embedding model and similarity metric is important.
Model Parameters — LLMs have multiple parameters like temperature, top-k, repeat penalty, etc.
Data Storage — for more advanced retrieval augmented generation (RAG) solutions, you might need to consider how to store and retrieve data, e.g. does storing document keywords alongside embeddings help in retrieving the correct information? (A minimal retrieval sketch follows this list.)
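As a minimal illustration of the retrieval piece mentioned above, here is a sketch of cosine-similarity retrieval over embeddings, with the retrieved passages packed into a grounded prompt. The embed() function is a placeholder for whichever embedding model you use.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query, documents, embed, top_k=3):
    # embed() is a placeholder for your embedding model (e.g. a sentence
    # transformer or an embeddings API); it should map a string to a 1-D vector.
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine_similarity(query_vec, embed(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, passages):
    context = "\n\n".join(passages)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

# The assembled prompt is then sent to the LLM with whatever inference
# parameters (temperature, top-k, etc.) you are evaluating.
```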
Once you know which parameters you can adjust, you can decide how to evaluate your application’s performance and which metrics to use.
It might be hard to come up with representative user inputs for building evaluation datasets a priori, in which case you can use representative examples from the benchmark datasets discussed above (MT-Bench, MMLU, TruthfulQA) as a guideline.
There have been some recent developments in streamlining LLM evaluation. One example is RAGAS, which attempts to evaluate LLM outputs on various metrics like faithfulness, relevancy, and harmfulness. Another example is LangSmith (from the founders of LangChain), a framework to debug, test, evaluate, and monitor LLM applications.
While earlier in the year people were racing to use generative AI as a personal search engine, the focus now is on how to integrate these models into production. There are lots of exciting times ahead in productionizing, monitoring, and evaluating LLMs.
I look forward to reading in the comments about what folks are building, how you evaluate it, and the problems you’ve encountered!
Here are some related articles:
LLM Evaluations:
4 Crucial Factors for Evaluating Large Language Models in Industry Applications
Extractive vs Generative Q&A — Which is better for your business?
When Should You Fine-Tune LLMs?
Unleashing the Power of Generative AI For Your Customers
LLM Deployment:
Deploying Open-Source LLMs As APIs
LLM Systems Design:
Build Industry-Specific LLMs Using Retrieval Augmented Generation
How Do You Build A ChatGPT-Powered App?
Fine-Tune Transformer Models For Question Answering On Custom Data
LLM Economics: ChatGPT vs Open-Source
The Economics of Large Language Models
LLM Tooling:
Build A Custom AI Based ChatBot Using Langchain, Weaviate, and Streamlit
How OpenAI’s New Function Calling Breaks Programming Boundaries