Choosing the right LLM
Choosing the right LLM at the beginning of a new project is important, as picking the wrong one can lead to significant rework. Each LLM has its own idiosyncrasies when it comes to getting the best answers. They require prompt engineering, and making them consistent and repeatable can be tricky.
How can you compare LLMs?
New LLMs are released every week, and there are now hundreds, if not thousands. People are sharing why they think Gemini is better than GPT-4 or why Llama 3 is stronger than Mistral. Many people are fans of Anthropic's Claude 3 Opus. Your options may also be limited by your infrastructure and where your data resides. Whether you are working with Microsoft, Amazon, Google or others, there are options for which models can be used without leaving your chosen environment.
Outside of any technical constraints and people’s opinions, how can you compare LLMs to understand which one might be best for you?
Two reliable methods are commonly employed for this purpose: benchmarks and leaderboards. Both are effective tools for comparing LLMs and can be used together to assess which models might be best for you.
Benchmarks
You may have seen some of the common benchmarks, such as MMLU, HellaSwag, HumanEval, GSM8K or TruthfulQA.
When any new LLM is released, it comes with benchmark scores. You can see a whole host of benchmarks here.
These benchmarks explore different aspects of what an LLM might be able to do. They explore how good the LLM is at:
- Truthfulness: A test of whether the LLM produces truthful answers, e.g. TruthfulQA.
- Reasoning, logic and common sense: A test of an LLM's ability to apply logic and everyday knowledge to solve problems, e.g. HellaSwag (which you can learn more about here) or BBH.
- Language understanding and Question Answering (QA): These evaluate a model's ability to interpret text and answer questions accurately, e.g. GLUE and MMLU.
- Coding: Benchmarks in this category evaluate LLMs on their ability to interpret and generate code, e.g. HumanEval.
- Chat and Conversation: A test of how well the model manages conversations, e.g. MT-Bench.
- Translation: These assess the model's ability to translate text from one language to another accurately, e.g. WMT2014 English-French.
- Maths: These focus on a model's ability to solve maths problems, from basic arithmetic to more complex areas such as calculus, e.g. GSM8K.
When considering which benchmarks to look at, think about what's important for your LLM use case. Each benchmark has detailed information about what it tests and how models perform on it. Most of them can be found here.
If you want to learn more about how you might use a benchmark in your code, check out this page.
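As a rough illustration, here is a minimal sketch of scoring a model against a benchmark yourself. It assumes the Hugging Face datasets library is installed, and `ask_llm` is a placeholder standing in for whichever model client you actually use:

```python
# A minimal sketch of running a handful of GSM8K questions through a model.
from datasets import load_dataset

def ask_llm(prompt: str) -> str:
    """Placeholder: swap in a real call to your model provider here."""
    return "#### 0"  # dummy answer so the sketch runs end to end

def extract_final_answer(text: str) -> str:
    # GSM8K reference answers end with "#### <number>", so compare the final number only.
    return text.split("####")[-1].strip().replace(",", "")

dataset = load_dataset("gsm8k", "main", split="test").select(range(20))  # small sample

correct = 0
for row in dataset:
    prediction = ask_llm(f"Solve this problem and finish with '#### <answer>':\n{row['question']}")
    if extract_final_answer(prediction) == extract_final_answer(row["answer"]):
        correct += 1

print(f"Accuracy on sample: {correct / len(dataset):.0%}")
```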
Arenas and Leaderboards
Beyond benchmarks, LLM arenas pit different models against each other in blind competitions.
The LMSys Leaderboard is the most widely used arena. You can check it out here.
It uses an Elo ranking system (which you can learn more about here) to score LLM performance. What's useful about arenas is that they look at both accuracy and preference. Benchmarks just measure how correct a model is, whereas the arena also considers the LLM's ability to generate good-quality output.
People can put in a prompt that is run through two models (picked by the system). The user can then decide if Model A or B is better, whether they’re tied or whether neither of them did a particularly good job at the task given. In this way, human preference is being captured.
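To make the mechanics concrete, here is a simplified sketch of how Elo-style ratings shift as those pairwise votes come in. This is the textbook Elo update rule, not LMSys's exact implementation:

```python
# A simplified illustration of Elo-style scoring from pairwise preference votes.
def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if model A is preferred, 0.0 if model B is, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

ratings = {"model_a": 1000.0, "model_b": 1000.0}
# Hypothetical votes: 1.0 = A preferred, 0.0 = B preferred, 0.5 = tie
for vote in [1.0, 1.0, 0.5, 0.0, 1.0]:
    ratings["model_a"], ratings["model_b"] = update_elo(
        ratings["model_a"], ratings["model_b"], vote
    )

print(ratings)  # model_a drifts above model_b after winning most votes
```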
LMSys are building an amazing dataset that will be useful for future model training and testing.
Specialist Benchmarks
What we've covered so far is useful for foundation models and those that will be used for general use cases, but it lacks deep domain relevance. For instance, models are being developed specifically for medical knowledge. It's more important that they are factually correct on medical details than that they know who the Kings and Queens of England were between 1000 and 2000.
As with many things in the LLM space, HuggingFace is a good starting point for understanding what is out there. Here is a leaderboard dedicated to medical knowledge.
If you are planning to launch a model that understands medical information, it would be good to test it against benchmarks such as MedQA or PubMedQA. MMLU does have medical questions and is still a relevant benchmark to use.
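The pattern is the same as for general benchmarks. The sketch below checks a model against a small sample of PubMedQA's yes/no/maybe questions; it assumes the dataset is available on the Hugging Face Hub under the ID `pubmed_qa` with the `pqa_labeled` config, and `ask_llm` is again a placeholder for your model call:

```python
# A sketch of checking a model against PubMedQA's yes/no/maybe questions.
from datasets import load_dataset

def ask_llm(prompt: str) -> str:
    """Placeholder: swap in a real call to your model provider here."""
    return "maybe"  # dummy answer so the sketch runs end to end

# Assumed dataset ID and config; adjust if the Hub listing differs.
dataset = load_dataset("pubmed_qa", "pqa_labeled", split="train").select(range(10))

correct = 0
for row in dataset:
    prompt = f"Answer yes, no or maybe.\nQuestion: {row['question']}"
    if ask_llm(prompt).strip().lower() == row["final_decision"]:
        correct += 1

print(f"{correct}/{len(dataset)} correct on sample")
```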
It’s worth exploring your own domain to see if there are relevant benchmarks and datasets to assess the models you are looking to use or the ones you are creating or fine-tuning.
Building your own benchmarks
When considering your models, the key thing is to assess them on the task that most closely matches your use case. So, if you are looking for an LLM to help support Customer Service Agents, build your own benchmark that covers all of the different scenarios and questions you would want help with. If you want an LLM to write emails for you, create a benchmark or an arena that helps you figure out which models write best. If you want an LLM to help field HR questions, ensure your benchmark covers the key questions you are asked and the acceptable answers.
If you have a system that manages conversations or QA and captures how humans are currently answering questions or interacting with customers, then those records can be the best source of training, validation and benchmark datasets.
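A home-grown benchmark doesn't need to be elaborate. The sketch below is one possible shape for it, where the customer-service scenarios, the `must_mention` checks and `ask_llm` are all illustrative placeholders:

```python
# A minimal sketch of a home-grown benchmark: a handful of your own scenarios,
# each with phrases an acceptable answer should mention.
def ask_llm(prompt: str) -> str:
    """Placeholder: swap in a real call to the model under evaluation."""
    return "You can request a refund within 30 days via the returns portal."

benchmark = [
    {"question": "How do I get a refund?", "must_mention": ["30 days", "returns portal"]},
    {"question": "What are your support hours?", "must_mention": ["9am", "5pm"]},
]

results = []
for case in benchmark:
    answer = ask_llm(case["question"]).lower()
    passed = all(phrase.lower() in answer for phrase in case["must_mention"])
    results.append(passed)

print(f"Passed {sum(results)}/{len(results)} scenarios")
```

Keyword checks like this are deliberately crude; as your benchmark matures you could swap in human review or an LLM-as-judge step to grade answers more fairly.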