Choosing the right LLM
Choosing the right LLM at the beginning of a new project is important, as picking the wrong one can lead to significant rework. Each LLM has its own idiosyncrasies when it comes to getting the best answers. They require prompt engineering, and making them consistent and repeatable can be tricky.
How can you compare LLMs?
New LLMs are released every week, and there are now hundreds, if not thousands. People are sharing why they think Gemini is better than GPT-4 or why Llama 3 is stronger than Mistral. Many people are fans of Anthropic's Claude 3 Opus. Your options may also be limited by your infrastructure and where your data resides. Whether you are working with Microsoft, Amazon, Google or others, there are options for which models can be used without leaving your chosen environment.
Outside of any technical constraints and people’s opinions, how can you compare LLMs to understand which one might be best for you?
Two reliable methods are commonly employed for this purpose: benchmarks and leaderboards. Both are effective tools for comparing LLMs and can be used together to assess which models might be best for you.
Benchmarks
You may have seen some of the common benchmarks, such as MMLU, HellaSwag, HumanEval, GSM8K or TruthfulQA.
When any new LLM is released, it comes with benchmark scores. You can see a whole host of benchmarks here.
These benchmarks explore different aspects of what an LLM might be able to do. They explore how good the LLM is at:
- Truthfulness: A test of whether the LLM produces truthful answers, e.g. TruthfulQA.
- Reasoning, logic and common sense: A test of an LLM's ability to apply logic and everyday knowledge to solve problems, e.g. HellaSwag (which you can learn more about here) or BBH.
- Language understanding and Question Answering (QA): These evaluate a model's ability to interpret text and answer questions accurately, e.g. GLUE and MMLU.
- Coding: Benchmarks in this category evaluate LLMs on their ability to interpret and generate code, e.g. HumanEval.
- Chat and Conversation: A test of how well the model manages conversations, e.g. MT-Bench.
- Translation: These assess the model's ability to translate text from one language to another accurately, e.g. WMT2014 English-French.
- Maths: These focus on a model's ability to solve maths problems, from basic arithmetic to more complex areas such as calculus, e.g. GSM8K.
When considering which benchmarks to look at, think about what's important for your LLM use case. Each benchmark has detailed information about what it tests and how models perform on it. Most of them can be found here.
If you want to learn more about how you might use a benchmark in your code, check out this page.
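As a rough illustration, here is a minimal sketch of scoring a model against a benchmark yourself. It assumes the Hugging Face datasets library is installed, and `ask_llm` is a placeholder standing in for whichever model client you actually use:

```python
# A minimal sketch of running a handful of GSM8K questions through a model.
from datasets import load_dataset

def ask_llm(prompt: str) -> str:
    """Placeholder: swap in a real call to your model provider here."""
    return "#### 0"  # dummy answer so the sketch runs end to end

def extract_final_answer(text: str) -> str:
    # GSM8K reference answers end with "#### <number>", so compare the final number only.
    return text.split("####")[-1].strip().replace(",", "")

dataset = load_dataset("gsm8k", "main", split="test").select(range(20))  # small sample

correct = 0
for row in dataset:
    prediction = ask_llm(f"Solve this problem and finish with '#### <answer>':\n{row['question']}")
    if extract_final_answer(prediction) == extract_final_answer(row["answer"]):
        correct += 1

print(f"Accuracy on sample: {correct / len(dataset):.0%}")
```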
Arenas and Leaderboards
Beyond benchmarks, LLM arenas pit different models against each other in blind competitions.
The LMSys Leaderboard is the most widely used arena. You can check it out here.
It uses an Elo ranking system (which you can learn more about here) to score LLM performance. What's useful about arenas is that they look at both accuracy and preference. Benchmarks just measure how correct a model is, whereas the arena also considers the LLM's ability to generate good-quality output.
People can put in a prompt that is run through two models (picked by the system). The user can then decide if Model A or B is better, whether they’re tied or whether neither of them did a particularly good job at the task given. In this way, human preference is being captured.
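To make the mechanics concrete, here is a simplified sketch of how Elo-style ratings shift as those pairwise votes come in. This is the textbook Elo update rule, not LMSys's exact implementation:

```python
# A simplified illustration of Elo-style scoring from pairwise preference votes.
def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if model A is preferred, 0.0 if model B is, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

ratings = {"model_a": 1000.0, "model_b": 1000.0}
# Hypothetical votes: 1.0 = A preferred, 0.0 = B preferred, 0.5 = tie
for vote in [1.0, 1.0, 0.5, 0.0, 1.0]:
    ratings["model_a"], ratings["model_b"] = update_elo(
        ratings["model_a"], ratings["model_b"], vote
    )

print(ratings)  # model_a drifts above model_b after winning most votes
```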
LMSys are building an amazing dataset that will be useful for future model training and testing.
Specialist Benchmarks
What we've covered so far is useful for foundation models and those that will be used for general use cases, but it lacks deep domain relevance. For instance, models are being developed specifically for medical knowledge. It's more important that they are factually correct on medical details than that they know who the Kings and Queens of England were between 1000 and 2000.
As with many things in the LLM space, HuggingFace is a good starting point for understanding what is out there. Here is a leaderboard dedicated to medical knowledge.
If you are planning to launch a model that understands medical information, it would be good to test it against benchmarks such as MedQA or PubMedQA. MMLU does have medical questions and is still a relevant benchmark to use.
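The pattern is the same as for general benchmarks. The sketch below checks a model against a small sample of PubMedQA's yes/no/maybe questions; it assumes the dataset is available on the Hugging Face Hub under the ID `pubmed_qa` with the `pqa_labeled` config, and `ask_llm` is again a placeholder for your model call:

```python
# A sketch of checking a model against PubMedQA's yes/no/maybe questions.
from datasets import load_dataset

def ask_llm(prompt: str) -> str:
    """Placeholder: swap in a real call to your model provider here."""
    return "maybe"  # dummy answer so the sketch runs end to end

# Assumed dataset ID and config; adjust if the Hub listing differs.
dataset = load_dataset("pubmed_qa", "pqa_labeled", split="train").select(range(10))

correct = 0
for row in dataset:
    prompt = f"Answer yes, no or maybe.\nQuestion: {row['question']}"
    if ask_llm(prompt).strip().lower() == row["final_decision"]:
        correct += 1

print(f"{correct}/{len(dataset)} correct on sample")
```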
It’s worth exploring your own domain to see if there are relevant benchmarks and datasets to assess the models you are looking to use or the ones you are creating or fine-tuning.
Building your own benchmarks
When considering your models, the key thing is to assess them on the task that most closely matches your use case. So, if you are looking for an LLM to help support Customer Service Agents, build your own benchmark that covers all of the different scenarios and questions you would want help with. If you want an LLM to write emails for you, create a benchmark or an arena that helps you figure out which models write best. If you want an LLM to help field HR questions, ensure your benchmark covers the key questions you are asked and the acceptable answers.
If you have a system that manages conversations or QA and captures how humans are currently answering questions or interacting with customers, then those records can be the best source of training, validation and benchmark datasets.
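A home-grown benchmark doesn't need to be elaborate. The sketch below is one possible shape for it, where the customer-service scenarios, the `must_mention` checks and `ask_llm` are all illustrative placeholders:

```python
# A minimal sketch of a home-grown benchmark: a handful of your own scenarios,
# each with phrases an acceptable answer should mention.
def ask_llm(prompt: str) -> str:
    """Placeholder: swap in a real call to the model under evaluation."""
    return "You can request a refund within 30 days via the returns portal."

benchmark = [
    {"question": "How do I get a refund?", "must_mention": ["30 days", "returns portal"]},
    {"question": "What are your support hours?", "must_mention": ["9am", "5pm"]},
]

results = []
for case in benchmark:
    answer = ask_llm(case["question"]).lower()
    passed = all(phrase.lower() in answer for phrase in case["must_mention"])
    results.append(passed)

print(f"Passed {sum(results)}/{len(results)} scenarios")
```

Keyword checks like this are deliberately crude; as your benchmark matures you could swap in human review or an LLM-as-judge step to grade answers more fairly.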