Designing LLM curricula
In a previous post, we discussed the need to examine an LLM's potential using benchmarks and arena leaderboards. This post will explore the idea of designing a curriculum to train an LLM to excel at your desired task.
Training, validation and benchmark
A helpful way to think about this is to use an example most of us have been through: high school exams. In England, the exams you sit at 16 are called GCSEs and at 18, A-Levels.
The school system teaches the content required to sit the exam. The teaching includes examples, explanations, practice, feedback, and help understanding common mistakes and techniques for doing well. This is all training data.
Teachers then often use past papers to assess knowledge, skills and capabilities so they can identify areas of weakness. This is validation.
Finally, students sit the high-stakes exams. These are a way to benchmark every student on what they’ve learned.
Each of these will require its own curated dataset. These are the training, validation and benchmark datasets. You can learn more about benchmarking for LLMs in a previous post.
Keep the datasets separate
You don’t want your benchmark datasets ever to be used in training or validation; that leads to overfitting. If your model gets the right answer, it might just be playing back the data you used in training, rather than having the skills and knowledge to perform the task in a more generalised way.
This is also true when it comes to training and validation. Don’t use validation data in training.
It’s hard to keep everything completely independent. Simply by trying to improve performance against your validation and benchmark datasets, you will skew your training data towards those outcomes.
It’s essential that the validation and benchmark datasets are a good representation of the skills and knowledge that you want your model to demonstrate.
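As a minimal sketch of keeping the splits apart (assuming your curated examples live in a single, hypothetical examples.jsonl file), you might carve out the three datasets once, up front, and write each to its own file so they never mix:

```python
import json
import random

# Load every curated example from a single file (hypothetical path).
with open("examples.jsonl") as f:
    examples = [json.loads(line) for line in f]

# Shuffle once with a fixed seed so the split is reproducible.
random.seed(42)
random.shuffle(examples)

# Hold out the benchmark set first, then validation; the rest is training.
n = len(examples)
benchmark = examples[: int(0.1 * n)]
validation = examples[int(0.1 * n) : int(0.2 * n)]
training = examples[int(0.2 * n) :]

# Write each split to its own file so it never leaks into another stage.
for name, split in [("train", training), ("validation", validation), ("benchmark", benchmark)]:
    with open(f"{name}.jsonl", "w") as f:
        for example in split:
            f.write(json.dumps(example) + "\n")
```

The exact proportions are a judgment call; what matters is that the benchmark file is set aside before any training or validation happens.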
Thinking about norms, knowledge, skills and capabilities
Fine-tuning can be used to solve a few problems. The first is about norms.
Norms are about the language you might use. Walmart refers to employees as Associates. Disney refers to customers as Guests. These are the norms for them, but not necessarily for other retailers or entertainment companies. LLMs capture norms from their training data. If most people talk about employees, then the LLM will more likely talk about employees, not Associates. Every industry and business has its own lingo. Most domains have specific language, abbreviations, acronyms and terms. LLMs can be fine-tuned to respond better using the language of your choice. You might also want to ensure that your spoken language is well represented in your choice of LLM.
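A handful of fine-tuning examples for norms might look something like the sketch below. The wording is purely illustrative, and the chat-message layout is one common format for instruction tuning; the point is simply to pair everyday phrasing with the terms you want the model to prefer.

```python
# Illustrative fine-tuning examples that teach a retailer's preferred terms
# ("Associate" for employee, "Guest" for customer). Wording is hypothetical.
norm_examples = [
    {
        "messages": [
            {"role": "user", "content": "Which employee should I speak to about a return?"},
            {"role": "assistant", "content": "Any Associate at the service desk can help a Guest with a return."},
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "How do customers join the loyalty programme?"},
            {"role": "assistant", "content": "Guests can join the loyalty programme at the till or online."},
        ]
    },
]
```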
The second reason might be knowledge. LLMs capture a lot of information in their models, often combining knowledge and language rules. LLMs can extend their knowledge using RAG (retrieval-augmented generation) methods, which can make up-to-date and proprietary information available. But it’s sometimes easier to train the model with a base set of knowledge and rules that can still be extended by RAG. This is a choice based on the size of the context window and the amount and volatility of the information that needs to be searched and supplied. We will cover this in more detail in a future post.
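As a rough sketch of the trade-off (the retrieve and generate functions here are placeholders, not any specific library’s API), RAG keeps the knowledge outside the model’s weights and injects it into the prompt at query time:

```python
def answer_with_rag(question: str, retrieve, generate, k: int = 3) -> str:
    """Answer a question by retrieving supporting passages and prompting the model.

    `retrieve` and `generate` stand in for whatever search index and LLM client
    you use; this only illustrates where the knowledge comes from.
    """
    # Pull the k most relevant passages from an external store at query time.
    passages = retrieve(question, k=k)

    # The model's weights supply language and reasoning; the passages supply
    # the up-to-date or proprietary facts that were never fine-tuned in.
    prompt = (
        "Answer the question using only the context below.\n\n"
        + "\n\n".join(passages)
        + f"\n\nQuestion: {question}"
    )
    return generate(prompt)
```

Fine-tuning bakes a base set of knowledge into the weights instead; in practice the two are often combined.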
The final reason is more about the skills and capabilities of the LLM. Is the LLM a good conversationalist? Does it know when to say ‘I don’t know’? Does it know how to escalate a query to a human? Is it good at writing emails? Or summarisation tasks? Or classification tasks? Is it good at logic or solving complex problems?
Each of these is a task you might train for. In this way, you can think of the LLM as an employee. You need to assess what they’re good at, where the gaps are, and how best to fill them.
A focus on high-quality data
One of the most important breakthroughs in Llama 3 is how its training data was curated. Llama 3 is the best-performing model in its class, and the data is considered one of the primary reasons.
You can find some high-quality fine-tuning datasets curated here (thank you, Maxime Labonne, for creating this resource!).
These can be used to improve open foundation models so that they perform better in the areas you need.
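Most of these datasets can be pulled straight from the Hugging Face Hub with the datasets library. The repository name below is a placeholder; substitute whichever set from the curated list matches the norms, knowledge or skills you are targeting.

```python
from datasets import load_dataset  # pip install datasets

# "your-org/your-finetuning-set" is a placeholder, not a real repository:
# swap in the dataset you have chosen from the curated list.
dataset = load_dataset("your-org/your-finetuning-set", split="train")

# Inspect the schema and a few rows before committing to a fine-tuning run.
print(dataset)
print(dataset[0])
```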
Thinking about curriculum
The most important thing, though, is to start building datasets that train, validate, and benchmark your models for your tasks. To do this, it’s best to think of the training as a curriculum.
This might include the knowledge you want the model to be able to recall well. Think about how Maths is split into Number, Geometry, Algebra, Trigonometry etc.
It should also include the sorts of tasks and processes you want the model to follow. Let’s say the LLM must be able to hold a customer conversation that helps a user log in to their account, but the customer starts to talk about a refund. If this happens, you want the model to escalate to a manager. This is the sort of process you can train by example, as sketched below.
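A single curriculum example for that process might look like the following. The wording is purely illustrative; the point is that the desired behaviour, escalating when the conversation drifts into a refund, is demonstrated in the final assistant turn.

```python
# One illustrative curriculum example (hypothetical wording) that teaches the
# escalation process by example: the conversation starts as a login query,
# drifts into a refund, and the assistant hands over to a manager.
escalation_example = {
    "topic": "account_login",
    "messages": [
        {"role": "user", "content": "I can't log in to my account."},
        {"role": "assistant", "content": "Let's reset your password. I've sent a reset link to your registered email address."},
        {"role": "user", "content": "Thanks. While I have you, I also want a refund on my last order."},
        {"role": "assistant", "content": "Refunds need a manager's approval, so I'm escalating this conversation to a manager who will take it from here."},
    ],
}
```

Collect enough examples like this across each topic and process in your curriculum, and you have the makings of your training, validation and benchmark datasets.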