Designing LLM curricula
In a previous post, we discussed the need to examine an LLM's potential using benchmarks and arena leaderboards. This post will explore the idea of designing a curriculum to train an LLM to excel at your desired task.
Training, validation and benchmark
A helpful way to think about this is to use an example most of us have been through: high school exams. In England, the exams you sit at 16 are called GCSEs and at 18, A-Levels.
The school system teaches the content required to sit the exam. The teaching includes examples, explanations, practice, feedback, and help understanding common mistakes and techniques for doing well. This is all training data.
Teachers then often use past papers to assess knowledge, skills and capabilities so they can identify areas of weakness. This is validation.
Finally, students sit the high-stakes exams. These are a way to benchmark every student on what they’ve learned.
Each of these will require its own curated dataset. These are the training, validation and benchmark datasets. You can learn more about benchmarking for LLMs in a previous post.
Keep the datasets separate
You don’t want your benchmark datasets ever to be used in training or validation; that leads to overfitting. If your model gets the right answer, it might just be playing back the data you used in training, rather than having the skills and knowledge to perform the task in a more generalised way.
This is also true when it comes to training and validation. Don’t use validation data in training.
It’s hard to keep everything completely independent. Simply by trying to improve performance against your validation and benchmark datasets, you will skew your training data towards those outcomes.
It’s essential that the validation and benchmark datasets are a good representation of the skills and knowledge that you want your model to demonstrate.
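As a minimal sketch of keeping the splits apart (assuming your curated examples live in a single, hypothetical examples.jsonl file), you might carve out the three datasets once, up front, and write each to its own file so they never mix:

```python
import json
import random

# Load every curated example from a single file (hypothetical path).
with open("examples.jsonl") as f:
    examples = [json.loads(line) for line in f]

# Shuffle once with a fixed seed so the split is reproducible.
random.seed(42)
random.shuffle(examples)

# Hold out the benchmark set first, then validation; the rest is training.
n = len(examples)
benchmark = examples[: int(0.1 * n)]
validation = examples[int(0.1 * n) : int(0.2 * n)]
training = examples[int(0.2 * n) :]

# Write each split to its own file so it never leaks into another stage.
for name, split in [("train", training), ("validation", validation), ("benchmark", benchmark)]:
    with open(f"{name}.jsonl", "w") as f:
        for example in split:
            f.write(json.dumps(example) + "\n")
```

The exact proportions are a judgment call; what matters is that the benchmark file is set aside before any training or validation happens.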
Thinking about norms, knowledge, skills and capabilities
Fine-tuning can be used to solve a few problems. The first is about norms.
Norms are about the language you might use. Walmart refers to employees as Associates. Disney refers to customers as Guests. These are the norms for them, but not necessarily for other retailers or entertainment companies. LLMs capture norms from their training data. If most people talk about employees, then the LLM will more likely talk about employees, not Associates. Every industry and business has its own lingo. Most domains have specific language, abbreviations, acronyms and terms. LLMs can be fine-tuned to respond better using the language of your choice. You might also want to ensure that your spoken language is well represented in your choice of LLM.
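A handful of fine-tuning examples for norms might look something like the sketch below. The wording is purely illustrative, and the chat-message layout is one common format for instruction tuning; the point is simply to pair everyday phrasing with the terms you want the model to prefer.

```python
# Illustrative fine-tuning examples that teach a retailer's preferred terms
# ("Associate" for employee, "Guest" for customer). Wording is hypothetical.
norm_examples = [
    {
        "messages": [
            {"role": "user", "content": "Which employee should I speak to about a return?"},
            {"role": "assistant", "content": "Any Associate at the service desk can help a Guest with a return."},
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "How do customers join the loyalty programme?"},
            {"role": "assistant", "content": "Guests can join the loyalty programme at the till or online."},
        ]
    },
]
```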
The second reason might be knowledge. LLMs capture a lot of information in their models, often combining knowledge and language rules. LLMs can extend their knowledge using RAG (retrieval-augmented generation) methods, which can make up-to-date and proprietary information available. But it’s sometimes easier to train the model with a base set of knowledge and rules that can still be extended by RAG. This is a choice based on the size of the context window and the amount and volatility of the information that needs to be searched and supplied. We will cover this in more detail in a future post.
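As a rough sketch of the trade-off (the retrieve and generate functions here are placeholders, not any specific library’s API), RAG keeps the knowledge outside the model’s weights and injects it into the prompt at query time:

```python
def answer_with_rag(question: str, retrieve, generate, k: int = 3) -> str:
    """Answer a question by retrieving supporting passages and prompting the model.

    `retrieve` and `generate` stand in for whatever search index and LLM client
    you use; this only illustrates where the knowledge comes from.
    """
    # Pull the k most relevant passages from an external store at query time.
    passages = retrieve(question, k=k)

    # The model's weights supply language and reasoning; the passages supply
    # the up-to-date or proprietary facts that were never fine-tuned in.
    prompt = (
        "Answer the question using only the context below.\n\n"
        + "\n\n".join(passages)
        + f"\n\nQuestion: {question}"
    )
    return generate(prompt)
```

Fine-tuning bakes a base set of knowledge into the weights instead; in practice the two are often combined.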
The final reason is more about the skills and capabilities of the LLM. Is the LLM a good conversationalist? Does it know when to say ‘I don’t know’? Does it know how to escalate a query to a human? Is it good at writing emails? Or summarisation tasks? Or classification tasks? Is it good at logic or solving complex problems?
Each of these is a task you might train for. In this way, you can think of the LLM as an employee. You need to assess what they’re good at, where the gaps are, and how best to fill them.
A focus on high-quality data
One of the most important breakthroughs in Llama 3 is how its training data was curated. Llama 3 is the best-performing model in its class, and the data is considered one of the primary reasons.
You can find some high-quality fine-tuning datasets curated here (thank you, Maxime Labonne, for creating this resource!).
These can be used to improve open foundation models so that they perform better in the areas you need.
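Most of these datasets can be pulled straight from the Hugging Face Hub with the datasets library. The repository name below is a placeholder; substitute whichever set from the curated list matches the norms, knowledge or skills you are targeting.

```python
from datasets import load_dataset  # pip install datasets

# "your-org/your-finetuning-set" is a placeholder, not a real repository:
# swap in the dataset you have chosen from the curated list.
dataset = load_dataset("your-org/your-finetuning-set", split="train")

# Inspect the schema and a few rows before committing to a fine-tuning run.
print(dataset)
print(dataset[0])
```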
Thinking about curriculum
The most important thing, though, is to start building datasets that train, validate, and benchmark your models for your tasks. To do this, it’s best to think of the training as a curriculum.
This might include the knowledge you want the model to be able to recall well. Think about how Maths is split into Number, Geometry, Algebra, Trigonometry etc.
It should also include the sorts of tasks and processes you want the model to follow. Let’s say the LLM must be able to hold a customer conversation that helps a user log in to their account, but the customer starts to talk about a refund. If this happens, you want the model to escalate to a manager. This is the sort of process you can train by example, as sketched below.
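A single curriculum example for that process might look like the following. The wording is purely illustrative; the point is that the desired behaviour, escalating when the conversation drifts into a refund, is demonstrated in the final assistant turn.

```python
# One illustrative curriculum example (hypothetical wording) that teaches the
# escalation process by example: the conversation starts as a login query,
# drifts into a refund, and the assistant hands over to a manager.
escalation_example = {
    "topic": "account_login",
    "messages": [
        {"role": "user", "content": "I can't log in to my account."},
        {"role": "assistant", "content": "Let's reset your password. I've sent a reset link to your registered email address."},
        {"role": "user", "content": "Thanks. While I have you, I also want a refund on my last order."},
        {"role": "assistant", "content": "Refunds need a manager's approval, so I'm escalating this conversation to a manager who will take it from here."},
    ],
}
```

Collect enough examples like this across each topic and process in your curriculum, and you have the makings of your training, validation and benchmark datasets.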