February 20, 2024

Large Language Models: to Fine-tune or not to Fine-tune?

Should we fine-tune a LLM for this use case? Or consider other techniques?

At ML6 that may very well be the #1 question we get from customers regarding the use of large language models (LLMs).

In short the choice of technique boils down to a difference in ambitions.

Defining Ambitions and techniques

Ambition: adjusting the behaviour of your model for a specific use case
→ Technique: Fine-Tuning, Few-Shot or Zero-Shot Prompting
→ Example: your model needs to provide answers in the specific style of Shakespeare.

Ambition: adjusting the knowledge your model has access to
→ Technique: Retrieval-Augmented Generation (solution architecture)
→ Example: your model needs to provide answers on what type of steel is needed for specific construction components.

Note that these ambitions are not exclusive alternatives; for many use cases you might want to change both the model’s behaviour ánd make sure your model has access to the right information.

Behaviour: Shakespeare; Knowledge: On design of steel and steel-concrete composite bridges

To those who now think: “Hey but can’t I also add knowledge through fine-tuning? Shouldn’t I fine-tune an LLM using my private knowledge base?”. To them we might say: “Hey, thanks for that spontaneous remark. Well yes that may be possible but it’s probably not the most efficient way of adding knowledge nor is it truly transparant nor manageable.” If you are specifically interested in having an LLM access specific knowledge in a maintainable way, we point you towards our post on leveraging LLMs on your domain-specific knowledge base.

To those who didn’t have that remark, we say “well than on we go now, shall we”. But not before taking in some inspiring, introductory words from Billiam Bookworm.

In this post, we’ll walk you through an understanding of fine-tuning and empower you with a tool with which you can make well-founded decisions. Of course this tool is the one that has ruled all tools since way before the computer was invented; the flowchart. Maybe times aren’t a-changin’ ?
‍

Understanding the Fine-Tuning process

To understand what can and cannot be achieved through fine-tuning, we must on some level understand what this process actually refers. What do we start from and how are we impacting the model? We warn you that this section my be a bit technical but crucial for a good understanding of fine-tuning.

In summary, a large language model is built through three distinct steps :

Unsupervised Learning (UL)

‍Data: low-quality (typically scraped internet data), ±1 trillion “words”
Process: optimised for text completion (predicting next “word”)
Result: a behemoth so monstrous it makes Medusa turn to stone

Supervised Fine-Tuning (SFT)

‍▹Data: 10k-100k “words” of curated [prompt, response] examples
▹Process: model is taught to behave based on input/output samples
▹Result: a behemoth with a somewhat acceptable face that you can stand to look at

Reinforcement learning from human feedback (RLHF)

▹Data: 100k-1M comparisons [prompt, won_response, lost_response]
▹Process: optimise the model to respond in the way that humans prefer
▹Result: a behemoth with a smiley face that you would want to go for drinks with

The development of an LLM visualised as the beast Shoggoth (courtesy of Helen Toner’s post)

The explanation above should give you a first grasp of what fine-tuning does and why it’s needed.
‍

To further strengthen your understanding, let’s look at an example:

🎸So you’ve got the knowledge, but have you got the touch? 🎸@unsupervised_behemoth

The example above shows how depending on Unsupervised Learning alone falls short. The model may have gained a lot of knowledge but it doesn’t know how to wield it. For a model that merely predicts the next words, a question may be the most likely continuation of a previous question, because when learning from its heaps of low quality data it probably came across quite some tests and subsequent questions.

But fear not, for Supervised Fine-Tuning swoops in to save the day! After gathering tons of knowledge from low quality data, the SFT process aims to get the behaviour of the model right. And it does this by demonstrating example behaviour to the model and optimising it to replicate that. From this the model will learn to understand “if I am asked a question, apparently I have to try and formulate an answer as a response”.

The above is traditional supervised learning. What seems to really unleash the full potential of these models is the RLHF step. We won’t go too much into detail but this process aims to specifically guide the model to behave in ways that people have indicated to prefer (giving way to the name: reinforcement learning from human feedback). Note that to enable the gains from RLHF, a Reward Model needs to first be built which calculates reward scores for given responses. And that requires an extensive amount of labeling and engineering work.

Luckily, when it comes to impacting model behaviour, SFT is the crucial step. It exemplifies how we want the model to behave, RLHF then further refines that because it’s easier for us humans to just show you what we prefer rather then explaining it through examples ourselves.

Now, make no mistake. Preparing the necessary data to perform SFT is no easy feat. In fact, for the development of GPT-3, the team at OpenAI relied on the input of freelancers to provide labeled data (for both the SFT and RLHF process). Because they understand the importance of this task, they made sure to choose labelers that were well educated. This was shown in the results of a survey carried out by OpenAI as part of their paper on “Learning to summarise from human feedback”.

Extract from appendix C on “Human data collection details” for this OpenAI paper

In the last section of this post we will zoom-in on what you need to actually perform the fine-tuning process. But first let us present you with a practical guide of deciding when to reach for SFT and when you can consider moving on without it.
‍

A practical guide to support your choice

As you may know, even closed source models (where you have no actual access to the model parameters themselves) may allow you to fine-tune them through an API. In those regards, OpenAI (23th August) released fine-tuning of GPT-3.5. This further extends the possibilities for the large public to get into modelfine-tuning. Thus hyping up the question: when should you actually go for it (as proven by the frequentliestic asked question for that fine-tuning API).

OpenAI #1 FAQ question regarding fine-tuning

The question? No clue. The answer? Some flowchart probably.

Below we present the flowchart that should help you guide the troubled waters of making well-founded LLM choices.

Do you need like a “NO”-flow for a flowchart to be an actual flowchart? Probably not, right? And if so, what should we call it instead?

Note that we explicitly distinguish knowledge and behaviour.
On the behaviour side of things, in line with what Andrej Karpathy stated back in May, we would suggest following approach in maturing your LLM use case:

Test the waters of your use case through: Zero-Shot Prompting
Explore what’s already possible if you collect a bit of data to demonstrate the behaviour you are looking for: Few-Shot Prompting
If Few-Shot Prompting isn’t cutting it, you can look into actually changing the model itself by providing it [prompt, response] samples: Supervised Fine-Tuning

Now imagine you chose a few-shot prompting approach and it works great but you have a gigantic amount of requests and things are becoming pricey? Then perhaps it may be interesting to host the LLM yourself and fine-tune it to reduce the amount of words pushed through the system with each call.

In terms of cost-efficiency considerations, we present a flowchart with an actual “NO”-flow.

Or what if you have a task that is so simple that you can easily get the right behaviour through zero-shot prompting but the cost is simply too high per task? Once again, self-hosting a much smaller, fine-tuned LLM may be the way to go. For more insights in that, we refer to our blogpost on the emerging space of foundation models in general.

‍

Concrete examples and use cases

In the style of a classic schoolkid bragging throwdown, we will walk you through some actual examples to demonstrate the intuition.

1. “You know MY company wants to use a Large Language Model to send a personalised welcome message to each new employee “

For this one, simple few-shot learning with some examples of nicely styled welcome messages combined with a template that loads the information for that employee, should do fine.

2. “Oh yeah? Well MY company wants to use an LLM to produce a set of typical navigation messages in the style of Severus Snape.

Depending on how much of a caricature this Snape🪄 guy actually is, you might get away with few-shot learning here. If his language style however is so creative that even 50 examples of Snape interactions won’t cut it, you might have to plunge into SFT with a more extensive dataset.
“Turn left. Do nót disappoint me”.

3. “Oh please, MY company wants to create a chatbot that sarcastically answers every general English question that a user asks ”

Modern LLMs understand language play to a sufficient extent for this chatbot to be built purely on zero-shot prompting.

4. “Wait till you hear this. MY company wants to create a chatbot that sarcastically answers évery East-Asian question that a user asks in the same language.”

Boy o boy will you have a hard time gathering enough data on all of those languages to sufficiently capture how sarcasm is typically transmitted. If you manage to get that data together; the doors of supervised fine-tuning will open for you. Note however that if your model has 0 ingoing knowledge of those languages, good performance will still be unattainable and you might have to wait for a gigantic dataset that allows a hero to leverage unsupervised learning to have a model get the hang of those more exotic languages.

5. “Is that all? MY company wants to leverage an LLM that, for a certain support ticket, automatically determines the support team to handle it. We have over 1000 support tickets every hour.”

Classification tasks (such as this routing one) have been around since the dawn of Machine Learning. Classically you would train a specific model for this and perhaps that is still your cheapest option but sure an LLM should also be more than capable enough. Depending on the complexities (range of questions topics, amount of support teams, input languages,…), we would expect this to work fine in a few-shot learning approach. Note however that because of the high throughput mentioned here, looking into self-hosting might be worth it in terms of cost-efficiency.

6.“Hold my juice box, MY company wants to build a model that doesn’t even care about routing the question, it just answers the support question straight away! ”

Aha, but to answer these questions you need knowledge right? A RAG-architecture (ad-hoc supplying the model with the relevant information) with few-shot learning to ensure adequate behaviour should be sufficient to enable this use case. Again, self-hosting deserves your consideration if there is a high demand for this model.
‍

Fine-tuning LLM's: open source or closed source?

This is definitely a pertinent question. If you finally decide that indeed you need to fine-tune an LLM, you have another choice to make: fine-tuning an open source model and hosting it yourself or, if available, use fine-tuning APIs provided by closed source model providers.

Fine-tuning open source models

Fine-tuning an open source model (e.g. Meta’s Llama 2 or TII’s brand new gigantic Falcon 180B) and hosting it yourself offers you some great advantages that typically come with full ownership.

Advantages of fine-tuning open source models:

Full control over privacy both during training and during use.
No external dependencies: you provide the model, you maintain it and you determine when and how changes to the model should be made.
Full code transparency to the extent that you have access to the entire codebase and the specific implementation of supervised fine-tuning.
Complete freedom in model choice. Because you can fine-tune whatever open-source model; you can pick the size that best suits your use case in order to optimise for cost, latency and processing power.

Disadvantages of fine-tuning open source models:

Time, cost and knowledge needed to host an LLM yourself and to set-up a fine-tuning pipeline on appropriate infrastructure.
Cost related to keeping things up and running yourself. You need a certain throughput to justify keeping a model highly available at all times.

For some basic insights on how to approach the fine-tuning process itself (i.e. changing the model weights), we recommend following summary on Parameter Efficient Fine-Tuning.

‍

Fine-tuning closed source models

As mentioned, some closed source model providers offer an API for fine-tuning (e.g. OpenAI’s fine-tuning API for GPT-3.5).

Advantages of fine-tuning closed source models:

Quick development : the effort is restricted to preparing your supervised fine-tuning dataset and offering it to the API.
No self-hosting complexities/costs: weirdly enough, OpenAI doesn’t charge you for keeping your fine-tuned “model” available for use.

Disadvantages of fine-tuning closed source models:

No transparency: many unanswered questions. Which techniques (such as the ones referred to above) are actually being applied? How are the model parameters influenced? Quite probably this API-based fine-tuning is very lightweight and the ceiling for the quality limit is likely a lot lower than when you handle fine-tuning yourself.
Classic closed source limitations: you pay per “word” and there is no getting around that, you are completely dependent on the provider, your data is leaving your environment over the API and so on.

‍

So it’s data? It was data all along? Yes. Always.

‍

Whichever approach you may choose, it should be obvious that what will directly impact the eventual performance of your fine-tuned model, is the data used to perform the fine-tuning.

Given this importance of the quality of your training data — for now and for the future — your main focus should be on setting up highly qualitative, reusable data pre-processing components. Open source initiatives such as Fondant aim to achieve exactly that: powerful and cost-efficient controlling of model performance through quality data. You can read more about that in this Blogpost on Foundation Models.

‍

Our conclusion on fine-tuning LLM's

What we've discussed:

The difference between Retrieval-Augmented Generation (knowledge impacting) techniques & Fine-Tuning (behaviour impacting) techniques (+ why it’s perfectly defendable to combine both techniques)
A high-level interpretation of what fine-tuning actually means in the LLM space (typically: supervised fine-tuning)

And then of course we also covered the flagship of this story:

A “flowchart” to help you think about when to apply which technique

‍

The final verdict:

Ultimately, we emphasised that when supervised fine-tuning is indeed the appropriate approach: you should put your eggs in the basket of quality data. That will enable valuable use cases for now ánd for the future.

For more flagship flowcharts and LLM news, stay tuned.
Or, even better, stay fine-tuned. 🥁

‍