May 17, 2023

Low Rank Adaptation: A technical deep dive

So what is Low-Rank Adaptation (LoRA) exactly?

In today’s fast-paced technological landscape, large AI models are propelling breakthroughs across diverse domains. However, tailoring these models to specific tasks or datasets can be a computational and resource-intensive endeavor. Enter LoRA (Low Rank Adaptation) — a groundbreaking and efficient fine-tuning technique that harnesses the power of these advanced models for custom tasks and datasets without straining resources or incurring excessive costs.

LoRA had suddenly taken the AI community by storm (Fig 1). In this blog post, we’ll delve into the reasons behind its meteoric rise. We’ll explore the principles underpinning LoRA, its effectiveness in various domains, and the impact it’s having on the open-source community.

‍

*Fig 1: Popularity of the term LoRA over the last 12 months within the Computer Science category. (Source*)

‍

Whether you’re an AI aficionado, an engineer seeking to capitalize on large models for your specific business challenge, join us on this captivating journey to discover how LoRA is transforming the fine-tuning step in the FMOps (Foundation Model Ops) pipeline for large AI models.

‍

Understanding linear algebra in Low-Rank Adaptation

Before diving into LoRA, let’s review some fundamental linear algebra concepts. If you’re comfortable with the basics of linear algebra (particularly matrix rank), feel free to bypass the math below.

Matrix Rank

The rank of a matrix is the dimension of the vector space generated by its columns, which is given by the number of linearly independent columns (or rows) in a given matrix. It can be proven that the number of independent columns (known as column rank) is always equal to the number of independent rows (called row rank). Hence, for a matrix A with m rows and n columns (represented as Aₘₙ),

‍

Types of Matrices

Based on its rank, a matrix can be primarily classified into two types.

Full-Rank Matrix

A matrix Aₘₙ is called a full-rank matrix if rank(A) = min(m, n). The matrix shown below is an example of a full rank matrix.

Rank-Deficient Matrix

The opposite of a full rank matrix is rank deficient i.e. rank(A) < min(m, n). The rank-deficient matrix shown below has a rank of 1, as the columns (or rows) of the matrix are not linearly independent of one another.

Low-Rank Matrix: A rank-deficient matrix Aₘₙ is called a low-rank matrix if its rank is significantly lower (no fixed threshold) than the minimum number of rows and columns. Mathematically, rank(A) << min(m, n).
‍

Relevant Properties

1. As previously mentioned, the rank of a matrix is constrained by the minimum of its number of rows and columns.

Prop 1: Rank of matrix is constrained by the minimum of its number of rows and columns.

2. Given matrices A and B with rank(A) = m and rank(A) = n, then

Prop 2: Rank of the product of two matrices is constrained by the minimum of their individual ranks.

An intuitive understanding of Matrix Rank: For the purposes of this blog, the rank of a matrix can be perceived as the dimensions of the feature space it represents. In this context, a low-rank matrix of a certain size encapsulated fewer features (or a lower dimensional feature space) than a full-rank matrix of the same dimensions.

‍

Rank Decomposition

Rank decomposition or factorization of a matrix Aₘₙ is the factorization of A of the form A = CₘᵣFᵣₙ where rank(A) = r. It can be proven that every (finite) matrix has a rank decomposition (proof). Techniques like SVD (Singular Value Decomposition) can be used to construct such a decomposition.

With that, we’ve covered the necessary background concepts. Let’s dive right into LoRA and explore how it leverages these principles in the context of fine-tuning large AI models.

‍

Low-Rank Adaptation of Large (Language) Models

LoRA is an efficient finetuning technique proposed by Microsoft researchers to adapt large models to specific tasks and datasets. While the paper uses GPT-3 as the test case and focuses on language models and NLP tasks, this technique is quite generalizable, as we will see below. It can be applied to various models in multiple contexts.

Hypothesis

Many previous works have shown that over-parametrized large models reside on a low intrinsic dimension. The main idea behind LoRA is that the change in weights during model adaptation also has a low intrinsic rank/dimension. Concretely, if Wₙₖ represents the weights of a single layer and ΔWₙₖ represents the change of weights during model adaptation, the authors propose that ΔWₙₖ is a low-rank matrix i.e.

Why does this make sense?

Large models are trained to capture the general representation of their domain (language for LLMs, audio + language for models like Whisper, and vision for image generation models). These models capture a variety of features which allow them to be used for diverse tasks with reasonable zero-shot accuracy. However, when adapting such a model to a specific task or dataset, only a few features need to be emphasized or re-learnt. This means that the update matrix (ΔW) can be a low-rank matrix.

Method

The technique constrains the rank of the update matrix ΔW using its rank decomposition. It represents ΔWₙₖ as the product of 2 low-rank matrices Bₙᵣ and Aᵣₖ where r << min(n, k). This implies that the forward pass of the layer, originally Wx, is modified to Wx + BAx (as shown in the figure below). A random Gaussian initialization is used for A and B is initially to 0, so BA=0 at the start of training. The update BA is additionally scaled with a factor α/r.

Fig 2: Modified forward pass using low-rank decomposition.

Practical Benefits
‍

Reduction of training time and space: Using the technique shown above, r(n + k) parameters have to be tuned during model adaption. Since r << min(n, k), this is much lesser than the number of parameters that would have to be tuned otherwise (nk). This reduces the time and space required to finetune the model by a large margin. Some numbers from the paper and our experiments are discussed in the sections below.
No additional inference time: If used in production, we can explicitly compute W’ = W + BA and store the results, performing inference as usual. This guarantees that we do not introduce any additional latency during inference.
Easier task switching: Swapping only the LoRA weights as opposed to all the parameters allows cheaper and faster switching between tasks. Multiple customized models can be created and swapped in and out easily.

However, the drawback here is that once the weights are merged to remove additional inference time, the ease of task switching vanishes. Also, it is not straightforward to batch inputs to different tasks with different A and B in a single forward pass. You win some and you lose some, right?
‍

Evaluating the effectiveness of LoRA in practice

Having discussed how the technique works and its possible benefits, let’s now explore its efficacy. In the paper, the authors evaluate the performance of LoRA (using RobBert, GPT2, and GPT3) against full finetuning and other parameter/compute efficient techniques. They find that LoRA generally outperforms other efficient finetuning techniques by a significant margin while also providing comparable or better performance than full finetuning. For complete details of their analysis, interested readers can refer to the paper.

To further explore its effectiveness, we conducted additional experiments in various domains and tasks. In the following subsections, we discuss the results of these experiments, showcasing the versatility and robustness of the LoRA method.

Finetuning Whisper-Large-v2 on Common Voice (NL)
‍

Whisper is an ASR system which has been trained on a large corpus of data. It is a family of models, each of varying sizes. The smallest model (Whisper-Tiny) contains 39 million parameters whereas the largest model (Whisper-Large-v2 ) contains 1.5 billion parameters. The largest model can perform multilingual automatic speech recognition (ASR). However, its performance can be improved for a particular language by fine-tuning it with data from that language.

In this experiment, we finetune the model using the Dutch language subset of the Common Voice Dataset. We finetune the large model (with & without LoRA) normally and in a low data regime. Finetuning the model using LoRA with r=32 (where r is the rank of the update matrix) reduces the number of tunable parameters to 15.7 million, which is 1% of the parameters of the entire model.

‍

Low Data Regime: Using 1 hour of audio data

We finetune Whisper-Large-v2 with and without LoRA on one hour of audio from the Dutch subset of Common Voice. The evaluation results are shown in the table below.

*Table1: Comparison of LoRA and full fine-tuning in a low data regime*

We see that the performance of the model finetuned using LoRA is similar to the performance of the fully finetuned model. However, as mentioned previously, LoRA allows us to accomplish this in a much shorter time by tuning a minuscule number of parameters (for size, the LoRA checkpoint is only 60MB). Using LoRA (and 8bit optimization), the training took ~4 hours and cost < $5on an Nvidia T4 on Google Cloud.
‍

Using a large dataset

In this case, we finetuned the same model with the entire Dutch subset of Common Voice (~40 hours). The evaluation results are shown in the table below.

*Table 2: Comparison of LoRA and full fine-tuning using the entire common voice dataset. The full fine-tuning results for the Large-v2 and Medium models are sourced from the HF leaderboard* *(source)*.

We see that the performance of the large model finetuned with LoRA (for 5000 steps) is comparable to the performance of the fully finetuned large and medium models. However, using LoRA (and 8bit optimization), the finetuning took ~10 hours and cost < $10 on an Nvidia T4 on Google Cloud.

We’ve explored the world of Whisper and how it performs when finetuned with datasets of varying sizes in another blogpost. You can find it here!

‍

Adapting LLaMA to perform a dialogue summarization task

LLaMA is a large language model released by the researchers at Meta. Like Whisper, it is a family of models with varying sizes (7B being the smallest and 65B the largest).

We finetuned the 7B parameter model on a dialogue summarization task using the Samsum dataset and used ROUGE to evaluate the finetuned model. To test the effectiveness of LoRA at low ranks, the rank of the update matrix was constrained to 4 i.e. r=4. This meant that the number of tunable parameters was 2 million, which is just 0.03% of the total number of model parameters.

From the table, below we see that it outperforms a fully finetuned Flan-T5-Base model (250 million parameters). Additionally, using a low rank and 8-bit optimization allows us to finetune such a large model on a single Nvidia-T4!

*Table 3: Low rank fine-tuning of LLaMA on the SamSum dataset. The score for Fan-T5-Base is obtained from* *here*.

Note: This isn’t a fair standalone comparison between models. LLaMA-7B is a much larger foundation model in comparison to Flan-T5-Base and therefore is probably capable of better zero-shot performance on many tasks. However, this comparison aims to demonstrate that for large foundation models, using a very low rank (and thus low compute and time for finetuning) suffices.

The results from the paper and our experiments demonstrate LoRA’s effectiveness. LoRA provides a compute and parameter efficient method to finetune foundation models without a significant drop in the performance saving both time and money!

LoRA in the open-source community

Let’s now look at how LoRA is being used in the open-source community. With the recent explosion of large foundation and generative AI models, the open-source community has welcomed LoRA with open arms due to its ability to allow low-resource practitioners to adapt large models. Here, LoRA is primarily being used for two major purposes: Instruct-tuning LLMs and Finetuning Diffusion models.
‍

Instruct-tuning Large Language Models

With the launch of ChatGPT and techniques like Self-Instruct, the OSS community has been steadily working on tuning large language models to follow instructions.

The core idea here is simple. Create a dataset of instructions and responses (either using manual curation or ChatGPT) and use LoRA to finetune a pre-trained large language model using this dataset. This method produces models that are reasonably adept at following instructions and answering questions like humans. Interested readers can check out models such as Alpaca-LoRA and Vicuna.

*Vicuna* *responding to a user's question regarding a holiday in Hawaii (Source*).

Fine-tuning Stable Diffusion

Before the launch of ChatGPT and other LLMs like LLaMA, LoRA was primarily used to tune stable diffusion to adapt the style of generated images. The LoRA weights can then be used and shared in a plug-and-play fashion switching them out when a different image generation style is necessary.

As seen before, the main draw of this technique is its parameter and compute efficiency. A testimony to the popularity of this method in the generative AI community is the existence of the Lora Library where people can share their Lora files!

‍

Conclusion

To sum it all up: LoRA has two major applications. The first is to finetune large models with low compute, and the second is to adapt large models in a low-data regime. Results from the paper, our experiments and the widespread adoption by the open-source AI community demonstrate its value in the current foundation-model-driven AI environment.

It democratizes AI, empowering individuals and organizations to use and tune large foundation models without breaking the bank, ensuring that the ability to adapt these models is not just in the hands of a select few!

‍