Analysts believe that we are entering the industrial era of artificial intelligence. Foundation models (FMs) — large pretrained AI models that can easily be adapted to new use cases — are revolutionizing creative work and are expected to augment or take over ever more knowledge work in the coming years as ever more use cases in different industries are being tackled by FM-based AI.
Companies in the creative and knowledge industries are scrambling to develop a strategy as they sense that their business may well be in for a rollercoaster ride in the coming years. Customers are reporting to us that, for some tasks, their workers are already up to five times more productive using off-the-shelf generative AI tools, performing a week’s work in a day. No one wants to be left behind, but for many it is not clear where or how to start.
In this blogpost, we try to make sense of some of the developments we see occurring around us and propose a general strategy for thinking about and approaching the ongoing FM revolution. We will first look at FMs themselves and how they differ from what came before. After that we will look at MLOps and how it is giving way to Foundation Model Ops (FMOps), which is concerned with aligning model output rather than only with performance and stability. Overall, we believe that companies can set up their data and AI infrastructure now in such a way that they will be able to jump on any of the innovations that are bound to occur. The key will be to put solid internal data management in place and start optimizing internal processes. FMOps is key to setting this evolution in motion.
Gradually, then suddenly: Foundation Models
The term Foundation Model (FM) was coined in a 2021 report by researchers from Stanford University and defined as follows:
A foundation model is a machine learning model trained on broad data at scale such that it can be adapted to a wide range of downstream tasks
While most agreed that this shift was indeed taking place, many in the industry downplayed its significance, as these new models were initially confined to research labs and occasional demo applications. With the advent of models like ChatGPT, however, it has become clear to everyone that we are witnessing a fundamental paradigm shift. Previously, machine learning models were trained to do a specific task and then chained with other models and business logic to make decisions. FMs, by contrast, are multi-billion-parameter models pretrained on terabytes of often multimodal data (e.g. text and images) using gigantic amounts of compute (e.g. LLaMA: 118 GPU-years). Once pretrained, they are guided to perform complex tasks relatively independently.
The Stanford researchers point to emergence and homogenization as useful concepts for understanding the ongoing shift. As models grow larger and are trained on more data, they start to display emergent behavior. This means that, while they have been trained to perform a very simple task such as ‘predict the next word’ or ‘remove the noise from this image’, they develop complex behaviors in order to do so. These more complex patterns, like reading comprehension (resembling human reasoning) or drawing like Van Gogh (resembling human creativity), are never explicitly trained. They just emerge from learning to reconstruct the data.
Partially related to this is the fact that the shift towards foundation models comes with a tendency towards homogenization: less diversity in the models that are being used. As large models are expensive to train and can be adapted downstream to perform a range of tasks, in the future, the industry will likely rely on a limited number of foundation models driving a wide array of applications. This comes with certain risks regarding societal bias and misinformation. Hence, in the future one of the main challenges for machine learning professionals will be to align model behavior not just in terms of performance at certain tasks but also in terms of norms and values and human expectations in general.
Playing hard to get
In the past few years, several dozen FMs have been developed, most of which were generative AI models ‘translating’ from one modality to another, e.g. text to text (GPT), text to image (DALL-E), image to text (BLIP), speech to text (Whisper), text to 3D (DreamFusion), text to short video (Make-A-Video), text to longer video (Phenaki), video to video (Gen-1) and text to 3D video (Make-A-Video3D). Connecting text and images (CLIP) and segmentation (SAM) are two examples of other tasks that have been tackled by FMs.
These FMs are typically ‘released’ in one or several of three ways:
- Scientific paper: most FMs (by Meta, Google, Salesforce) are described in a scientific paper. Sometimes they are not made available in any other way which implies that they can only be used in applications when they are reimplemented based on the paper, e.g. by the open source community as in the case of Google Imagen / DeepFloyd IF.
- API access: paid or free API access through which you can interact with the FM, as is typical for the models by OpenAI. Sometimes there is also a possibility to fine-tune the model on custom data, also through an API. However, control is limited and prices can be steep.
- Open Source: the code for running and fine-tuning the model and its weights are made available and can be used relatively freely, e.g. models by Meta, Salesforce, Stability AI, Hugging Face, research institutes and open source organizations (e.g. LAION, Eleuther). The main issue to take into account here is the license under which the model is released, which can be restrictive (e.g. only for research) or permissive (also allowing commercial use).
As a growing number of FMs enter the market, it is unclear at this point which of these will become the dominant way of making models available. Competition between different cloud and model providers will play an important role in this respect, as will regulation, as exemplified by the recent amendments to the EU AI Act and the US Senate hearings. Based on current offerings and announcements from various cloud providers, the most likely scenario is that there will be a spectrum of configurations, from very limited control (simple prompting through an API) to fully open access to code and weights for customization and fine-tuning.
Choosing a Foundation Model
Previously, when building custom models, performance was determined by data availability (quantity and quality), architecture and hyperparameter tuning. Today, with FMs, we see it come down to two largely independent factors:
Base model performance: itself determined by
- Model size (number of parameters)
- Training duration
- Dataset size and quality
Fine-tuning performance: itself determined by
- Fine-tuning regimes (combination of): self-supervised, supervised, reward-based, …
- Dataset quality and size (possibly of multiple datasets)
Choosing a base model directly affects your system performance and running cost. Selecting a 33B parameter large language model for your setup will likely improve performance, but it will also require more expensive infrastructure. Interestingly, we see a tendency towards convergence of base models in terms of model architecture, size and even training dataset. Conceivably, in the future, we will end up with a range of base models which are very similar and which will compete with each other in areas other than performance such as price and licensing.
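To make the link between model size and infrastructure cost more concrete, here is a minimal back-of-the-envelope sketch in Python. It only estimates the memory needed to hold the model weights (parameter count times bytes per parameter) and deliberately ignores activations, caches and framework overhead, so real requirements will be higher. The function names and the GPU memory sizes used in the example are our own illustrative assumptions, not figures from any provider.

```python
import math

# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str = "fp16") -> float:
    """Approximate memory (in GB, 1 GB = 1e9 bytes) to hold the weights only.

    Hypothetical helper: ignores activations, caches and runtime overhead.
    """
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def min_gpus_for_weights(num_params: float, precision: str = "fp16",
                         gpu_memory_gb: float = 80.0) -> int:
    """Lower bound on the number of GPUs needed just to fit the weights.

    gpu_memory_gb is an assumption about the hardware (80 GB by default).
    """
    return math.ceil(weight_memory_gb(num_params, precision) / gpu_memory_gb)


if __name__ == "__main__":
    # Compare a few illustrative model sizes at different precisions.
    for num_params in (7e9, 13e9, 33e9):
        for precision in ("fp16", "int4"):
            gb = weight_memory_gb(num_params, precision)
            gpus = min_gpus_for_weights(num_params, precision, gpu_memory_gb=24.0)
            print(f"{num_params / 1e9:.0f}B params, {precision}: "
                  f"{gb:.1f} GB of weights, at least {gpus} x 24 GB GPU(s)")
```

Under these assumptions, a 33B parameter model stored in 16-bit precision needs roughly 66 GB for its weights alone, which is why moving to a larger base model quickly pushes you towards more expensive or multi-GPU hardware.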
Given this convergence of base models, we believe that the subsequent fine-tuning steps will become even more important determinants of downstream task performance, as we have already seen with ChatGPT, which went through multiple stages of supervised and reward-based fine-tuning. Fine-tuning itself will likely be further broken down into upstream fine-tuning by model providers and downstream fine-tuning on proprietary data and specific tasks by users. We therefore advise our customers to invest primarily in downstream fine-tuning and general alignment capabilities while keeping other options as open as possible.
Apart from performance and running cost, the main factor in choosing a foundation model is how easy and cost-effective it is to build a system around it that satisfies your needs.
In our next blogpost, we go deeper into what it means to set up Foundation Model Operations or FMOps.