A data-centric framework for fine-tuning foundation models
Foundation models bring unprecedented opportunities for increasing productivity and enabling growth. Out of the box, they are able to perform generic tasks such as generating text, images and videos. However, in order to perform well on complex and domain-specific tasks, such as generating images in a certain visual style or working with legal or medical jargon, they need to be specialised.
Fondant has been developed to make this process, called fine-tuning of foundation models, as easy and performant as possible.
Foundation models are models that are trained on large and diverse data sources and can be used for a wide range of downstream tasks. As such, they form the "foundation" for other models.
For example, GPT-3 serves as the foundation for ChatGPT, a model adapted for question answering. Other examples of foundation models include Stable Diffusion, CLIP, Segment Anything (SAM), and many more.
Fondant is an open-source framework for data preparation and fine-tuning of foundation models, developed by ML6 together with the open-source community. Our goal is to make it easy and efficient to fine-tune large foundation models on domain-specific data.
Data quality and quantity are the main factors determining the power of fine-tuned AI models. However, preparing the data often takes 80 to 90% of the budget in real-world scenarios. With Fondant, we aim to make this process as painless as possible by providing an easy-to-use programming interface, composable pipelines and reusable components that can process terabyte-scale data loads in hours.
Visit our GitHub page and start testing and contributing to Fondant! On GitHub, you can find all the information you need to install and test Fondant and to create your own pipelines and components. Share your feedback with us: we are continuously adding features and components based on our users' needs.
A model’s performance is directly determined by the quantity and quality of data on which it was fine-tuned. Fondant makes it easy to collect, enrich and curate large-scale data for fine-tuning.
Fondant is compatible with data and model hubs such as Hugging Face. It supports all major clouds, giving you freedom and control and avoiding vendor lock-in. We also aim to support all data modalities (images, text, video, …) to enable fine-tuning of any foundation model.
Fondant makes it possible to create highly scalable pipelines of reusable components for enrichment and fine-tuning of large foundation models. It facilitates the smart collection, filtering and transformation of data and optimises fine-tuning. Fondant is easy to reuse and extend.
For optimal performance, foundation models need large amounts of data to be fine-tuned. Therefore, Fondant is built to scale. In future releases, we aim to enable fine-tuning or even training of large models through distributed compute and highly scalable pipelines.
Fondant is designed with datasets as the interface and built around a central manifest. This enables a write-once, read-many approach with minimal data movement, which reduces cost.
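To make the manifest idea more concrete, here is a minimal Python sketch of a write-once, read-many design. It is an illustration under stated assumptions, not Fondant's actual API or manifest schema: the `Manifest` class, the `with_columns` method and the storage paths are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Manifest:
    """Hypothetical manifest: maps each column group to the location of its data."""

    locations: dict = field(default_factory=dict)

    def with_columns(self, name: str, location: str) -> "Manifest":
        """Return a new manifest that also references a newly written column group."""
        if name in self.locations:
            # Write-once: data that a previous step wrote is never overwritten.
            raise ValueError(f"column group {name!r} already exists")
        # Only the small mapping is copied; the stored data itself stays in place.
        return Manifest({**self.locations, name: location})


# Step 1 writes the images once (paths are placeholders).
raw = Manifest().with_columns("images", "/data/step-1/images")
# Step 2 writes only the new captions and reuses the image location as-is.
captioned = raw.with_columns("captions", "/data/step-2/captions")

print(captioned.locations)
```

In this sketch each step adds a reference to the data it wrote and passes the manifest on, so later steps can read earlier outputs in place instead of copying them.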
Large language models (LLMs) such as BERT or GPT tend to struggle when dealing with domain-specific language, for example in legal texts. For this reason, we used Fondant to prepare large Dutch and French datasets of millions of documents and fine-tuned general BERT models for legal language. This resulted in a 25% boost in performance for tasks such as entity extraction and semantic search and enabled us to build an AI-driven knowledge engine for notaries.
AI image generation models such as Stable Diffusion can create images of nearly anything, but they struggle to provide consistent quality when creating images in a specific style, such as clipart. In this case, fine-tuning on specific data is needed. Using Fondant, we collected and prepared a large dataset of carefully selected, clean-cut clipart images and fine-tuned Stable Diffusion for control of style, variability and quality while removing the need for elaborate prompt engineering. This resulted in a clipart generator tailored to the specific needs of the target audience.
Mastering a specific task, such as generating guided, realistic-looking interior designs, requires fine-tuning on specially prepared domain data. Fondant makes it easy to gather, filter, and enrich such data to create a Stable Diffusion-based model that allows you to redesign your interior in seconds. Click here for a demo!