Medha Hegde
Machine Learning Engineer
This blog post is aimed at those who want to understand how artificial intelligence is being applied in the field of biology, specifically to proteins. We give a brief overview of what proteins are, their characteristics and the applications of protein engineering. The potential of AI in this area is explored through an overview of current state-of-the-art models that tackle various protein-related problems.
Proteins are the essential building blocks of life and are omnipresent, playing an essential role in the functioning of every living organism. Proteins are large and complex molecules, and enzymes are a subgroup of proteins that speed up chemical reactions such as hydrolysis, condensation or hydroxylation. An estimated 75,000 different proteins keep the human body running.
Over the span of millions of years, nature has designed a complete toolbox of proteins that drive different facets of life: converting sunlight into high-energy molecules, breaking down molecules for energy, building the components of cell organelles, and so on. Through natural evolution, these proteins have been optimised to perform their tasks as efficiently as possible.
Today, we face problems that have arisen from environmental pollution, for example, or from new diseases that accompany increased life expectancies. Very often, enzymes, because of their composition and natural mode of action, can be at the core of the solution to these problems. For example, newly developed, short-lived enzymes could break down plastic in what remains an essentially natural process.
While evolution will, in some cases, ensure that such enzymes eventually emerge, the process could be sped up if we, as humans, could speak “protein” and design the necessary enzymes ourselves. Understanding and predicting the interplay between sequence, 3D structure and function is crucial to building enzymes with desired functions. Both the structure and the function are embedded in the protein's primary sequence.
Proteins are composed of anywhere from ten to several thousand building blocks, linearly chained together to form a string. These building blocks are amino acids, of which there are 20 naturally occurring ones. The composition and order of this linear string contain all the information needed for the 3D structure of the protein and thus also its function. However, multiple levels of protein organisation sit between the linear amino acid string and the 3D protein structure.
The primary structure, as described above, refers to the linear sequence of amino acids and is one-dimensional. Parts of this chain regularly fold or arrange themselves in a predefined way to form local components, such as an alpha helix or a flat beta sheet; this is known as the secondary structure. The order of the amino acids fully determines the formation of this secondary structure. The subsequent folding of these components creates the overall 3D shape of the protein, called the tertiary structure. The quaternary structure is formed when multiple protein chains organise into an ordered assembly, but this does not occur for every protein. It is only from the tertiary structure onwards that proteins have a biological function.
The 3D structure determines the chemical reactions that the enzyme can perform. Every enzyme possesses a specialized active site where catalytic reactions occur. This portion of the enzyme is characterized by its unique shape and functional groups, allowing it to securely interact with the molecules involved in the reaction, known as substrates. Consequently, the active site comprises a limited set of catalytic amino acids that play a crucial role in facilitating the reaction.
It is important to understand that the folding of proteins is a well-regulated process, and that the folding blueprint is fully embedded in the original amino acid chain. Therefore, the linear amino acid chain is information-complete.
The holy grail of protein design is jumping from sequence to function and, in reverse, from function to sequence. Based on the sequence, we could understand what the protein does and how it behaves. But more importantly, we could obtain a protein sequence that fulfils a specific, desired function. However, this is a very challenging objective, and recent developments have focussed on the intermediate step towards structure.
The primary structure of a protein (“Sequence” in the image above), i.e., the linear chain of amino acids, determines its native state (“Structure” in the image). The folding process by which the protein reaches this final, unique form is not fully understood and is known as the “protein folding problem” (green arrow). The reverse of this process is known as “inverse folding” (red arrow). Protein function, i.e., the biological process the protein performs, is determined by its 3D structure, which in turn depends on the primary structure. As seen in the image above, these direct and indirect connections between the three are functional processes that could be modelled.
The primary structure is determined by a process called protein sequencing, which reads out the amino acid sequence that makes up the protein. The tertiary structure of a protein is measured by experimental methods that are expensive, time-consuming and not applicable to all proteins; only ~170k 3D protein structures have been determined this way, while about 250 million proteins have been sequenced. Methods to model the process of protein folding would help us understand the elementary units of life and facilitate faster and more sophisticated drug exploration.
Since physically measuring every protein structure is not feasible with current equipment, computational methods have been used to predict the structure instead. The final structure of a protein is a function of its amino acid sequence, so this function can be modelled by such prediction methods. This is where artificial intelligence comes in: deep learning methods have been shown to predict protein structures with greater accuracy than any other prediction method.
Notably, in 2020, Google’s DeepMind achieved breakthrough results with a model called AlphaFold, leading to claims that the protein folding problem had been “solved”. Many other deep learning models have since been published, working on protein folding as well as on other protein-related areas of research that we will discuss below. In the upcoming sections, recent models that have shown promising results are described. They cover the tasks of Protein Language Modelling, Structure Prediction, Inverse Folding, Function Prediction and Protein Design.
We start with Protein Language Models (PLMs), since they are used to represent protein sequences in the form of embeddings. Embeddings are mathematical vector representations of protein sequences that capture information about the structure and function of the protein itself (see this paper for more information). These embeddings can then be used in the Structure, Sequence and Function Prediction models described below.
Large Language Models (LLMs) are able to model natural language structure and grammar simply by training on large amounts of text data. They have been shown to be very useful for tasks such as text generation and translation, with bigger and bigger models being released over time with improved capabilities and applications. PLMs aim to do the same: learn the evolutionary patterns and principles that guide the functioning of proteins by training on large amounts of protein sequence data. Protein sequences could be considered the “words” in the language of biology. We give an overview of the ProtTrans, ProteinBERT, ProGen2, ProtGPT2 and ESM-2 models.
In the 2020 ProtTrans paper, six LLM architectures (T5, Electra, BERT, Albert, Transformer-XL and XLNet) were pretrained on raw protein sequences and were shown to capture features of amino acids, protein structure, domains and function. The models are available here and can be used to extract features, fine-tune models, predict secondary structure and generate sequences.
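As a quick illustration of how these models are used for feature extraction, below is a minimal sketch, assuming the Rostlab/prot_bert checkpoint on Hugging Face and following its documented usage; the example sequence is a placeholder.

```python
import re
import torch
from transformers import BertModel, BertTokenizer

# Load a ProtTrans checkpoint (assumed model ID: Rostlab/prot_bert).
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.eval()

# A placeholder amino acid sequence; ProtBert expects space-separated residues,
# with rare amino acids (U, Z, O, B) mapped to X.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))

inputs = tokenizer(prepared, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-residue embeddings: shape (1, sequence length + special tokens, hidden size).
residue_embeddings = outputs.last_hidden_state
# Mean-pool over the residues (dropping [CLS] and [SEP]) for a per-protein embedding.
protein_embedding = residue_embeddings[0, 1:-1].mean(dim=0)
print(protein_embedding.shape)
```

The resulting per-protein vector can then be fed to a downstream classifier or regressor, which is exactly how PLM embeddings are used in the prediction models described later.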
ProteinBERT, released in 2021, uses the classic BERT architecture and was pretrained on 106 million sequences on two tasks: bidirectional language modelling and GO (Gene Ontology) annotation of sequences, thus taking both protein sequences and GO labels as input. Despite its smaller size, ProteinBERT exhibits performance comparable to, and sometimes exceeding, that of larger models such as ProtT5.
ProGen2, a 2022 PLM from Salesforce, is a Transformer-based model trained on billions of protein sequences to autoregressively predict the next token in a sequence. Its predecessor, ProGen, was the first decoder-only model trained specifically for protein sequence design. The model comes in four size variants (the largest one is mentioned in the table above) and is able to capture the distribution of observed proteins and generate new protein sequences. These generated sequences resemble existing ones but may not actually exist in nature, which opens the door to protein engineering and to creating proteins that perform specific functions. The model is freely available and has been shown (using AlphaFold) to generate sequences that fold into well-formed structures.
Also released in 2022, ProtGPT2 similarly models protein sequences using an autoregressive, GPT2-like Transformer architecture. It is a smaller model that has been trained on 50 million sequences. It is capable of producing proteins in uncharted areas of the natural protein landscape, while still exhibiting characteristics that closely resemble those found in nature.
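To give a feel for how such autoregressive models are used in practice, here is a minimal generation sketch, assuming the nferruz/ProtGPT2 checkpoint on Hugging Face and the sampling settings suggested on its model card (treat the exact parameters as illustrative).

```python
from transformers import pipeline

# Assumed checkpoint: nferruz/ProtGPT2, published by the ProtGPT2 authors on Hugging Face.
generator = pipeline("text-generation", model="nferruz/ProtGPT2")

# "<|endoftext|>" acts as the start token; the sampling parameters follow the model card.
candidates = generator(
    "<|endoftext|>",
    max_length=100,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=5,
    eos_token_id=0,
)

for candidate in candidates:
    # Generated sequences come back FASTA-style, with newlines inserted every 60 residues.
    print(candidate["generated_text"].replace("\n", ""))
```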
The ESM-2 family of models was released by Meta AI along with ESMFold (detailed in a later section), which is a structure prediction model. ESM-2 is an encoder-only Transformer, and its largest variant is the largest protein language model currently available, which enabled it to outperform other PLMs on structure prediction benchmarks. It was trained on 65 million unique protein sequences.
Models attempting to “solve” the protein folding problem described above predict the structure of a protein from its amino acid sequence. Many models have built upon the work of AlphaFold to predict structures using different methods. Here we explore AlphaFold, RoseTTAFold, OmegaFold and ESMFold.
As mentioned above, DeepMind’s 2020 AlphaFold model is a deep-learning architecture that predicts the 3D structure of a protein from its amino acid sequence with high accuracy. The 3D structure is modelled as a graph, and the prediction itself is treated as a graph inference problem. The model leverages evolutionary information from related proteins to predict the 3D coordinates of the final structure using a transformer-based architecture. It is trained on publicly available datasets such as the Protein Data Bank and UniProt, while also incorporating structures predicted with high confidence back into training to make use of unlabelled sequences. The model was publicly released, and its predictions were used to create the AlphaFold Database of 3D structures for almost every protein sequenced to date. At the time of its release, it became the state of the art for protein structure prediction from amino acid sequences, with particularly good predictions for sequences with homologues.
In 2021, a model named RoseTTAFold that similarly predicts protein structures was released by the Baker Lab. It differs from AlphaFold in that it is a “three-track” network: it simultaneously considers the primary structure, the 2D distance map and the tertiary structure during training and prediction, and it is also able to model protein complexes. It comes close to AlphaFold’s performance on many benchmarks. Both rely on Multiple Sequence Alignments (MSAs), which leverage similar sequences or homologues, and hence neither performs as well on sequences for which no MSAs are available.
OmegaFold uses a large pretrained protein language model (OmegaPLM) to predict tertiary structure with an alignment-free methodology, i.e., without the need for MSAs, so it can make predictions from a single protein sequence alone. Just as language models like GPT-4 learn language structure and form by processing large amounts of text data, protein language models learn analogous structural information by training on large numbers of protein sequences (the analogue of sentences in natural language). Unlike natural language, protein structure lives in the 3D world, so geometric intuition is incorporated through a vector geometry transformer in the architecture. OmegaFold matches the performance of AlphaFold and RoseTTAFold on the CASP and CAMEO datasets while outperforming both on single sequences. Because it does not rely on MSAs or known structures, it is about 10 times faster than both.
In 2022, Meta AI unveiled their ESMFold protein structure prediction model, which also makes use of a large (the largest, in fact) protein language model, ESM-2. As with OmegaFold, the model does not require MSAs and outperforms AlphaFold and RoseTTAFold on single sequences. The largest model in their ensemble is about 150x the size of AlphaFold, with a reported 60x inference speed-up over the previous models for shorter sequences. Owing to this speed increase, a large metagenomic database called the ESM Metagenomic Atlas was created, revealing structures at the scale of hundreds of millions of proteins.
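To make this concrete, here is a minimal single-sequence prediction sketch, assuming the fair-esm Python package and the ESMFold usage shown in its repository; the sequence is a placeholder and a GPU is assumed to be available.

```python
import torch
import esm

# Load the ESMFold model (assumed API: esm.pretrained.esmfold_v1 from the fair-esm package).
model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

# A placeholder sequence; no MSA or template search is needed.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"

with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)

# The prediction is returned as a PDB-formatted string with per-residue confidence scores.
with open("prediction.pdb", "w") as handle:
    handle.write(pdb_string)
```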
The reverse of protein folding, termed inverse folding, starts from a specific target protein structure and searches for the protein sequence(s) that fold into that structure. A solution to this problem would aid the de novo design of proteins: designing new protein sequences that fold into a specific structure to perform a desired biological function. For example, inverse folding models could be used to design proteins with a particular structure that enhances T cells so that they can better fight cancer². As with the protein folding problem, several AI models are able to model this reverse process and generate protein sequences conditionally. Here we focus on the ESM-IF1, ProteinMPNN and MIF-ST models.
In 2022, the ESM-IF1 model was shown to be able to predict protein sequences from the 3D coordinates of a protein’s tertiary structure. Since the existing sequence-structure database was very small (only ~16k structures), the authors augmented the data with 12 million structures predicted by AlphaFold. The problem was modelled as a seq2seq task between structures and amino acid sequences, maximising the conditional probability of a sequence given the structural coordinates. A generic Transformer was used for this task, together with a GVP-GNN (Geometric Vector Perceptron Graph Neural Network) for geometric feature extraction.
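As a rough sketch of how this looks in code, the example below samples a sequence for a given backbone, assuming the fair-esm package's inverse folding utilities and its documented ESM-IF1 checkpoint; the PDB file name and chain ID are placeholders.

```python
import esm
import esm.inverse_folding

# Load the ESM-IF1 checkpoint (assumed API from the fair-esm inverse folding examples).
model, alphabet = esm.pretrained.esm_if1_gvp4_t16_142M_UR50()
model = model.eval()

# Extract backbone coordinates for one chain of a target structure (placeholder file and chain).
coords, native_seq = esm.inverse_folding.util.load_coords("target.pdb", "A")

# Sample a candidate sequence conditioned on the backbone geometry.
sampled_seq = model.sample(coords, temperature=1.0)
print("native :", native_seq)
print("sampled:", sampled_seq)
```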
Also in 2022, and again from the Baker Lab, ProteinMPNN was shown to model the inverse folding process by training an autoregressive model on experimentally determined structures. The model follows an encoder-decoder structure: the encoder takes the distances between the elements that form the protein structure and produces graph node and edge features, and the decoder then uses those features to generate amino acids iteratively. The generated sequences were evaluated by predicting their structure and comparing it to the original. Significantly, the designed sequences were also experimentally evaluated on the tasks of protein monomer design, nanocage design and function design and were shown to be robust and accurate. Several previously “failed” designs were successfully rescued by ProteinMPNN.
Released in 2023, the MIF-ST (Masked Inverse Folding with Sequence Transfer) model leverages a structure-conditioned, GNN-based masked language model. The outputs of a masked language model trained only on protein sequences are fed into the MIF-ST model, which is then pretrained conditioned on structures. Here, inverse folding serves only as a pretraining task so that the model performs well on downstream tasks, such as creating functional homologues by inverse folding a protein's structure and then sampling the sequence space. It has also been shown to predict the effects of mutations.
Protein function refers to the biological process a protein performs. This process is largely determined by the tertiary structure, which in turn is determined by the primary sequence of amino acids. Knowing the function of a particular protein sequence would be very helpful in understanding the behaviour of biological systems. Protein function is generally expressed by a classification system such as the Gene Ontology (GO), which classifies proteins based on their function and intracellular location, or the EC (Enzyme Commission) number, which classifies enzymes based on the chemical reactions they catalyze. Below we take a look at the DeepGO, DeepFRI, GAT-GO, SPROF-GO and ProtNLM models.
Released in 2018, DeepGO introduced an approach to predict protein function from protein sequences. It employs deep neural networks to learn from both sequence data and protein-protein interaction (PPI) network data, organising its predictions hierarchically according to GO classes. A CNN is used to obtain embedding representations of protein sequences, a classification model then refines features for each class, and lastly a further model allows for multi-modal data integration. DeepGO was the state-of-the-art sequence-based protein function prediction tool at the time of its publication.
DeepFRI, a 2019 model, predicts protein function, as represented by both the GO class and the EC number, from protein structure and features extracted from protein sequences. An LSTM protein language model is used to obtain residue-level features from the sequences. A GCN (Graph Convolutional Network) is then applied to these features to construct protein-level features and predict probabilities for each function. Performance is enhanced by including predicted structures during the training process.
The GAT-GO model is similar to DeepFRI, but it uses a GAT (Graph Attention Network), a type of GNN that uses self-attention, instead of a GCN. Additionally, instead of the LSTM language model, the pretrained large protein language model ESM1 is used to extract features. GAT-GO is shown to outperform existing function predictors by making use of high-capacity pretrained protein embeddings, predicted protein structure and sequence features.
Released in 2022, SPROF-GO is a sequence-based, MSA-free protein function prediction model that predicts GO classes directly from the protein sequence. The architecture consists of a pretrained T5 protein language model, whose embedding matrix is fed to two Multi-Layer Perceptrons (MLPs) to produce an attention vector and a hidden embedding matrix. Since the GO classification system is structured in a class-subclass manner, the function prediction problem is modelled as a hierarchical multi-label classification task, with the classes arranged as a Directed Acyclic Graph (DAG). SPROF-GO beats other state-of-the-art models and can also generalise to non-homologous and unseen proteins.
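To illustrate the hierarchical aspect: because a child GO term implies all of its ancestors, raw per-term probabilities are typically made consistent with the DAG. SPROF-GO uses its own label propagation scheme; the toy sketch below only shows the generic idea of lifting each ancestor's score to at least the score of its best-scoring descendant (terms and scores are made up).

```python
from typing import Dict, List

# Toy GO-like hierarchy: each term maps to its parent terms (made-up term IDs).
PARENTS: Dict[str, List[str]] = {
    "GO:child_a": ["GO:mid"],
    "GO:child_b": ["GO:mid"],
    "GO:mid": ["GO:root"],
    "GO:root": [],
}

def propagate(scores: Dict[str, float]) -> Dict[str, float]:
    """Make raw per-term probabilities consistent with the hierarchy: every ancestor
    receives at least the score of its highest-scoring descendant."""
    consistent = dict(scores)
    changed = True
    # Repeatedly push scores upward until nothing changes (fine for a small DAG).
    while changed:
        changed = False
        for term, parents in PARENTS.items():
            for parent in parents:
                if consistent[term] > consistent.get(parent, 0.0):
                    consistent[parent] = consistent[term]
                    changed = True
    return consistent

# Made-up raw classifier outputs that violate the hierarchy (child scored above its parent).
raw = {"GO:child_a": 0.9, "GO:child_b": 0.2, "GO:mid": 0.4, "GO:root": 0.5}
print(propagate(raw))  # GO:mid and GO:root are lifted to at least 0.9
```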
ProtNLM, a natural language processing model, was developed in 2022 by Google Research in partnership with EMBL's European Bioinformatics Institute (EMBL-EBI). Taking a different approach to describing protein function, ProtNLM uses a Transformer architecture to predict a natural language description of a protein's function from its primary sequence. The model works analogously to an image-captioning model, with a protein sequence in place of an image. It is now used by UniProt in their automatic annotation pipeline to add names and descriptions for ~49 million uncharacterised protein sequences.
Here we describe two models involved in protein design, RFDiffusion and ProT-VAE. Protein design is the design of proteins intended to perform a specific function, and some of the models described above are used to facilitate this process.
De novo protein design aims to design novel proteins with a specific target function or structure. The RFDiffusion model uses a DDPM (denoising diffusion probabilistic model), inspired by image generation models such as DALL-E, together with RoseTTAFold, to perform protein design and generate new, diverse protein structures. The process involves first generating a protein backbone with RFDiffusion, then using ProteinMPNN to design a sequence that folds into this backbone, and finally evaluating the generated design with AlphaFold (see the sketch below). It also allows for conditioning, e.g., to generate a protein that binds a target protein with high affinity or a protein assembly with a desired symmetry. RFDiffusion is able to design proteins previously unobserved in nature.
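The published tools are each run through their own scripts and checkpoints, so the sketch below is only a schematic of how the backbone-to-sequence-to-validation loop fits together; generate_backbone, design_sequence and predict_and_score are hypothetical placeholders standing in for RFDiffusion, ProteinMPNN and a structure predictor respectively.

```python
from dataclasses import dataclass

# Schematic of the RFDiffusion -> ProteinMPNN -> structure-prediction loop described above.
# The three step functions are hypothetical placeholders: in practice each corresponds to
# running the respective tool's own inference script on the intermediate files.

@dataclass
class Design:
    sequence: str
    predicted_rmsd: float  # agreement between predicted structure and target backbone

def generate_backbone(length: int) -> str:
    """Placeholder for an RFDiffusion run that returns a backbone (e.g. a PDB path)."""
    raise NotImplementedError("run RFDiffusion inference here")

def design_sequence(backbone: str) -> str:
    """Placeholder for a ProteinMPNN run that designs a sequence for the backbone."""
    raise NotImplementedError("run ProteinMPNN here")

def predict_and_score(sequence: str, backbone: str) -> Design:
    """Placeholder for AlphaFold/ESMFold prediction plus RMSD comparison to the backbone."""
    raise NotImplementedError("run structure prediction and compute RMSD here")

def design_candidates(length: int = 120, n_sequences: int = 8, rmsd_cutoff: float = 2.0):
    """Backbone generation -> sequence design -> in-silico validation."""
    backbone = generate_backbone(length)
    accepted = []
    for _ in range(n_sequences):
        sequence = design_sequence(backbone)
        design = predict_and_score(sequence, backbone)
        if design.predicted_rmsd < rmsd_cutoff:  # keep designs that refold onto the target backbone
            accepted.append(design)
    return accepted
```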
ProT-VAE is a deep generative model that can generate diverse, highly functional protein sequences from specific families. Its architecture sandwiches a Variational Autoencoder (VAE) between ProtT5 encoder and decoder blocks, and the inputs during training are unaligned protein sequences. The VAE is trained on specific protein families, while the ProtT5 model was trained on millions of protein sequences after being initialised with T5 NLP weights. ProT-VAE enables “data-driven protein engineering” and is available on NVIDIA’s BioNeMo framework (claimed to be open-sourced soon).
The past few years have seen a major burst in protein-related AI research and model publications. The potential for applications in drug design, antibody engineering, vaccine development, disease biomarker identification and personalised medicine (to name a few) is vast. Understanding proteins and their functioning through the combined use of Protein Language Models, Structure Prediction models, Inverse Folding models and Function Prediction models can facilitate protein design with transformative effects. We will continue to monitor advancements while converting the most recent research into valuable applications within the field. If you are interested in our relevant current work, check out this press release!
For more information, you can contact me here: [email protected]