From predicting the impact of genetic mutations on protein function to unraveling complex drug-target interactions, AI, and in particular the specially trained large language models known as BioLLMs, is transforming early-stage drug discovery. Our previous article laid out the basics of how BioLLMs drive these remarkable innovations.
While BioLLMs show immense promise, they rarely solve complex real-world drug discovery problems straight out of the box. They require fine-tuning by the kind of specialized software engineers found here at Loka, where we're exploring a wide range of BioLLM architectures with an eye toward production-ready use cases.
In this article, aimed at engineers and CTOs at biotech companies of all sizes, we present two exemplary architectures. (They're admittedly quite technical and benefit from a preexisting understanding of the subject matter.) Our goal is to demonstrate that these models can lead to new avenues to explore and, ultimately, to groundbreaking new medications to bring to market.
Single-Input Predictions: From Variant Effect Prediction to Gene Expression
BioLLMs are trained using masked language modeling, where the objective is to predict the likelihood of masked tokens (in this case, amino acids) within a biological sequence. This functionality can be adapted to specific computational tasks. For instance, variant effect prediction (VEP), which assesses how amino acid mutations affect protein function, can be framed as estimating the probability of a mutant amino acid occurring at a given position in the sequence. This approach is effective because foundation models such as Meta's ESM1b have been shown to accurately distinguish pathogenic from benign variants based on the predicted likelihood of each mutation. Dysfunctional mutations tend to be selected against during evolution and thus appear rarely in natural sequences, creating a strong correlation between a mutation's predicted likelihood and its pathogenicity. We have therefore successfully applied ESM in a zero-shot fashion (i.e., to perform a task for which it has not been specifically fine-tuned) to predict the effects of mutations (Figure 1).
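To make this concrete, here's a minimal sketch of zero-shot variant scoring with the fair-esm package, following the masked-marginal idea described above: mask the residue of interest, run the model, and compare the predicted log-probabilities of the mutant and wild-type amino acids. The toy sequence and mutation are illustrative only.

```python
# Minimal sketch of zero-shot variant effect scoring with ESM1b
# (assumes the fair-esm package: pip install fair-esm).
import torch
import esm

# Load the pre-trained ESM1b model and its alphabet (tokenizer).
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

def score_variant(sequence: str, position: int, mutant_aa: str) -> float:
    """Log-likelihood ratio of mutant vs. wild-type amino acid at `position`
    (0-based). Higher scores suggest the mutation is better tolerated."""
    wild_aa = sequence[position]
    _, _, tokens = batch_converter([("protein", sequence)])
    # Token 0 is the BOS token, so the residue sits at position + 1.
    tokens[0, position + 1] = alphabet.mask_idx
    with torch.no_grad():
        logits = model(tokens)["logits"]
    log_probs = torch.log_softmax(logits[0, position + 1], dim=-1)
    return (log_probs[alphabet.get_idx(mutant_aa)]
            - log_probs[alphabet.get_idx(wild_aa)]).item()

# Illustrative example: score a Q -> V substitution at residue 10.
# A strongly negative score indicates a disfavored, potentially pathogenic variant.
print(score_variant("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 10, "V"))
```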
Rather than using BioLLMs directly as they are, we can fine-tune them to better adapt their learned representations to our specific data. This is achieved by unfreezing some of the layers and adding a task-specific head, such as a regression or classification layer, to leverage the BioLLM's power for specific downstream tasks. For instance, we can fine-tune DNABERT-2 to predict gene expression levels for different DNA promoter sequences (Figure 2). This approach is useful in drug discovery for tasks like identifying regulatory elements of target genes, improving therapeutic protein production and analyzing drug-induced gene expression changes.
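To sketch what this looks like in code, the snippet below wraps the public DNABERT-2 checkpoint (zhihan1996/DNABERT-2-117M on Hugging Face) with a simple regression head. The frozen encoder, mean pooling and toy expression label are our illustrative choices, not a prescription.

```python
# Hedged sketch: DNABERT-2 encoder + regression head for promoter expression.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "zhihan1996/DNABERT-2-117M"  # public DNABERT-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

class ExpressionRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
        # Freeze the encoder here; in practice you may unfreeze the top layers.
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)  # regression head

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask=attention_mask)[0]
        pooled = hidden.mean(dim=1)  # mean-pool the token embeddings
        return self.head(pooled).squeeze(-1)

model = ExpressionRegressor()
batch = tokenizer(["ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"],
                  return_tensors="pt", padding=True)
pred = model(batch["input_ids"], batch["attention_mask"])
loss = nn.MSELoss()(pred, torch.tensor([2.5]))  # toy expression-level label
loss.backward()  # with the encoder frozen, only the head is updated
```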
Multiple-Input Predictions: From Siamese Networks to Dual-Model Architectures
When the desired task requires modeling the relationship between two entities, as in the case of protein-protein interaction (PPI) prediction, two input sequences must be fed to the model simultaneously. Siamese architectures (sometimes called twin neural networks) learn shared embeddings by applying the same network to two different input vectors and producing comparable output vectors. In the use case below, we used the pre-trained ProtBERT model to process both proteins, generate an embedding for each, then combine the two to classify whether the proteins interact (Figure 3).
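A minimal sketch of such a Siamese PPI classifier, built around the public Rostlab/prot_bert checkpoint on Hugging Face, could look like the following; the mean pooling and the two-layer classification head are illustrative choices.

```python
# Hedged sketch of a Siamese PPI classifier with a single shared ProtBERT encoder.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)

class SiamesePPI(nn.Module):
    def __init__(self):
        super().__init__()
        # A single encoder instance means shared weights for both inputs.
        self.encoder = AutoModel.from_pretrained("Rostlab/prot_bert")
        self.classifier = nn.Sequential(
            nn.Linear(2 * self.encoder.config.hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, 1),  # logit for "interacts" vs. "does not interact"
        )

    def embed(self, seq: str) -> torch.Tensor:
        # ProtBERT expects space-separated amino acids, e.g. "M K T A Y".
        tokens = tokenizer(" ".join(seq), return_tensors="pt")
        return self.encoder(**tokens).last_hidden_state.mean(dim=1)

    def forward(self, seq_a: str, seq_b: str) -> torch.Tensor:
        pair = torch.cat([self.embed(seq_a), self.embed(seq_b)], dim=-1)
        return self.classifier(pair)

model = SiamesePPI()
logit = model("MKTAYIAKQR", "GSHMVKVYAPAS")  # toy sequences
print(torch.sigmoid(logit))  # predicted interaction probability (untrained head)
```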
The shared weights in the Siamese network are important because they ensure consistent, comparable representations by applying the same transformations to both inputs. This not only simplifies the learning process but also reduces the number of trainable parameters, making it efficient when dealing with inputs of the same type. However, the situation changes when dealing with different types of inputs, such as in drug-target interaction (DTI) prediction. In these cases the inputs, a protein and a molecule, are distinct modalities, so the model weights cannot be shared. To handle this information, separate pre-trained models are required for each input type, such as ESM for proteins and MolFormer for molecules. We can either fine-tune these models by unfreezing their weights or use them to extract embeddings, which serve as input feature vectors. These embeddings can then be concatenated and fed into a classification head. Because BioLLMs are trained on millions of samples, the embeddings extracted from them contain rich, high-dimensional information that captures complex patterns and relationships within biological data. In this case, the aim is to predict the interaction between the protein and the molecule, and a tailored approach is required to accommodate the distinct nature of each input (Figure 4).
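The dual-encoder variant is a small step from the Siamese sketch above. Below is a hedged example using public checkpoints (facebook/esm2_t12_35M_UR50D for the protein and ibm/MoLFormer-XL-both-10pct for the molecule); we treat both encoders as frozen feature extractors, and the untrained head, pooling choices and example inputs are purely illustrative.

```python
# Hedged sketch of a dual-encoder DTI model: ESM-2 embeds the protein,
# MoLFormer embeds the SMILES string, and a classification head combines them.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

# Protein encoder: a small public ESM-2 checkpoint.
prot_tok = AutoTokenizer.from_pretrained("facebook/esm2_t12_35M_UR50D")
prot_enc = AutoModel.from_pretrained("facebook/esm2_t12_35M_UR50D")
# Molecule encoder: public MoLFormer checkpoint (runs custom code from the Hub).
mol_tok = AutoTokenizer.from_pretrained("ibm/MoLFormer-XL-both-10pct",
                                        trust_remote_code=True)
mol_enc = AutoModel.from_pretrained("ibm/MoLFormer-XL-both-10pct",
                                    trust_remote_code=True)

# Illustrative (untrained) classification head over the concatenated embeddings.
head = nn.Sequential(
    nn.Linear(prot_enc.config.hidden_size + mol_enc.config.hidden_size, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)

def interaction_logit(protein_seq: str, smiles: str) -> torch.Tensor:
    with torch.no_grad():  # both encoders used as frozen feature extractors
        prot_emb = prot_enc(**prot_tok(protein_seq,
                                       return_tensors="pt")).last_hidden_state.mean(dim=1)
        mol_emb = mol_enc(**mol_tok(smiles, return_tensors="pt")).pooler_output
    return head(torch.cat([prot_emb, mol_emb], dim=-1))

# Toy example: a short protein fragment and aspirin's SMILES string.
logit = interaction_logit("MKTAYIAKQRQISFVKSHFSRQ", "CC(=O)OC1=CC=CC=C1C(=O)O")
print(torch.sigmoid(logit))  # meaningless until the head is trained
```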
All in all, how we apply pre-trained models in our architectures depends on the specific use case. The key is understanding the power of these models and applying them creatively to solve complex biological problems. Loka’s bioengineers leverage BioLLMs to enhance various stages of drug discovery, from target identification to lead optimization. These models enable us to deliver innovative solutions to our clients, accelerating the discovery process and unlocking new therapeutic possibilities.
Beyond Language: What’s Next in GenAI Drug Discovery
The versatile capabilities of BioLLMs—exemplified by their effective application in areas such as variant effect prediction, molecular property prediction and drug-target interaction analysis—underscore their growing ability to address the shortcomings of traditional methodologies. However, while BioLLMs can provide valuable insights into drug-target and protein-protein interactions, they are inherently limited by their inability to incorporate 3D structural information. This spatial context is often critical, as the shape and arrangement of molecules fundamentally influence their interactions.
To bridge this gap, the future of drug discovery depends on multimodal approaches that integrate diverse molecular representations, combining the strengths of BioLLMs with advanced methods like diffusion models and geometric deep learning. This synthesis of cutting-edge AI technologies and biological research promises further transformation in drug discovery, enabling innovative solutions to the complex challenges of drug development and paving the way for groundbreaking therapies. Loka is already taking steps in this exciting new direction, and we’ll take you there in the next installment of our AI and Life Sciences series. Stay tuned!
To learn more about Loka’s BioLLM services, visit Loka's BioLLMs page, or contact Jorge Sampaio at jorge.sampaio@loka.com or Telmo at telmo@loka.com.
References
- [ESM] https://github.com/facebookresearch/esm
- [DNABERT] https://github.com/jerryji1993/DNABERT
- [MolFormer] https://github.com/IBM/molformer