AI is playing a transformative role in drug discovery, revolutionizing the way researchers design new drugs, therapies and molecules. At the forefront of this transformation are large language models (LLMs) and diffusion models, each offering unique opportunities for creating novel drugs and medications.
LLMs focus on text-based tasks, like generating human-like text and answering complex queries. In our two previous articles we explained what BioLLMs are and how they’re transforming drug discovery by accelerating research in ways previously thought impossible. For instance, BioLLMs such as ProGen, trained on large collections of protein sequences, can generate novel and functional protein designs by treating amino acid sequences like language.
LLMs are powerful but their application comes with unique challenges, particularly in balancing diversity and quality in the sequences they generate. LLMs can be overly conservative, suggesting compounds that are already well-known, which limits the discovery of truly novel compounds. Tweaking sampling parameters like temperature can introduce randomness, producing a broader range of unique sequences. However, this can result in hallucinations or low-quality outputs, namely sequences that are non-functional and/or unable to fold properly into stable 3D protein structures.
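To make the temperature trade-off concrete, here is a minimal sketch of temperature-scaled sampling over the 20 canonical amino acids. The logits are made-up numbers standing in for a model's next-residue scores, and `sample_residue` and `entropy` are illustrative helpers, not part of ProGen or any specific library.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues

def sample_residue(logits, temperature, rng):
    """Sample one residue index from temperature-scaled logits."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()            # stabilize the softmax numerically
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs)), probs

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter, more diverse distribution."""
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())

rng = np.random.default_rng(0)
# Assumption: random scores stand in for a real model's next-residue logits.
logits = rng.normal(size=len(AMINO_ACIDS)) * 3.0

idx_cold, cold = sample_residue(logits, 0.5, rng)  # conservative sampling
idx_hot, hot = sample_residue(logits, 2.0, rng)    # exploratory sampling
print("T=0.5 entropy:", entropy(cold), "sampled", AMINO_ACIDS[idx_cold])
print("T=2.0 entropy:", entropy(hot), "sampled", AMINO_ACIDS[idx_hot])
```

A low temperature concentrates probability on the top-scoring residues, while a high temperature flattens the distribution. That flattening is where both the extra diversity and the extra risk of low-quality sequences come from.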
While LLMs excel at generating sequence-level information, they are limited in directly inferring the 3D structure of proteins, which often requires spatial modeling and an understanding of geometric relationships. In these areas, diffusion models have emerged as a powerful alternative tool. Initially designed for image and video processing applications, diffusion models are being repurposed to solve drug discovery problems that rely on the accurate modeling of complex 3D structures, including protein folding and drug-protein interactions.
What remains to be seen is whether diffusion models can fully meet these challenges or if their strengths complement rather than replace those of LLMs. Are diffusion models the ultimate solution for AI-driven drug discovery? To find out, we must explore the unique capabilities, trade-offs and synergies between them and BioLLMs.
Diffusion Models: Unlocking 3D Complexity
The first paper on diffusion models emerged in 2015, introducing the concept of gradually adding noise to data and learning how to reverse this process to generate data samples. However, it wasn't until 2020-2021 that diffusion models became widely recognized, specifically with the paper "Denoising Diffusion Probabilistic Models" (DDPMs), which showcased their ability to generate high-quality images. Since then, diffusion models have evolved and been applied to tasks beyond image generation, particularly in drug discovery.
Why are diffusion models so well-suited for drug discovery? Two reasons: First, diffusion models excel at generating detailed, realistic outputs. Adding noise in an iterative, controlled manner and then learning to remove it step by step allows the model to capture fine details at each stage. Second, diffusion models follow a simple and stable training approach. The model’s single-objective framework (optimizing a bound on the data likelihood) and the stepwise denoising process make training reliable.
A notable example of a diffusion model applied to drug discovery is RFdiffusion. Based on RoseTTAFold, this model generates high-quality protein structures by learning how to reverse a noising process applied to protein samples from the Protein Data Bank.
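The noising process and the single training objective can be sketched in a few lines. The example below assumes the DDPM formulation with a linear noise schedule and the simplified noise-prediction loss. The function names are illustrative, and a real model would replace the stand-in prediction with a neural network.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Jump straight to step t of the forward (noising) process."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

def noise_prediction_loss(eps_pred, eps):
    """Simplified training objective: MSE between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((8, 3))             # toy data sample
xt, eps = q_sample(x0, 500, alpha_bar, rng)  # partially noised sample

# A trained network would predict eps from (xt, t); zeros stand in for it here.
print("loss of an untrained stand-in:", noise_prediction_loss(np.zeros_like(eps), eps))
print("signal kept at the last step:", alpha_bar[-1])
```

By the final step almost none of the original signal remains, which is why generation can start from pure noise and denoise step by step.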
A protein structure is represented by atom coordinates defining the 3D shape that the protein naturally folds into. The process starts by perturbing amino acid positions and orientations until we get a noisy version of the protein structure, much like a blurred image. The model is trained to reverse this corruption by gradually refining its predictions, ultimately producing realistic protein backbones. The training process minimizes the mean squared error (MSE) between the predicted and true protein structures across all residues. Through this training, RFdiffusion learns to create diverse protein backbones, which widens the design space available for drug discovery.
What sets RFdiffusion apart from sequence-based models like LLMs, and makes it a natural fit for protein design, is that it operates directly on amino acid coordinates. Additionally, because each generation starts from a different sample of pure random noise, the outputs are naturally diverse.
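As a rough illustration of the structure-level loss, the sketch below perturbs a toy C-alpha trace and measures the per-residue MSE against the original. It only covers positional noise on coordinates and ignores residue orientations, so it is a simplification rather than RFdiffusion's actual training code. The helix-like backbone and all function names are hypothetical stand-ins.

```python
import numpy as np

def toy_backbone(num_residues):
    """Hypothetical helix-like C-alpha trace, used only as stand-in data."""
    i = np.arange(num_residues)
    return np.stack(
        [2.3 * np.cos(1.745 * i), 2.3 * np.sin(1.745 * i), 1.5 * i], axis=1
    )

def perturb_backbone(coords, noise_scale, rng):
    """Add isotropic Gaussian noise (in angstroms) to every residue position."""
    return coords + noise_scale * rng.standard_normal(coords.shape)

def structure_mse(pred, true):
    """Squared distance per residue, averaged over all residues."""
    per_residue = np.sum((pred - true) ** 2, axis=-1)
    return float(per_residue.mean())

true_coords = toy_backbone(60)
# The same seed gives the same noise pattern, so only the scale differs.
mild = perturb_backbone(true_coords, 0.5, np.random.default_rng(0))
heavy = perturb_backbone(true_coords, 5.0, np.random.default_rng(0))

print("MSE, mild noise: ", structure_mse(mild, true_coords))
print("MSE, heavy noise:", structure_mse(heavy, true_coords))
```

In training, the "prediction" would come from the denoising network rather than from a perturbed copy, and the loss would push that prediction toward the true structure.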
A Practical Use Case: Generating Nanobody Structures
To provide a technical overview and a practical example of how these models can be used in drug discovery, let's focus on the challenge of generating nanobody structures, which are increasingly critical in real-world applications such as targeted therapies and diagnostics.
As previously mentioned, LLMs like ProGen show promise in generating nanobody sequences but often fail to balance diversity and quality. In our experiments, pushing for a wider range of sequences also produced many invalid ones. To assess the quality of the valid outputs, we then had to rely on an auxiliary model to predict the 3D structure from the generated sequences.
With RFdiffusion, we can directly generate backbone structures, providing greater control over the process. RFdiffusion’s capabilities allow us to fix certain parts of a structure, condition generation on a target and even manipulate the structure's topology. Although RFdiffusion was trained on general protein structures, Loka’s bioengineers explored its potential for generating a more specialized class of structures: nanobodies.
This approach began with motif scaffolding, a feature in RFdiffusion that allows us to fix specific regions of the structure and define how they connect, including the number of residues involved. For instance, we can focus on generating CDR3, the most variable region of a nanobody, while preserving the rest of the structure. These experiments produced promising results: some structures had root mean square deviation (RMSD) values below 2 Å compared to the reference nanobody. Such low RMSD values indicate a high degree of structural similarity, suggesting that RFdiffusion has learned key aspects of nanobody topology.
Next, we wanted to condition the generation on a specific target protein. RFdiffusion allows for inputting a target protein structure, but this presents new challenges. Left to generate freely against the target, the model needed intervention to avoid undesirable outcomes: the more freedom it was given, the more it tended to generate helical topologies that were inconsistent with typical nanobody structures.
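For readers unfamiliar with the metric, RMSD after optimal superposition can be computed with the Kabsch algorithm. The sketch below is a generic NumPy implementation on synthetic coordinates, not the exact evaluation code used in these experiments. It assumes both structures have the same number of atoms in matching order.

```python
import numpy as np

def kabsch_rmsd(mobile, reference):
    """RMSD (same units as the input) after optimally superposing mobile onto reference."""
    P = np.asarray(mobile, dtype=float)
    Q = np.asarray(reference, dtype=float)
    Pc = P - P.mean(axis=0)            # remove translation
    Qc = Q - Q.mean(axis=0)
    H = Pc.T @ Qc                      # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = Pc @ R.T                   # apply the optimal rotation to each point
    return float(np.sqrt(np.mean(np.sum((P_rot - Qc) ** 2, axis=1))))

rng = np.random.default_rng(1)
reference = rng.normal(size=(30, 3)) * 10.0      # synthetic "reference" coordinates
theta = 0.7
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0,            0.0,           1.0]])
moved = reference @ rot_z.T + np.array([5.0, -3.0, 2.0])
noisy = moved + rng.normal(size=moved.shape)     # roughly 1 angstrom of noise per axis

print("rigid copy RMSD:", kabsch_rmsd(moved, reference))
print("noisy copy RMSD:", kabsch_rmsd(noisy, reference))
```

A rigidly moved copy of a structure scores essentially zero, so the metric reflects differences in shape rather than differences in position or orientation.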
To counteract this effect, we explored fold conditioning, a feature of RFdiffusion that helps guide generation toward specific topologies. While this approach helped, it occasionally led to inaccuracies in the binding site’s location or the structure’s orientation relative to the target.
These results did not come easily. The model required substantial fine-tuning and a high level of technical expertise to navigate its complex features and ensure the desired outcomes. From adjusting fold conditioning to optimizing motif scaffolding, each step demanded careful adaptation to avoid undesired results. Additionally, the process was time-consuming, requiring considerable computational resources to tweak each feature effectively and ensure accuracy.
Despite these challenges, RFdiffusion demonstrated strong potential for nanobody design: our experiments showcased its ability to produce nanobody structures with high similarity to known references, as evidenced by RMSD metrics, while also uncovering opportunities for further refinement. Its versatility in structure generation opens exciting avenues for future innovation.
Beyond the Showdown: Toward a Unified Frontier
Rather than competitors, LLMs and diffusion models are perhaps best understood as complementary, tackling different critical aspects of the drug-discovery process. LLMs excel in sequence generation, offering the ability to create novel protein sequences through their text-like understanding of amino acid chains. However, they fall short when spatial and geometric accuracy is required. Diffusion models, such as RFdiffusion, address this limitation by focusing on the 3D structure of proteins, bringing a new level of accuracy and diversity to protein design. That said, they also present some challenges: They can be computationally expensive, and training or fine-tuning them is not a trivial task, requiring significant resources and technical expertise.
Depending on the specific application, it may be more beneficial to leverage one class of models over the other, or even combine their strengths to overcome their individual limitations. Ultimately, the choice between LLMs and diffusion models will depend on the precise goals of the drug discovery process, the level of accuracy required and the available resources. Both models are incredibly powerful, and used together they could pave the way for even greater breakthroughs in drug discovery.