Biology is a decipherable, programmable, and in some ways even digital system.
DNA encodes the genetic instructions for every living organism on Earth using just four nucleotide bases: A (adenine), C (cytosine), G (guanine) and T (thymine).
Every protein in every living being consists of and is defined by a one-dimensional string of amino acids linked together in a particular order. Proteins range from a few dozen to several thousand amino acids in length, with each position drawn from 20 different amino acids.
Biology can be thought of as an information processing system.
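To make the "digital" framing concrete, here is a minimal sketch that treats DNA and protein sequences as strings over fixed alphabets and counts how many sequences of a given length are possible. The function names `is_valid` and `sequence_space_size` are hypothetical, chosen only for this illustration.

```python
# Alphabets named in the notes: 4 DNA bases and 20 standard amino acids
# (amino acids written with their usual one-letter codes).
DNA_BASES = "ACGT"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def is_valid(sequence: str, alphabet: str) -> bool:
    """Return True if every character of `sequence` belongs to `alphabet`."""
    return all(ch in alphabet for ch in sequence.upper())


def sequence_space_size(length: int, alphabet: str = AMINO_ACIDS) -> int:
    """Number of distinct sequences of `length` symbols over `alphabet`."""
    return len(alphabet) ** length


if __name__ == "__main__":
    print(is_valid("GATTACA", DNA_BASES))      # True
    print(sequence_space_size(3))              # 8000 possible tripeptides
    print(len(str(sequence_space_size(100))))  # 131 digits in 20**100
```

Even a short 100-residue protein has a sequence space with 131 decimal digits, which is why exhaustive search is hopeless and learned models are attractive.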
Large language models are at their most powerful when they can feast on vast volumes of signal-rich data, inferring latent patterns and deep structure that go well beyond the capacity of any human to absorb.
language of life
Proteins are involved in virtually every important activity that happens inside every living thing: digesting food, contracting muscles, moving oxygen throughout the body, attacking foreign viruses. Many of your hormones are proteins, and your hair is made largely of protein.
Proteins are so important because they are so versatile. They are able to adopt a vast array of different structures and carry out a vast array of functions, far more than any other type of biomolecule. This incredible versatility is a direct consequence of how proteins are built.
Every protein consists of a string of building blocks known as amino acids strung together in a particular order. Based on this one-dimensional amino acid sequence, proteins fold into complex three-dimensional shapes that enable them to carry out their biological functions.
A protein’s shape relates closely to its function. To take one example, antibody proteins fold into shapes that enable them to precisely identify and target foreign bodies, like a key fitting into a lock. As another example, enzymes—proteins that speed up biochemical reactions—are specifically shaped to bind with particular molecules and thus catalyze particular reactions. Understanding the shapes that proteins fold into is thus essential to understanding how organisms function, and ultimately how life itself works.
Determining a protein’s three-dimensional structure based solely on its one-dimensional amino acid sequence has stood as a grand challenge in the field of biology for over half a century. Referred to as the “protein folding problem,” it has stumped generations of scientists. One commentator in 2007 described the protein folding problem as “one of the most important yet unsolved issues of modern science.”
AlphaFold was not built using large language models. It relies on an older bioinformatics construct called multiple sequence alignment (MSA), in which a protein’s sequence is compared to evolutionarily similar proteins in order to deduce its structure.
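As a rough illustration of the kind of signal a multiple sequence alignment exposes, the sketch below computes a simple per-column conservation score from a toy alignment. The data in `TOY_MSA` and the function name `column_conservation` are made up for this example; AlphaFold's actual use of MSAs is far more elaborate and is not reproduced here.

```python
from collections import Counter

# A toy, already-aligned set of related sequences (hypothetical data).
# '-' marks a gap introduced by the alignment.
TOY_MSA = [
    "MKT-AYI",
    "MKTQAYI",
    "MRT-AYL",
    "MKTQSYI",
]


def column_conservation(msa: list[str]) -> list[float]:
    """Per column, the fraction of rows that share the most common residue.

    Gaps are ignored when picking the most common residue but still count
    in the denominator, so gappy columns score lower.
    """
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all aligned rows must have the same length")
    scores = []
    for column in zip(*msa):
        residues = [ch for ch in column if ch != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


if __name__ == "__main__":
    print(column_conservation(TOY_MSA))
    # [1.0, 0.75, 1.0, 0.5, 0.75, 1.0, 0.75]
```

Columns that stay the same across evolutionarily related proteins tend to matter for structure or function, which is part of why alignments carry structural information.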
The total set of proteins that exist in the human body, the so-called “human proteome,” is estimated to number somewhere between 80,000 and 400,000 proteins.
We will be able to design new protein therapeutics to address the full gamut of human illness—from cancer to autoimmune diseases, from diabetes to neurodegenerative disorders. Looking beyond medicine, we will be able to create new classes of proteins with transformative applications in agriculture, industrials, materials science, environmental remediation and beyond.
Protein sequence data can be tokenized and for all intents and purposes treated as textual data; after all, it consists of linear strings of amino acids in a certain order, like words in a sentence. Large language models can be trained solely on protein sequences to develop a nuanced understanding of protein structure and biology.
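A minimal sketch of what tokenizing a protein sequence could look like, assuming a character-level vocabulary with one token per amino acid. The special tokens (`<pad>`, `<bos>`, `<eos>`, `<unk>`) and their ids are hypothetical choices for illustration; real protein language models each define their own vocabularies.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AMINO_SET = set(AMINO_ACIDS)

# Hypothetical special tokens; they take ids 0-3, amino acids take ids 4-23.
SPECIAL_TOKENS = ["<pad>", "<bos>", "<eos>", "<unk>"]
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}
ID_TO_TOKEN = {i: tok for tok, i in VOCAB.items()}


def encode(sequence: str) -> list[int]:
    """Map a protein sequence to token ids, wrapped in <bos> ... <eos>."""
    ids = [VOCAB["<bos>"]]
    ids += [VOCAB.get(ch, VOCAB["<unk>"]) for ch in sequence.upper()]
    ids.append(VOCAB["<eos>"])
    return ids


def decode(ids: list[int]) -> str:
    """Map token ids back to a sequence, dropping all special tokens."""
    tokens = (ID_TO_TOKEN[i] for i in ids)
    return "".join(tok for tok in tokens if tok in AMINO_SET)


if __name__ == "__main__":
    print(encode("MKT"))                   # [1, 14, 12, 20, 2]
    print(decode(encode("MKTAYIAK")))      # MKTAYIAK
```

Once sequences are integer ids like this, they can be fed to the same kind of sequence model used for text, which is the sense in which protein data is "treated as textual data."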
“Today we can for all practical purposes read, write, and edit any sequence of DNA, but we cannot compose it. The code of life is a symphony, guiding intricate and beautiful parts performed by an untold number of players and instruments. Maybe we can cut and paste pieces from nature’s compositions, but we do not know how to write the bars for a single enzymic passage.”
These novel proteins will serve as therapeutics for a wide range of human illnesses, from infectious diseases to cancer; they will help make gene editing a reality; they will transform materials science; they will improve agricultural yields; they will neutralize pollutants in the environment; and so much more that we cannot yet even imagine.
Language models can be used to generate other classes of biomolecules, notably nucleic acids. A buzzy startup named Inceptive, for example, is applying LLMs to generate novel RNA therapeutics.
Other groups have even broader aspirations, aiming to build generalized “foundation models for biology” that can fuse diverse data types spanning genomics, protein sequences, cellular structures, epigenetic states, cell images, mass spectrometry, spatial transcriptomics and beyond.
The ultimate goal is to move beyond modeling an individual molecule like a protein to modeling proteins’ interactions with other molecules, then to modeling whole cells, then tissues, then organs—and eventually entire organisms.
The idea of building an artificial intelligence system that can understand and design every intricate detail of a complex biological system is mind-boggling. In time, this will be within our grasp.
“FrameDiff” is a computational tool that uses generative AI to craft new protein structures, with the aim of accelerating drug development and improving gene therapy.
At the heart of biology is DNA, the master weaver that encodes proteins, which are responsible for orchestrating the many biological functions that sustain life within the human body.
What if we had gene editing technology capable of automatically producing proteins to rectify DNA errors that cause cancer? The quest to identify proteins that can strongly bind to targets or speed up chemical reactions is vital for drug development, diagnostics, and numerous industrial applications, yet it is often a protracted and costly endeavor.
“FrameDiff” is a computational tool for creating new protein structures beyond what nature has produced. Its machine learning approach generates “frames” that align with the inherent properties of protein structures, enabling it to construct novel proteins independently of preexisting designs and to produce unprecedented protein structures.
This new capacity to generate synthetic protein structures opens up a myriad of enhanced capabilities, such as better binders: proteins engineered to attach to other molecules more efficiently and selectively. That has widespread implications for targeted drug delivery and for biotechnology, where it could result in the development of better biosensors. It could also have implications for the field of biomedicine and beyond, offering possibilities such as developing more efficient photosynthesis proteins, creating more effective antibodies, and engineering nanoparticles for gene therapy.
Proteins have complex structures, made up of many atoms connected by chemical bonds. The most important atoms that determine the protein’s 3D shape are called the “backbone,” kind of like the spine of the protein. Every triplet of atoms along the backbone shares the same pattern of bonds and atom types. Researchers noticed this pattern can be exploited to build machine learning algorithms using ideas from differential geometry and probability. This is where the frames come in: Mathematically, these triplets can be modeled as rigid bodies called “frames” (common in physics) that have a position and rotation in 3D.
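The frame idea can be sketched as follows, assuming the backbone triplet is the N, C-alpha and C atoms of one residue and using a Gram-Schmidt construction to turn their coordinates into an orthonormal basis (the rotation) plus an origin (the position). This is a common way to build such frames, not necessarily the exact convention used by FrameDiff; the helper names and the example coordinates are illustrative assumptions.

```python
import math

Vec3 = tuple[float, float, float]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def normalize(a: Vec3) -> Vec3:
    length = math.sqrt(dot(a, a))
    if length == 0.0:
        raise ValueError("cannot normalize a zero-length vector")
    return (a[0] / length, a[1] / length, a[2] / length)


def backbone_frame(n: Vec3, ca: Vec3, c: Vec3) -> tuple[tuple[Vec3, Vec3, Vec3], Vec3]:
    """Build a rigid frame (three orthonormal axes, origin) from N, CA, C.

    The first axis points from CA to C; the second is the part of CA->N
    perpendicular to the first; the third completes a right-handed basis.
    The origin of the frame is the CA position.
    """
    e1 = normalize(sub(c, ca))
    u = sub(n, ca)
    along = dot(u, e1)
    e2 = normalize((u[0] - along * e1[0], u[1] - along * e1[1], u[2] - along * e1[2]))
    e3 = cross(e1, e2)
    return (e1, e2, e3), ca


if __name__ == "__main__":
    # Approximate, idealized coordinates in angstroms (illustrative only).
    axes, origin = backbone_frame(
        n=(-0.525, 1.363, 0.0), ca=(0.0, 0.0, 0.0), c=(1.526, 0.0, 0.0)
    )
    print(axes, origin)
```

With one frame per residue, a protein backbone of length L becomes a list of L rotations and L positions, which is the object a frame-based generative model operates on.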
By learning to construct existing proteins, the algorithm will, the researchers hope, generalize and be able to create new proteins never seen before in nature.
In 2021, DeepMind introduced AlphaFold2, a deep learning algorithm for predicting 3D protein structures from their sequences. When creating synthetic proteins, there are two essential steps: generation and prediction. “Generation” means creating new protein structures and sequences, while “prediction” means figuring out the 3D structure of a given sequence. It’s no coincidence that AlphaFold2 also used frames to model proteins. SE(3) diffusion and FrameDiff take the idea of frames further by incorporating them into diffusion models, a generative AI technique that has become immensely popular in image generation tools such as Midjourney.
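For intuition about what a diffusion model trains against, here is a minimal sketch of the generic Gaussian forward (noising) step, applied only to the 3D positions of frames. This is the textbook step familiar from image diffusion, not FrameDiff's implementation; the rotational part of each frame needs its own treatment on the rotation group and is not shown. The function name `noise_positions` and the parameter `alpha_bar` (the fraction of signal kept) are assumptions for this example.

```python
import math
import random


def noise_positions(positions, alpha_bar, rng):
    """Apply x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps.

    `positions` is a list of (x, y, z) tuples, `alpha_bar` is in [0, 1],
    and eps is standard Gaussian noise drawn independently per coordinate.
    alpha_bar = 1 leaves the structure unchanged; alpha_bar = 0 gives pure noise.
    """
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must be between 0 and 1")
    keep = math.sqrt(alpha_bar)
    spread = math.sqrt(1.0 - alpha_bar)
    return [
        tuple(keep * x + spread * rng.gauss(0.0, 1.0) for x in point)
        for point in positions
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs draw the same noise
    backbone = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
    for alpha_bar in (1.0, 0.5, 0.0):
        print(alpha_bar, noise_positions(backbone, alpha_bar, rng))
```

A generative model is then trained to reverse this corruption step by step, so that starting from noise it arrives at a plausible structure.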
RosettaCommons/RFdiffusion: This new tool brought protein designers closer to solving crucial problems in biotechnology, including the development of highly specific protein binders for accelerated vaccine design, engineering of symmetric proteins for gene delivery, and robust motif scaffolding for precise enzyme design.
Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence, that is, the prediction of its secondary and tertiary structure from its primary structure.
Ribbon diagrams, also known as Richardson diagrams, are 3D schematic representations of protein structure and are one of the most common methods of protein depiction used today.
Gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes.
Pseudogene (database) is a database of pseudogene annotations compiled from various sources.
Protein structure is the three-dimensional arrangement of atoms in an amino acid-chain molecule.
De novo protein structure prediction refers to an algorithmic process by which protein tertiary structure is predicted from its amino acid primary sequence.
Protein–protein interaction prediction is a field combining bioinformatics and structural biology in an attempt to identify and catalog physical interactions between pairs or groups of proteins.
Protein function prediction methods are techniques that bioinformatics researchers use to assign biological or biochemical roles to proteins.
List of protein structure prediction software: tools for protein structure prediction, including homology modeling, protein threading, ab initio methods, secondary structure prediction, and transmembrane helix and signal peptide prediction.
Homology modeling, also known as comparative modeling of protein, refers to constructing an atomic-resolution model of the "target" protein from its amino acid sequence and an experimental three-dimensional structure of a related homologous protein (the "template").
Protein design is the rational design of new protein molecules to achieve novel activity, behavior, or purpose, and to advance basic understanding of protein function.
Non-coding RNA is a functional RNA molecule that is not translated into a protein.
Sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
Multiple sequence alignment (MSA) may refer to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA.
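The sequence alignment entries above can be made concrete with a small sketch of pairwise global alignment using the classic Needleman-Wunsch dynamic program. The scoring values (match +1, mismatch -1, gap -1) are arbitrary illustrative defaults; production tools use substitution matrices and more refined gap penalties, and multiple sequence alignment builds on this pairwise idea with considerably more machinery.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Globally align `a` and `b`; return (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("GATTACA", "GATACA")
    print(total)
    print(top)
    print(bottom)
```

When several alignments tie for the best score, the traceback above returns just one of them, so the exact placement of the gap can vary with the tie-breaking order.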