Published September 19, 2023
Authors: Žiga Avsec and Jun Cheng
New AI tool classifies impact of 71 million ‘missense’ mutations
Unraveling the root causes of disease is one of the greatest challenges in human genetics. With millions of possible mutations and limited experimental data, which mutations can cause disease remains largely a mystery. This knowledge is essential for faster diagnosis and the development of life-saving treatments.
Today, we are publishing a catalog of ‘missense’ mutations to help researchers learn more about what impact they may have. Missense mutations are genetic mutations that can affect the function of human proteins. In some cases, they can lead to diseases such as cystic fibrosis, sickle cell anemia, or cancer.
The AlphaMissense catalog was developed using AlphaMissense, our new AI model that classifies missense variants. In a paper published in the journal Science, we show that it categorized 89% of all 71 million possible missense variants as either likely pathogenic or likely benign. By contrast, only 0.1% have been confirmed by human experts.
AI tools that can accurately predict the impact of variants have the power to accelerate research across fields from molecular biology to clinical genetics to statistical genetics. Experiments to uncover disease-causing mutations are expensive and labor-intensive. Every protein is unique and each experiment must be designed individually, which can take months. Using AI predictions, researchers can get previews of results for thousands of proteins at once, helping them prioritize resources and accelerate more complex studies.
We have made all our predictions freely available for commercial and research use and have open sourced the AlphaMissense model code.
AlphaMissense predicted the pathogenicity of all 71 million possible missense variants. It classified 89% of them, predicting 57% to be likely benign and 32% to be likely pathogenic.
What is a missense variant?
A missense variant is a single letter substitution in DNA that results in a different amino acid in the protein. If you think of DNA as a language, changing just one letter can change the word and completely change the meaning of the sentence. In this case, the substitution changes the amino acid that is translated, which can affect the function of the protein.
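As a rough illustration of the definition above, the following Python sketch builds the standard genetic code and labels a single-letter change within one codon as synonymous, missense, nonsense, or stop-lost. The function name `classify_substitution` is hypothetical and chosen only for this example; the codon used in the usage line is the well-known GAG to GTG change (glutamate to valine) associated with sickle cell anemia.

```python
from itertools import product

# Standard genetic code, with codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon, position, new_base):
    """Classify a single-base change at `position` (0, 1 or 2) of a DNA codon.

    Hypothetical helper for illustration only. Returns a tuple of
    (mutated codon, original amino acid, new amino acid, variant type).
    """
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        kind = "synonymous"   # same amino acid, protein sequence unchanged
    elif after == "*":
        kind = "nonsense"     # the change introduces a stop codon
    elif before == "*":
        kind = "stop-lost"    # a stop codon becomes an amino acid
    else:
        kind = "missense"     # a different amino acid is incorporated
    return mutated, before, after, kind


print(classify_substitution("GAG", 1, "T"))  # ('GTG', 'E', 'V', 'missense')
```

Only the "missense" case is what AlphaMissense scores; the other categories are included so the contrast is visible.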
The average person has more than 9,000 missense variants. Most are benign and have little effect, while others are pathogenic and can significantly disrupt protein function. Missense variants can be used to diagnose rare genetic diseases where a small number or a single missense variant can directly cause disease. They are also important for studying complex diseases like type 2 diabetes, which are caused by a combination of many different types of genetic changes.
Classifying missense variants is an important step in understanding which of these protein changes can cause disease. Of the more than 4 million missense variants already observed in humans, only 2% have been annotated by experts as pathogenic or benign, roughly 0.1% of all 71 million possible missense variants. The remainder are considered “variants of unknown significance” due to a lack of experimental or clinical data on their effects. With AlphaMissense, we now have the clearest picture to date, classifying 89% of variants using a threshold that yielded 90% precision on a database of known disease variants.
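The percentages quoted in this article can be checked with a few lines of arithmetic. The sketch below uses only the figures stated in the text and treats "more than 4 million" as exactly 4 million, so the results are approximate.

```python
# Figures quoted in the text; "more than 4 million" is rounded down to 4 million here.
observed_missense = 4_000_000
expert_annotated_percent = 2
all_possible_missense = 71_000_000

# 2% of the variants already observed in humans carry an expert annotation.
expert_annotated = observed_missense * expert_annotated_percent // 100
annotated_share = expert_annotated / all_possible_missense

# AlphaMissense classifications reported for all possible variants.
likely_benign_percent = 57
likely_pathogenic_percent = 32
classified_percent = likely_benign_percent + likely_pathogenic_percent
unclassified_percent = 100 - classified_percent

print(f"Expert-annotated variants: about {expert_annotated:,}")
print(f"Share of all possible variants: {annotated_share:.2%}")
print(f"Classified: {classified_percent}%, left uncertain: {unclassified_percent}%")
```

Roughly 80,000 annotated variants out of 71 million is where the "approximately 0.1%" figure comes from.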
Pathogenic or Benign: How AlphaMissense Classifies Variants
AlphaMissense is based on AlphaFold, our breakthrough model that predicted the structures of nearly every protein known to science from their amino acid sequences. Our adapted model can predict the pathogenicity of missense variants that alter individual amino acids in a protein.
To train AlphaMissense, we fine-tuned AlphaFold on labels that distinguish variants seen in human and closely related primate populations. Variants that are commonly seen are treated as benign, and variants that are never seen are treated as pathogenic. AlphaMissense does not predict the change in protein structure upon mutation or other effects on protein stability. Instead, it uses databases of related protein sequences and the structural context of the variant to produce a score between 0 and 1 that approximately rates the likelihood that the variant is pathogenic. The continuous score lets users choose a threshold for classifying variants as pathogenic or benign that meets their accuracy requirements.
Diagram showing how AlphaMissense classifies human missense variants. When a missense variant is entered, the AI system scores it as likely pathogenic or likely benign. AlphaMissense combines structural context and protein language modeling and is fine-tuned on human and primate variant population frequency databases.
AlphaMissense delivers state-of-the-art predictions across a wide range of genetic and experimental benchmarks without explicitly training on such data. Our tool outperformed other computational methods when used to classify variants from ClinVar, a public archive of data on the relationship between human variants and disease. Our model was also the most accurate method for predicting results from laboratory experiments, which shows that it is consistent with different ways of measuring pathogenicity.
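To make the idea of choosing a threshold on a continuous score concrete, here is a minimal Python sketch. It is not the AlphaMissense calibration procedure: the function names, the toy scores and labels, and the cutoffs passed to `classify` are all hypothetical and invented for illustration. The first part picks the lowest cutoff that reaches a target precision on a small labeled set; the second part turns a score into one of three classes using two user-chosen cutoffs.

```python
def precision_at_threshold(scored, threshold):
    """Precision when every variant with score >= threshold is called pathogenic.

    `scored` is a list of (score, is_pathogenic) pairs. Returns None if no
    variant reaches the threshold.
    """
    called = [is_pathogenic for score, is_pathogenic in scored if score >= threshold]
    if not called:
        return None
    return sum(called) / len(called)


def lowest_threshold_for_precision(scored, target):
    """Return the smallest observed score whose cutoff reaches the target precision."""
    for threshold in sorted({score for score, _ in scored}):
        precision = precision_at_threshold(scored, threshold)
        if precision is not None and precision >= target:
            return threshold
    return None


def classify(score, benign_below, pathogenic_above):
    """Map a 0-1 score to a class; scores between the two cutoffs stay uncertain."""
    if score < benign_below:
        return "likely benign"
    if score > pathogenic_above:
        return "likely pathogenic"
    return "uncertain"


# Toy labeled data (made up): (score, known to be pathogenic?)
labeled = [
    (0.05, False), (0.20, False), (0.40, True), (0.55, False),
    (0.70, True), (0.85, True), (0.95, True),
]

cutoff = lowest_threshold_for_precision(labeled, target=0.9)
print("Cutoff reaching 90% precision on the toy data:", cutoff)
print(classify(0.9, benign_below=0.3, pathogenic_above=0.7))
```

Precision does not always rise smoothly as the cutoff increases, so taking the lowest qualifying cutoff is only one reasonable choice; a stricter user could pick a higher one and accept fewer classified variants.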
AlphaMissense outperforms other computational methods in predicting missense variant effects.
Left: Comparison of the performance of AlphaMissense and other methods in classifying variants from the ClinVar public archive. The methods shown in gray were trained directly on ClinVar, and some of the training variants are included in this test set, so their performance on this benchmark may be overestimated.
Right: Graph comparing the performance of AlphaMissense and other methods in predicting measurements from biological experiments.
Building community resources
AlphaMissense builds on AlphaFold to advance the world’s understanding of proteins. A year ago, we released 200 million protein structures predicted using AlphaFold, which are helping millions of scientists around the world accelerate their research and pave the way for new discoveries. We look forward to seeing how AlphaMissense helps solve open questions at the heart of genomics and the biological sciences as a whole.
We have made AlphaMissense’s predictions freely available to both the commercial and scientific communities. We are working with EMBL-EBI to make them even easier to use through the Ensembl Variant Effect Predictor.
In addition to a lookup table of missense mutations, we have shared expanded predictions for all 216 million possible single amino acid substitutions across more than 19,000 human proteins. We have also included the average prediction for each gene, which is similar to measuring a gene’s evolutionary constraint: it indicates how essential that gene is to the survival of the organism.
Examples of AlphaMissense predictions overlaid on AlphaFold predicted structures (red = predicted to be pathogenic, blue = predicted to be benign, gray = uncertain). Red dots represent known pathogenic missense variants and blue dots represent known benign variants from the ClinVar database.
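The per-gene average mentioned above can be pictured as a simple group-and-mean over variant-level scores. The Python sketch below assumes a hypothetical in-memory table of (gene, variant, score) rows with made-up gene names, variant labels, and scores; it does not reflect the actual file format of the released predictions.

```python
from collections import defaultdict
from statistics import mean


def mean_score_per_gene(rows):
    """Average the variant-level scores of each gene.

    `rows` is an iterable of (gene, variant, score) tuples; this layout is an
    assumption made for the example.
    """
    scores_by_gene = defaultdict(list)
    for gene, _variant, score in rows:
        scores_by_gene[gene].append(score)
    return {gene: mean(scores) for gene, scores in scores_by_gene.items()}


# Made-up rows for two hypothetical genes.
rows = [
    ("GENE_A", "A10V", 0.75),
    ("GENE_A", "G25R", 0.875),
    ("GENE_A", "L40P", 0.625),
    ("GENE_B", "S5T", 0.125),
    ("GENE_B", "K12R", 0.25),
]

for gene, average in sorted(mean_score_per_gene(rows).items()):
    print(f"{gene}: mean score {average:.3f}")
```

In this toy table, a gene whose substitutions mostly score high would be read as more constrained than one whose substitutions mostly score low.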
Left: HBB protein. Variants of this protein can cause sickle cell anemia.
Right: CFTR protein. Variants of this protein can cause cystic fibrosis.
Accelerating genetic disease research
An important step in translating this research is to collaborate with the scientific community. We have been working with Genomics England to explore how these predictions can help research into the genetics of rare diseases. Genomics England cross-referenced AlphaMissense’s findings with variant pathogenicity data previously compiled from human participants. Their evaluation confirmed that our predictions were accurate and consistent, providing AlphaMissense with another real-world benchmark.
Although our predictions are not designed for direct use in the clinic and must be interpreted in conjunction with other sources of evidence, this work has the potential to improve the diagnosis of rare genetic diseases and to help discover new disease-causing genes.
Ultimately, the hope is that AlphaMissense, along with other tools, will help researchers better understand diseases and develop new life-saving treatments.
Learn more about AlphaMissense here.
Notes
*As of March 13, 2024, AlphaMissense predictions are available under the CC BY v.4 license, which lifts the previous non-commercial use restriction. Please refer to the public database and Zenodo for detailed access information.
We would like to thank Juanita Bawagan, Jess Valdez, Katie McAtackney, Kathryn Seager, and Hollie Dobson for their help with the text and figures. We would also like to thank our external partners Genomics England and EMBL-EBI for their continued support. This research was made possible thanks to the contributions of our co-authors: Guido Novati, Joshua Pan, Clare Bycroft, Akvilė Žemgulytė, Taylor Applebaum, Alexander Pritzel, Lai Hong Wong, Michal Zielinski, Tobias Sargeant, Rosalia G. Schneider, Andrew W. Senior, John Jumper, Demis Hassabis, and Pushmeet Kohli. We would also like to thank Kathryn Tunyasuvunakool, Rob Fergus, Eliseo Papa, David La, Zachary Wu, Sara-Jane Dunn, Kyle R. Taylor, Natasha Latysheva, Hamish Tomlinson, Augustin Žídek, Roz Onions, Mira Lutfi, John Small, Molly Beck, Annette Obika, Hannah Gladman, Folake Abu, Alyssa Pierce, James Tam, Q Green, Meera Last, Tharindi Hapuarachchi, and the greater Google DeepMind team for their support, assistance, and feedback.