Proteins in the wrong part of a cell can contribute to several diseases, such as Alzheimer’s disease, cystic fibrosis, and cancer. But a single human cell contains around 70,000 different proteins and protein variants, and since scientists can typically only test for a handful in one experiment, manually identifying proteins’ locations is extremely costly and time-consuming.
A new generation of computational techniques seeks to streamline the process with machine-learning models that are often trained on datasets containing thousands of proteins and their locations, measured across multiple cell lines. One of the largest such datasets is the Human Protein Atlas, which catalogs the subcellular behavior of more than 13,000 proteins in more than 40 cell lines. But enormous as it is, the Human Protein Atlas has explored only about 0.25 percent of all possible pairings of proteins and cell lines in the database.
Now, researchers from MIT, Harvard University, and the Broad Institute of MIT and Harvard have developed a new computational approach that can efficiently explore the remaining uncharted space. Their method can predict the location of any protein in any human cell line, even when both the protein and the cell have never been tested before.
This technique goes a step further than many AI-based methods by localizing a protein at the single-cell level, rather than as an averaged estimate across all the cells of a particular type. Single-cell localization could, for example, pinpoint a protein’s location in a specific cancer cell after treatment.
The researchers combined a protein language model with a special type of computer vision model to capture rich details about a protein and a cell. In the end, the user receives an image of the cell with a highlighted portion indicating the model’s prediction of where the protein is located. Because a protein’s localization is indicative of its functional status, this technique could help researchers and clinicians more efficiently diagnose diseases and identify drug targets, while also helping biologists better understand how complex biological processes relate to protein localization.
“We hope that these protein localization experiments could be done on a computer without touching the lab bench, saving months of effort. We would still need to validate our predictions, but this technique could act like an initial screening of what to test experimentally,” says Yitong Tseo, an MIT graduate student and co-lead author of a paper on this research.
Tseo is joined on the paper by co-lead author Xinyi Zhang, a graduate student in the Department of Electrical Engineering and Computer Science (EECS) and the Eric and Wendy Schmidt Center at the Broad Institute; Yunhao Bai of the Broad Institute; and senior authors Fei Chen, an assistant professor at Harvard University and a member of the Broad Institute, and Caroline Uhler, the Andrew and Erna Viterbi Professor of Engineering in EECS and the MIT Institute for Data, Systems, and Society (IDSS), who is also director of the Eric and Wendy Schmidt Center. The research is published today in Nature Methods.
Collaborating models
Many existing protein prediction models can only make predictions based on the protein and cell data they were trained on, or they cannot pinpoint a protein’s location within a single cell.
To overcome these limitations, the researchers created a two-part method, called PUPS, for predicting the subcellular location of unseen proteins.
The first part uses a protein language model to capture a protein’s localization-determining properties and its 3D structure, based on the chain of amino acids that forms it.
The second part incorporates an image inpainting model, which is designed to fill in missing parts of an image. This computer vision model looks at three stained images of a cell to gather information about that cell’s state, such as its type, its individual features, and whether it is under stress.
PUPS joins the representations created by each model to predict where the protein is located within a single cell, using an image decoder to output a highlighted image that shows the predicted location.
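The two-branch design described above can be sketched in code. This is a minimal, hypothetical illustration of the idea, not the authors’ implementation: the module names, embedding size, and layer choices are all assumptions, with a simple embedding standing in for the protein language model and a single convolution standing in for the inpainting-style vision model.

```python
# Hypothetical sketch of a PUPS-like two-branch architecture.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

EMBED_DIM = 64  # assumed shared embedding size for both branches

class SequenceBranch(nn.Module):
    """Stands in for the protein language model: maps a tokenized
    amino-acid sequence to a fixed-size embedding."""
    def __init__(self, vocab_size=21, dim=EMBED_DIM):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, tokens):                    # tokens: (batch, length)
        return self.embed(tokens).mean(dim=1)     # (batch, dim)

class ImageBranch(nn.Module):
    """Stands in for the inpainting-style vision model: encodes the
    three stain channels of a cell image into an embedding."""
    def __init__(self, dim=EMBED_DIM):
        super().__init__()
        self.conv = nn.Conv2d(3, dim, kernel_size=3, padding=1)

    def forward(self, stains):                    # stains: (batch, 3, H, W)
        return self.conv(stains).mean(dim=(2, 3))  # (batch, dim)

class LocalizationDecoder(nn.Module):
    """Fuses both embeddings and decodes a per-pixel map highlighting
    the predicted protein location."""
    def __init__(self, dim=EMBED_DIM, out_hw=32):
        super().__init__()
        self.out_hw = out_hw
        self.fc = nn.Linear(2 * dim, out_hw * out_hw)

    def forward(self, seq_emb, img_emb):
        fused = torch.cat([seq_emb, img_emb], dim=1)
        logits = self.fc(fused).view(-1, 1, self.out_hw, self.out_hw)
        return torch.sigmoid(logits)              # values in [0, 1]

# Forward pass on dummy data: one cell image, one 50-residue protein.
seq = torch.randint(0, 21, (1, 50))
stains = torch.rand(1, 3, 32, 32)
pred = LocalizationDecoder()(SequenceBranch()(seq), ImageBranch()(stains))
print(pred.shape)  # torch.Size([1, 1, 32, 32])
```

The key design point the sketch captures is that both inputs are compressed into a shared embedding space before the decoder produces the highlighted image.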
“Different cells within a cell line exhibit different properties, and our model is able to understand that nuance,” says Tseo.
The user inputs the sequence of amino acids that forms the protein, along with three cell stain images: one for the nucleus, one for the microtubules, and one for the endoplasmic reticulum. Then PUPS does the rest.
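As a rough illustration of what those two inputs look like in practice, here is a hedged sketch of the preprocessing a model like this might use. The one-hot encoding scheme and the image shapes are assumptions made for the example, not a description of the published pipeline.

```python
# Illustrative preprocessing for the two inputs described above:
# an amino-acid sequence and three stain images. The encoding and
# shapes are assumptions, not the authors' actual pipeline.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode_sequence(seq: str) -> np.ndarray:
    """One-hot encode an amino-acid sequence as (length, 20)."""
    onehot = np.zeros((len(seq), len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(seq):
        onehot[pos, AA_INDEX[aa]] = 1.0
    return onehot

def stack_stains(nucleus, microtubules, er):
    """Stack the nucleus, microtubule, and endoplasmic reticulum
    stains into a single (3, H, W) array."""
    return np.stack([nucleus, microtubules, er], axis=0)

protein = encode_sequence("MKTAYIAKQR")  # toy 10-residue sequence
stains = stack_stains(*(np.random.rand(64, 64) for _ in range(3)))
print(protein.shape, stains.shape)  # (10, 20) (3, 64, 64)
```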
Deeper understanding
The researchers employed several tricks during the training process to teach PUPS how to combine information from each model so it can make an educated guess about a protein’s location, even if it has never seen that protein before.
For example, they assigned the model a secondary task during training: explicitly naming the compartment of localization, like the cell nucleus. This is done alongside the primary inpainting task to help the model learn more effectively.
A good analogy might be a teacher who asks their students to draw all the parts of a flower in addition to writing their names. This extra step was found to improve the model’s general understanding of the possible cell compartments.
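The secondary task described above amounts to multi-task training: the total loss combines the primary inpainting (reconstruction) objective with an auxiliary compartment-naming (classification) objective. The sketch below is a minimal assumption-laden illustration; the loss functions, weighting, and compartment list are invented for the example.

```python
# Minimal sketch of the multi-task training objective: primary
# inpainting loss plus an auxiliary compartment-classification loss.
# The losses, weight, and compartment list are illustrative assumptions.
import torch
import torch.nn.functional as F

COMPARTMENTS = ["nucleus", "cytosol", "mitochondria", "er"]  # assumed labels
AUX_WEIGHT = 0.1  # assumed weighting of the secondary task

def training_loss(pred_image, true_image, compartment_logits, compartment_idx):
    inpaint_loss = F.mse_loss(pred_image, true_image)      # primary task
    naming_loss = F.cross_entropy(compartment_logits,      # secondary task
                                  compartment_idx)
    return inpaint_loss + AUX_WEIGHT * naming_loss

# Dummy batch of two cells.
pred = torch.rand(2, 1, 32, 32)
target = torch.rand(2, 1, 32, 32)
logits = torch.randn(2, len(COMPARTMENTS))
labels = torch.tensor([0, 3])  # "nucleus" and "er"
loss = training_loss(pred, target, logits, labels)
print(loss.dim())  # 0 (a single scalar to backpropagate through)
```

The auxiliary term gives the model a gradient signal about compartment identity even when the reconstruction alone is ambiguous, which is the point of the flower-drawing analogy.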
In addition, the fact that PUPS is trained on proteins and cell lines at the same time helps it develop a deeper understanding of where proteins tend to localize in a cell image.
PUPS can even understand, on its own, how different parts of a protein’s sequence contribute separately to its overall localization.
“Most other methods usually require you to have a stain of the protein first, so you’ve already seen it in your training data. Our approach is unique in that it can generalize across proteins and cell lines at the same time,” says Zhang.
Because PUPS can generalize to unseen proteins, it can capture changes in localization driven by unique protein mutations that are not included in the Human Protein Atlas.
The researchers verified that PUPS could predict the subcellular location of new proteins in unseen cell lines by conducting lab experiments and comparing the results. In addition, when compared to a baseline AI method, PUPS exhibited lower prediction error across the proteins they tested.
In the future, the researchers want to enhance PUPS so the model can understand protein-protein interactions and make localization predictions for multiple proteins within a cell. In the longer term, they want to enable PUPS to make predictions in terms of living human tissue, rather than cultured cells.
The research is funded by the Eric and Wendy Schmidt Center at the Broad Institute, the National Institutes of Health, the National Science Foundation, the Burroughs Wellcome Fund, the Searle Scholars Foundation, the Harvard Stem Cell Institute, the Merkin Institute, the Office of Naval Research, and the Department of Energy.