Mass spectrometry, a powerful tool for studying small molecules, has long helped scientists unlock secrets hidden in plants, microorganisms, and even human tissues. But for all its strength, the method has a serious limitation: its output is difficult to interpret. Each time a mass spectrometer analyzes a sample, it produces a complex fingerprint made of peaks and numbers. These patterns are called mass spectra. As datasets have grown, understanding what each spectrum means has remained a major challenge.
That challenge is now being met with artificial intelligence.
New ways to read molecular fingerprints
A team of scientists led by Dr. Tomáš Pluskal of IOCB Prague, together with Roman Bushuyev and collaborators from the Czech Institute of Informatics, Robotics and Cybernetics (CIIRC) at the Czech Technical University, created a new AI model called DreaMS. The system can reveal the structure of molecules from raw spectral data faster and more accurately than previous methods. Their work, published in Nature Biotechnology, marks significant progress in deciphering the hidden language of natural chemistry.
DreaMS was trained using a method known as self-supervised learning. The model studied more than 700 million raw mass spectra from the GNPS repository, which contains data collected from environmental and biological samples around the world. Without being told what any particular spectrum means, the model learned to find patterns, similarities, and hidden features in the data.
Dr. Josef Šivic, one of the researchers, compares this process with how language models like ChatGPT learn to understand text. “ChatGPT can infer the meaning of words and the connections between them from a large amount of text,” he says. “DreaMS learns to recognize which molecular structures are hidden within a spectrum, based on data from millions of examples.”
The unknown challenges of chemistry
Despite decades of research, scientists estimate that less than 10% of naturally occurring small molecules have been discovered. In other words, most of the world’s chemical diversity remains unexplored. These unknown molecules could be key to breakthroughs in medicine, environmental safety, and even understanding life across the globe.
The main problem is not collecting data but analyzing it. When a mass spectrometer runs, it generates two types of data. MS1 gives a broad overview of the molecules present, while MS2 zooms in on the fragments of a particular molecule.
These MS2 spectra hold the real clues to molecular identity, but only about 2% can be matched to known structures using reference libraries. Even advanced machine learning tools cannot confidently annotate more than 10% of the spectra.
Previous tools relied heavily on limited spectral libraries or manual interpretation by experts. For example, the well-known software SIRIUS uses a complex pipeline that includes combinatorics, optimization, and support vector machines to infer molecular fingerprints. It works well, but it still relies on hand-crafted rules and curated data, which slows things down and limits its reach.
In contrast, DreaMS skips most of these steps. It learns directly from raw data without the need for human-designed shortcuts or annotated training sets. During training, it predicts masked peaks in a spectrum and estimates when a particular chemical will appear during chromatography. Through this process, it constructs a 1,024-dimensional mathematical representation of each spectrum that captures detailed information about the molecular structure.
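To make the idea of library matching concrete, here is a minimal sketch of one common approach: bin each spectrum into a fixed-length vector and compare it with reference entries by cosine similarity. The bin width, the 0.7 threshold, and the compound names and peak lists are illustrative assumptions, not values from the study or from any real library.

```python
import numpy as np

def bin_spectrum(peaks, max_mz=1000.0, bin_width=1.0):
    """Turn a list of (m/z, intensity) peaks into a unit-length binned vector."""
    vec = np.zeros(int(max_mz / bin_width))
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < len(vec):
            vec[idx] += intensity
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def best_library_match(query, library, threshold=0.7):
    """Return (name, score) of the best reference, or (None, score) if below threshold."""
    q = bin_spectrum(query)
    best_name, best_score = None, 0.0
    for name, peaks in library.items():
        score = float(q @ bin_spectrum(peaks))  # cosine similarity of unit vectors
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return None, best_score

# Hypothetical reference library with made-up peak lists.
library = {
    "compound_A": [(110.07, 40.0), (138.07, 100.0), (195.09, 60.0)],
    "compound_B": [(77.04, 100.0), (105.03, 80.0)],
}

known_query = [(110.06, 45.0), (138.05, 90.0), (195.08, 55.0)]
unknown_query = [(250.20, 50.0), (300.10, 100.0)]

print(best_library_match(known_query, library))
print(best_library_match(unknown_query, library))
```

A query whose peaks fall in none of the reference bins scores zero against every entry and stays unannotated, which is the situation most measured MS2 spectra are in.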
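The masked-peak objective mentioned above can be illustrated with a small data-preparation sketch: hide a random subset of peaks and keep their true values as prediction targets. This is only an illustration under stated assumptions, not the authors' implementation; the 30% mask fraction, the `-1.0` sentinel, and the function name `mask_peaks` are hypothetical choices.

```python
import numpy as np

def mask_peaks(mz, mask_fraction=0.3, seed=0):
    """Hide a random subset of m/z values; return masked input, indices, and targets."""
    rng = np.random.default_rng(seed)
    mz = np.asarray(mz, dtype=float)
    n_masked = max(1, int(round(mask_fraction * len(mz))))
    masked_idx = np.sort(rng.choice(len(mz), size=n_masked, replace=False))
    visible = mz.copy()
    visible[masked_idx] = -1.0   # hypothetical sentinel meaning "hidden from the model"
    targets = mz[masked_idx]     # what a model would be trained to recover
    return visible, masked_idx, targets

# Made-up m/z values for a single spectrum with ten peaks.
example_mz = [55.1, 69.0, 81.2, 95.3, 109.1, 123.4, 137.0, 151.2, 165.3, 179.1]
visible, masked_idx, targets = mask_peaks(example_mz)
print(visible)
print(masked_idx, targets)
```

Because the targets come from the spectrum itself, no human annotation is needed, which is what allows training on hundreds of millions of unlabeled spectra.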
A growing map of the chemical universe
One of the most memorable results of this project is the DreaMS Atlas. This large, interconnected network links more than 200 million mass spectra. Each spectrum is like a page in a vast web. Just as websites are connected via hyperlinks, the spectra in the DreaMS Atlas are connected based on chemical similarity.
Dr. Pluskal explains that the network helps scientists explore links they had never noticed before. For example, DreaMS found an unexpected connection between pesticides, food, and human skin. It even led researchers to wonder whether certain pesticides could cause autoimmune conditions like psoriasis. Insights of this kind were nearly impossible to find before.
The model is not just theoretical. It already supports real-world tasks. It can predict which chemical elements are present in a molecule, the number of fragments it has, and even whether it contains a particular atom, such as fluorine. This last task was particularly surprising.
“Fluorine is present in about a third of all drugs and pesticides, but previously we were unable to reliably detect it in the mass spectrum,” says Roman Bushuyev. After DreaMS was trained on millions of spectra and fine-tuned with thousands of fluorine-containing samples, the model learned to correctly identify fluorine.
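A similarity network of this kind can be sketched as a nearest-neighbor graph over spectrum embeddings. The example below is a minimal illustration, assuming each spectrum is already represented as a vector and that similarity is measured by cosine similarity; the tiny 2-dimensional vectors and the choice of `k` are invented for the example and do not reflect how the published atlas was built.

```python
import numpy as np

def knn_similarity_graph(embeddings, k=3):
    """Link each spectrum to its k most similar spectra by cosine similarity."""
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)  # unit-normalize each row
    sim = x @ x.T                                     # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)                    # a spectrum is not its own neighbor
    neighbors = np.argsort(-sim, axis=1)[:, :k]
    return {
        i: [(int(j), float(sim[i, j])) for j in neighbors[i]]
        for i in range(len(x))
    }

# Four made-up embeddings forming two obvious pairs.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
graph = knn_similarity_graph(toy_embeddings, k=1)
for node, edges in graph.items():
    print(node, edges)
```

Following edges from one node to the next is what lets a researcher move from a spectrum measured in one kind of sample to chemically similar spectra from entirely different sources.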
The foundation of future discoveries
DreaMS represents a turning point in the use of machine learning for chemistry. Instead of relying on small datasets and slow, rule-based tools, researchers now have a foundation model that can adapt to many different tasks. It works across a variety of data and experimental conditions, making it flexible enough for use in areas such as drug development, environmental science, and even searching for life across the globe.
What makes DreaMS particularly exciting is the possibility of going even further. Researchers are currently working on the next step: teaching the model to predict complete molecular structures. If successful, it could speed up the discovery of new chemicals and allow scientists to navigate the unknown parts of the chemical world much more accurately.
This work also demonstrates the power of self-supervised learning in science. By training models on patterns in raw data without human labels, researchers can uncover hidden relationships and insights that were previously out of reach.
As Dr. Pluskal points out, “The model was trained on tens of millions of spectra from diverse organisms and environments, including plants, microorganisms, food, tissues, and soil samples. This allows us to reveal hidden similarities between spectra at first glance.”
For scientists looking to better understand the components of life, DreaMS offers a new path forward, one built on deep data and smarter machines rather than speculation.