A lot can change in four years.
In January 2021, Advanced Research Computing, a division within the Virginia Tech Division of Information Technology, introduced Infer, the first graphics processing unit (GPU) cluster capable of processing critical artificial intelligence (AI) inference tasks.
Through AI inference, scientists use machine learning models trained on existing data to infer something about new datasets. This method can also be used to predict future events or to interpret things that were previously incomprehensible, such as what a cow is “saying” when it moos.
These capabilities were notable at the time, but over the past four years, research leveraging AI, not to mention vast datasets and large language models, has expanded rapidly at Virginia Tech and now requires significantly more GPU power than Infer provides.
“The explosive growth in data generation and, more recently, the development of AI are driving new levels of synthesis and analysis in every field,” said Alberto Cano, vice president of research computing.
Cano oversees Advanced Research Computing, which provides high-performance computing resources and expertise to Virginia Tech’s research community. “While Infer provided a capable resource for processing moderate datasets for AI and machine learning calculations, GPUs capable of processing large language models were not readily available in 2021.”
Falcon offers power, speed and flexibility
Falcon is Virginia Tech’s newest GPU cluster. It went live in December and is available to faculty and students through Advanced Research Computing.
Falcon consists of 52 nodes with four GPUs each, for a total of 208 GPUs. To give researchers maximum flexibility in the types of work they can perform on Falcon, the cluster contains two types of GPUs.
Thirty-two nodes are equipped with NVIDIA A30 GPUs with 24 gigabytes of memory per GPU. These GPUs support double-precision floating point (FP64) calculations, which provide highly precise output with minimal rounding errors and are used in scientific research applications such as fluid mechanics and product design. The other 20 nodes are equipped with NVIDIA L40S GPUs with 48 gigabytes of memory per GPU. Large language models (LLMs) require this larger amount of memory.
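The node and memory figures above can be checked with a few lines of arithmetic. The Python sketch below is only an illustration: the variable and function names are made up for this example, and the aggregate memory totals are derived from the per-GPU figures quoted above rather than taken from numbers published by Advanced Research Computing.

```python
# Illustrative arithmetic only: names are hypothetical, figures come from the article.
GPUS_PER_NODE = 4

# Node count and per-GPU memory (gigabytes) for each GPU type described above.
partitions = {
    "A30": {"nodes": 32, "gb_per_gpu": 24},
    "L40S": {"nodes": 20, "gb_per_gpu": 48},
}


def summarize(partitions, gpus_per_node=GPUS_PER_NODE):
    """Return total nodes, total GPUs, and aggregate GPU memory (GB) per GPU type."""
    total_nodes = sum(p["nodes"] for p in partitions.values())
    total_gpus = total_nodes * gpus_per_node
    memory_gb = {
        name: p["nodes"] * gpus_per_node * p["gb_per_gpu"]
        for name, p in partitions.items()
    }
    return total_nodes, total_gpus, memory_gb


total_nodes, total_gpus, memory_gb = summarize(partitions)
print(total_nodes, total_gpus, memory_gb)
```

Under these figures, the totals come to 52 nodes and 208 GPUs, matching the counts given above.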
Falcon’s interconnect fabric, the technology that allows data to move within and between GPUs, is also a significant upgrade, offering higher speed and lower latency. This is Advanced Research Computing’s first cluster to use next data rate (NDR) InfiniBand, a network communication technology characterized by very high data transfer rates and very low latency. Each Falcon node is equipped with a double data rate connector, delivering speeds of 200 Gbit/s per node. This is up to 20 times faster than what was possible with Infer and can significantly reduce loading or transfer times for large datasets.
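To put the 200 Gbit/s figure in perspective, the sketch below estimates an idealized transfer time for a dataset. It is a back-of-the-envelope calculation under stated assumptions: the 1-terabyte dataset size is a hypothetical example, the slower comparison rate is simply 200 Gbit/s divided by the "up to 20 times" figure above, and real transfers take longer because of protocol overhead and storage limits.

```python
# Idealized transfer-time estimate; ignores protocol overhead and storage speed.

def transfer_seconds(size_bytes, link_gbit_per_s):
    """Seconds needed to move size_bytes over a link of the given rate (Gbit/s)."""
    bits = size_bytes * 8
    return bits / (link_gbit_per_s * 1e9)


DATASET_BYTES = 1e12                  # hypothetical 1 TB dataset (decimal terabyte)
FALCON_GBIT = 200.0                   # per-node rate quoted in the article
BASELINE_GBIT = FALCON_GBIT / 20      # assumption: derived from "up to 20 times faster"

falcon_s = transfer_seconds(DATASET_BYTES, FALCON_GBIT)
baseline_s = transfer_seconds(DATASET_BYTES, BASELINE_GBIT)

print(f"Falcon link:   {falcon_s:.0f} s")
print(f"Baseline link: {baseline_s:.0f} s")
```

In this idealized case the same dataset moves in well under a minute at the Falcon rate, compared with more than ten minutes at the slower rate.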
“Maintaining a state-of-the-art HPC (high-performance computing) infrastructure is essential to advancing both the university’s mission and the broader scientific community. With the introduction of NVIDIA A30 and L40S GPUs, we are not just upgrading our systems. This new generation of GPUs is unlocking unprecedented opportunities for AI, data science, and complex simulation research. This enables our researchers to tackle larger, more complex problems, accelerate discovery, and take innovation to new heights,” said Cano.
All of this translates into a powerful and versatile GPU cluster suitable to meet the current and future needs of Virginia Tech research teams.
After installation this summer, Advanced Research Computing collaborated with several research teams that served as early testers for Falcon. Several of these researchers are computer science graduate students in the Virginia Tech Learning on Graphs Lab (VLOG), directed by Dawei Zhou, assistant professor of computer science.
“Falcon’s computational power is essential to my research on quantifying uncertainty in large language models, enabling efficient experimentation and analysis. As part of the VLOG Lab, I am leveraging Falcon to strengthen the reliability of AI and promote reliable methods for LLMs and graph neural networks,” said Tuo Wang.
“Falcon will greatly enhance my research on protein disorder by enabling high-performance computational simulations and large-scale analysis of protein disordered regions and interactions, and it will facilitate deeper insight into the dynamic behavior of intrinsically disordered proteins,” said Xinyue “Susan” Zeng.
Sina Mostafanejad, a software scientist in the Department of Chemistry and the Molecular Sciences Software Institute, was also an early test user of Falcon. “The new NVIDIA L40S and A30 GPUs in the Falcon cluster will help power research in training and testing large language models for chemistry and accelerate new scientific discoveries in computational molecular science,” said Mostafanejad.
For more information about Falcon and other high-performance computing resources, visit arc.vt.edu.
Written by Kit Hayes