The Aurora supercomputer achieves exascale

The achievement of exascale by the Aurora supercomputer at Argonne National Laboratory marks a significant milestone in the field of high-performance computing.

The Aurora supercomputer, installed in June 2023, is engineered to address some of the world’s most complicated scientific challenges. Aurora is currently the second-fastest supercomputer globally.

With its recent achievement of exascale performance, Aurora unlocks higher levels of accuracy, speed, and power compared to previous generations of supercomputers. This advancement will significantly enhance scientific research in areas such as climate modelling, cancer research and green energy.

To learn more about the Aurora supercomputer, its capabilities, and potential, The Innovation Platform spoke with Mike Papka, Director of the Argonne Leadership Computing Facility and Deputy Associate Laboratory Director of Computing, Environment, and Life Sciences at Argonne National Laboratory, as well as Professor of Computer Science at the University of Illinois Chicago.

Why is Aurora’s achievement of exascale computing a significant milestone?

Aurora’s achievement of exascale computing is a significant milestone because it marks the ability to perform over a quintillion calculations per second, which is a tremendous leap in computational power. This power enables Aurora to handle diverse scientific tasks, from traditional modelling and simulation to data-intensive workflows and AI/ML applications, all within a single, unified system. Aurora’s architecture, combining powerful CPUs and GPUs, tackles complex problems such as climate modelling, materials discovery, and energy research.
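To put a quintillion calculations per second in perspective, the minimal sketch below compares idealised runtimes at the exascale and petascale thresholds. The workload size is a hypothetical figure chosen for illustration, and the calculation assumes each machine sustains its full rate throughout, which real applications rarely do.

```python
# Illustrative arithmetic only; the workload size below is a made-up example.
EXA_OPS_PER_SECOND = 10**18   # one quintillion operations per second (exascale threshold)
PETA_OPS_PER_SECOND = 10**15  # one quadrillion operations per second (petascale threshold)


def runtime_seconds(total_operations: int, ops_per_second: int) -> float:
    """Idealised runtime: assumes the machine sustains its full rate the whole time."""
    return total_operations / ops_per_second


# Hypothetical simulation needing 10^21 operations (an assumption, not an Aurora figure).
HYPOTHETICAL_WORKLOAD = 10**21

exa_seconds = runtime_seconds(HYPOTHETICAL_WORKLOAD, EXA_OPS_PER_SECOND)
peta_seconds = runtime_seconds(HYPOTHETICAL_WORKLOAD, PETA_OPS_PER_SECOND)

print(f"exascale:  {exa_seconds:,.0f} s (~{exa_seconds / 60:.1f} minutes)")
print(f"petascale: {peta_seconds:,.0f} s (~{peta_seconds / 86_400:.1f} days)")
```

Under these assumptions, the hypothetical workload takes about 17 minutes at the exascale threshold and roughly 11.6 days at the petascale threshold, a thousandfold difference.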

What technological advancements enabled the Aurora supercomputer to surpass the exascale barrier, and how do these innovations contribute to its performance?

Aurora surpassed the exascale barrier thanks to several key technological advancements, including high-bandwidth memory, advanced GPUs, and an interconnect system called Slingshot 11. The Slingshot network, with nearly twice as many end-points as any other large-scale system currently deployed, allows Aurora’s more than 10,000 nodes to deliver massive amounts of data, which is crucial for its performance. This design enables Aurora to be the world’s fastest system for artificial intelligence (AI) (#1 Top500 MxP) and one of the fastest for traditional computing tasks (#2 Top500 HPL).

In what ways can Aurora’s exascale computing power accelerate advancements in artificial intelligence and machine learning?

Aurora’s exascale computing power is driven by its huge amount of memory and many GPUs, which are essential for training large AI models with trillions of parameters. These capabilities were demonstrated in June when Aurora achieved outstanding results in mixed-precision calculations, a key aspect of AI training workloads, even before the full system was operational. This performance highlights Aurora’s ability to accelerate AI and machine learning advancements, allowing researchers to handle massive datasets and develop more sophisticated models that can drive breakthroughs in various scientific fields.

Can you elaborate on the simulations and experiments planned to be conducted using Aurora and how its capabilities will enhance these studies?

Although Aurora is not yet in full production, real-world codes are already running on the system with excellent results. These include projects from the Argonne Leadership Computing Facility’s (ALCF) Early Science Program and the Exascale Computing Project, covering areas like energy science, cancer research, and cosmology. These applications are producing new science results at scales that were impossible on previous systems – showcasing Aurora’s capabilities even before its official launch.

Aurora’s advanced technology will greatly enhance these studies by enabling more detailed and complex simulations. Aurora expands the possibilities for scientific research, allowing for breakthroughs in some of the most challenging areas, particularly in energy science. Full production is expected in 2025.

Did you face any challenges in the development and deployment of Aurora? What lessons have been learned that can be applied to future supercomputing projects?

The development and deployment of Aurora encountered many challenges, including delays due to vendor decisions and pandemic-related supply chain issues, which extended the timeline. Unlike in previous projects, these issues revealed the need for more flexibility in acquisition strategies. The rigid acquisition models in use today make it difficult to adapt to a field where technology evolves rapidly.

We deployed other powerful systems during the delays, such as Polaris and the AI Testbed, allowing science teams to continue their work. This experience taught us the importance of having adaptable strategies and alternative systems in place, ensuring that research can progress even when facing unforeseen obstacles. For future supercomputing projects, more flexible acquisition models will be crucial to keep pace with the rapid advancements in AI and other technologies.

How do you manage the vast amounts of data collected by Aurora?

Managing the vast amounts of data generated by Aurora is made possible through a combination of its high-speed Slingshot interconnect and its custom filesystem. The filesystem, DAOS (Distributed Asynchronous Object Store), is a high-performance storage system. The Slingshot interconnect delivers exceptional bandwidth to the DAOS filesystem, enabling fast data transfer and storage.

This system is fully integrated into ALCF’s Global Filesystem environment, ensuring that data can be efficiently managed, stored, and accessed across Aurora’s vast compute fabric. This setup supports the high demands of simulations and AI workloads. It contributes to Aurora’s leading performance in data management, as evidenced by its top ranking in the IO500 production list in 2024.

How does Aurora’s energy efficiency and environmental impact compare to previous supercomputers, and what technologies have been employed to reduce its environmental footprint?

Aurora is designed with energy efficiency in mind, utilising advanced technologies to reduce its environmental impact compared to previous supercomputers. The water-cooled system is more efficient than traditional air cooling, and we’ve strategically placed transformers and switchgear as close as possible to the system to minimise energy loss.

Additionally, Aurora is housed in a new state-of-the-art data centre specifically designed to support efficient energy use. While Aurora is a step forward, the entire community still needs to continue improving energy efficiency in future supercomputing projects.

Can you discuss the collaborative efforts between different organisations and institutions in developing Aurora? How did these partnerships contribute to its success?

The success of Aurora is a result of strong collaborative efforts on multiple fronts. First, we partnered with Intel and Hewlett Packard Enterprise (HPE) to design and deploy the system, ensuring it met the demands of our user community. Second, we worked closely with our sister facilities at the Oak Ridge Leadership Computing Facility (OLCF) and the National Energy Research Scientific Computing Center (NERSC), sharing lessons learned and best practices to optimise the development and deployment process.

Finally, our partnership with the Department of Energy’s Exascale Computing Project was crucial. This collaboration increased engagement with industry and helped develop exascale-ready tools and applications, ensuring that Aurora would be equipped to tackle the most complex scientific challenges. These combined efforts have been key to Aurora’s success, setting a new standard for supercomputing.

What are the long-term goals for the Aurora supercomputer, and what are the anticipated next steps in this field?

Aurora is designed to be a key player in an evolving ecosystem of exascale supercomputers aimed at unlocking new possibilities for scientific research and accelerating discoveries. The long-term goal is to develop AI-enabled workflows and models that could revolutionise fields such as clean energy, understanding our universe, and drug discovery.

Aurora is also part of a broader journey in the computing continuum. We are already working on the design of the next-generation system, Helios, which will build on the lessons learned from Aurora. Helios will continue this trajectory of innovation, pushing the boundaries of what supercomputing can achieve in the years to come.

Please note, this article will also appear in the 19th edition of our quarterly publication.

Contributor Details

Michael E. Papka
Argonne National Laboratory
Deputy Associate Laboratory Director, Computing, Environment and Life Sciences; Director, Argonne Leadership Computing Facility; Professor of Computer Science, The University of Illinois, Chicago
Website: https://www.anl.gov/profile/michael-e-papka
