New supercomputer that offers more methods
Phase 2 of SuperMUC-NG has been installed at the Leibniz Supercomputing Centre and will undergo extensive testing before officially starting operations in spring. In this pilot phase, the first scientific codes are also being implemented.

Now that the installation is complete, researchers and HPC specialists can get to work. During an early user or pilot phase, they will implement the first scientific codes on SuperMUC-NG Phase 2, or SNG-2 for short, the expansion of the supercomputer at the Leibniz Supercomputing Centre (LRZ). They are also investigating data connections, the interplay of processors and accelerators, and the interaction of memory components. Dr. Gerald Mathias, Head of the Computational X Support Team (CXS), says: "Phase 2 requires new programming paradigms to execute parts of the codes and workloads on the GPUs. The programmes must therefore be adapted and routines reprogrammed and, of course, we want to understand the capabilities of the system, how we can improve data input and output during operation, and how SNG-2 responds to power-saving measures."
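To illustrate the kind of adaptation Mathias describes, the following is a minimal, generic sketch of how a loop that previously ran on CPU threads can be offloaded to a GPU with OpenMP target directives. It is not code from SNG-2 or from any of the ported applications; the array names and sizes are hypothetical, and an offload-capable compiler (for example Intel's icpx with -fiopenmp -fopenmp-targets=spir64) is assumed.

```cpp
// Generic sketch: offloading a simple loop to a GPU with OpenMP target
// directives. Illustrative only; names and sizes are hypothetical.
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);
    double *pa = a.data(), *pb = b.data(), *pc = c.data();

    // Classic host-side parallelism would look like this:
    // #pragma omp parallel for
    // for (int i = 0; i < n; ++i) pc[i] = pa[i] + pb[i];

    // GPU offload: map the arrays to device memory and distribute the
    // loop iterations across the accelerator's compute units.
    #pragma omp target teams distribute parallel for \
        map(to: pa[0:n], pb[0:n]) map(from: pc[0:n])
    for (int i = 0; i < n; ++i)
        pc[i] = pa[i] + pb[i];

    std::printf("c[0] = %f\n", pc[0]);  // expect 3.0
    return 0;
}
```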
Integrating AI into HPC
SNG-2 was developed to accelerate computations and to integrate AI methods, which researchers increasingly use, into established HPC workflows. Each of its 240 compute nodes, based on Lenovo's ThinkSystem SD650-I V3 Neptune DWC servers and cooled with 45° C hot water, contains two CPUs (Intel Xeon Platinum 8480+) and four Intel graphics processing units (Data Centre GPU Max, code-named Ponte Vecchio). The latter process data faster, for example for classic simulation tasks, but they are also suited to highly scalable, compute- and data-intensive workloads such as machine learning. These tasks are also supported by a Distributed Asynchronous Object Storage (DAOS) system, based on Intel Optane memory, that accelerates access to large amounts of data; in the IO500 list from November 2023, SNG-2 ranks second among production systems. The system achieves a performance of 17.19 PetaFLOPS, which corresponds to around 17 quadrillion floating-point operations per second. Computing time on SNG-2 can be requested via the Gauss Centre for Supercomputing (GCS).
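For a sense of what a node with four accelerators looks like from an application's perspective, the short SYCL sketch below simply enumerates the GPU devices visible to a process. It is illustrative only; the number and names of reported devices depend on the runtime configuration, for instance on whether GPU tiles are exposed as separate devices.

```cpp
// Minimal sketch: list the GPUs a SYCL application can see on one node.
// Output depends on the runtime configuration, so counts and names here
// are illustrative, not a statement about SNG-2's job setup.
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
    std::cout << "Visible GPU devices: " << gpus.size() << "\n";
    for (const auto& d : gpus) {
        std::cout << "  " << d.get_info<sycl::info::device::name>()
                  << ", "
                  << d.get_info<sycl::info::device::global_mem_size>() / (1UL << 30)
                  << " GiB global memory\n";
    }
    return 0;
}
```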
Prior to the official launch in spring, the system will be extensively tested and equipped with useful programming tools and applications during this early user phase. In addition to the general HPC software stack and the Intel oneAPI tools, a first set of codes has already been implemented in collaboration with the CXS team and Intel specialists: the two GPU-optimised astrophysics programs OpenGadget and DPEcho, the molecular dynamics applications Gromacs and Amber, and the SeisSol (seismology) and CP2K (quantum chemistry) codes, along with practical tools such as the Kokkos framework for writing portable C++ applications. The observations and analyses focus on how scientific codes address the parallel processors and how users can quickly access data and manage computational results - functionalities that should run smoothly in everyday scientific work.
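Kokkos, mentioned above, is aimed at performance portability: the same C++ loop can be compiled for a CPU backend or a GPU backend chosen at build time. The following is a minimal, generic sketch of that idea, not code taken from any of the ported applications.

```cpp
// Generic sketch of the Kokkos programming model: the same parallel
// constructs run on the backend selected when Kokkos is built
// (e.g. OpenMP on the CPU or a GPU backend). Illustrative only.
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;
        // A View is Kokkos' portable array; it is allocated in the memory
        // space of the default execution space (host or device).
        Kokkos::View<double*> x("x", n), y("y", n);

        Kokkos::parallel_for("init", n, KOKKOS_LAMBDA(const int i) {
            x(i) = 1.0;
            y(i) = 2.0;
        });

        double sum = 0.0;
        Kokkos::parallel_reduce("dot", n, KOKKOS_LAMBDA(const int i, double& acc) {
            acc += x(i) * y(i);
        }, sum);

        std::printf("dot = %f\n", sum);  // expect 2 * n
    }
    Kokkos::finalize();
    return 0;
}
```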
Learning new and different processes
"To exploit the potential of the GPUs, the OpenMP programming model and SYCL, an extension of C++, play a major role. OpenMP is widely used in academic applications, but most applications still need to be adapted to SYCL," says Mathias. The SeisSol research project has even developed a code generator for this task, which is now proving itself on SNG-2 and may also help to adapt other research codes. The DAOS storage is likewise under particular scrutiny during this test phase: the LRZ team is working with researchers to check that access from different applications and programmes works smoothly and that all container types can be addressed. Beyond these routine tasks, they are also experimenting with new workflows for artificial intelligence (AI) methods at the LRZ. In addition to the HPC tool set, AI frameworks are gradually being installed, which can then be used to train AI models on large amounts of data or to detect patterns in simulation data. To support this, the CXS team has been reinforced with data specialists.
Traditionally, the LRZ's high-performance computers have been used by researchers from a wide range of scientific disciplines, but the CXS team is now also expecting data experts from those disciplines. Workshops and seminars are being planned to familiarise researchers with the capabilities of SNG-2 and with the processes for combining classic computations with AI methods. In recent years, the LRZ's course programme has expanded to include topics relating to AI, machine learning and deep learning, and Mathias and his colleagues are now organising a hackathon with Intel to optimise HPC codes, as well as further workshops on AI processes on SNG-2 and on optimising codes for the new system architecture. "This pilot phase with researchers is particularly intensive and exciting this time," says Mathias. "Many things are new and different - we are all learning a lot." All researchers will benefit from this experience once the system is up and running, as the CXS team will support their projects and advise them on all kinds of code implementation tasks. (vs)