
ScalNEXT

Technology: Supercomputing
Research area: Future Computing

The ScalNEXT (Scalable Network-centric EXecuTion) project deals with the optimisation of data management and the control flow of computing nodes for supercomputing.

Modern HPC systems are usually organised as clusters: individual, largely independent nodes, each running its own operating system instance, are tied together only by a coarse-grained resource or job management system and connected by a network. The networks used for this, such as InfiniBand, Slingshot or Tofu, often offer high bandwidth, but their latency is bounded by physical limits, and they are usually passive, i.e. they are used only for communication between the nodes. Besides the actual computing tasks, data management (partitioning, distributed models, etc.) and control-flow management (message patterns, dependencies, task and thread management, etc.) remain distributed across the nodes and therefore at maximum distance from one another. This leads to high latencies for management and control tasks, scaling bottlenecks due to the large number of active endpoints, and communication bottlenecks due to the need for synchronisation messages. As the performance of the computing nodes increases, both in raw computing power and in energy efficiency, the gap between node and network widens. In classic HPC applications, this increases the pressure on the network and leads to performance losses. It also drives the demand for higher network bandwidth, e.g. through cost-intensive multi-rail or many-rail connections. As a result, network costs already account for a considerable proportion of overall system costs.

However, modern networks make it possible to shift many of these tasks into the network itself, anchoring them more centrally in the system and avoiding scaling problems. These so-called smart networks, which are reconfigurable and programmable, are already used in telecommunications and data centres, together with technologies such as software-defined networking (SDN). In the HPC sector, however, they have hardly been used to date, and several challenges must be solved first. These include the secure virtualisation of network resources at user level, the development of simple APIs that are compatible with existing programming approaches, and the redesign of operating systems around global, cross-network approaches.

The ScalNEXT project addresses these challenges and develops new technologies to enable the use of smart networks in the HPC sector. The goal of ScalNEXT is to increase the scalability of HPC systems and applications. We will develop basic technologies that allow core data management and control-flow functionality to be offloaded from the nodes into the network (onto NICs and switches), and we will apply them to three core application areas: modelling and simulation (ModSim), data analysis and I/O (HPDA), and machine learning (ML/AI). In all three areas, this relieves the computing nodes, which can then be fully dedicated to the necessary calculations, and it transfers management and control tasks to the more closely linked and more centrally located network resources. The result is a significant increase in computational efficiency on the nodes, the option of moving calculations closer to the data, and a significant increase in scalability (e.g. because a significantly smaller number of NICs or switches take over the tasks instead of a large number of parallel processes).
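
To make the scalability argument concrete, here is a toy sketch (not ScalNEXT code) of a sum reduction over N processes. It compares the number of messages arriving at the busiest element when every process sends directly to one root node with the case where switches aggregate partial results on the way. The function names, the `radix` parameter and the idealised tree topology are illustrative assumptions, not taken from the project.

```python
def host_based_root_load(num_procs: int) -> int:
    """Messages the root node receives when all other processes send to it directly."""
    return max(num_procs - 1, 0)


def in_network_max_load(num_procs: int, radix: int) -> int:
    """Largest number of inputs any single switch handles (hypothetical radix-limited tree)."""
    if radix < 2:
        raise ValueError("radix must be at least 2")
    return min(num_procs, radix)


def aggregation_levels(num_procs: int, radix: int) -> int:
    """Number of switch levels needed until a single result remains."""
    if radix < 2:
        raise ValueError("radix must be at least 2")
    levels = 0
    remaining = num_procs
    while remaining > 1:
        remaining = -(-remaining // radix)  # ceiling division
        levels += 1
    return levels


def tree_reduce(values, radix: int):
    """Simulate in-network aggregation: each switch sums up to `radix` inputs
    and forwards one partial result to the next level."""
    level = list(values)
    if not level:
        raise ValueError("need at least one value")
    if radix < 2:
        raise ValueError("radix must be at least 2")
    while len(level) > 1:
        level = [sum(level[i:i + radix]) for i in range(0, len(level), radix)]
    return level[0]


if __name__ == "__main__":
    n, k = 1000, 16  # assumed process count and switch radix
    print("root load, host-based:", host_based_root_load(n))
    print("max load, in-network: ", in_network_max_load(n, k))
    print("aggregation levels:   ", aggregation_levels(n, k))
    print("reduced value:        ", tree_reduce(range(n), k))
```

In this simplified model the load on the busiest element is bounded by the switch radix rather than growing with the number of processes, at the price of a few extra aggregation levels. Real smart-network offloading also has to handle switch memory limits, data types and fault handling, which this sketch ignores.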

At the LRZ, the techniques developed in the project and the associated scaling gains will have a direct influence on future systems such as the ExaMUC exascale system.

Start date: Oct 1, 2022
End date: Sep 30, 2025
Funding agency: BMBF
Partners: Technische Universität München (Consortium Lead), Karlsruher Institut für Technologie (KIT), Johannes Gutenberg-Universität Mainz, Rheinisch-Westfälische Technische Hochschule Aachen, APS Networks (until March 2024)
Total budget: tba