With research and supercomputing against crises


SuperMUC-NG and its colleagues at LRZ run on renewable energy. Photo: K. Wurth/Unsplash

Modern science needs computing power – and therefore electricity. A lot of electricity, in fact. Researchers, in turn, provide solutions for overcoming crises – and strategies for efficient energy use.

• Researchers work on finding solutions to energy crises. In addition, they develop innovative tools and strategies for a more efficient use of energy in IT in cooperation with manufacturers.

• Reduced clock frequencies and energy management, the virtualisation of infrastructure as well as (warm) water cooling systems reduce power requirements in data centres and make waste heat usable.

• Supercomputing is at the forefront of science: this is where the tools are being prepared to make optimum use of hardware and to programme energy-efficient software.

When SuperMUC-NG, the supercomputer of the Leibniz Supercomputing Centre (LRZ), is running at full speed, it needs 3.4 megawatts or 3400 kilowatts of electricity. That's quite a lot, about 75 per cent of the LRZ's total demand. With this power, however, the supercomputer produces models and calculations that help us, for example, to better understand environmental phenomena and thus to plan precautionary measures against extreme weather. Researchers use high-performance computers (HPC) to develop drugs, therapies or materials, and to find solutions and methods for other research disciplines, for technology and IT.

Abandoning research and supercomputing in order to reduce electricity consumption is not the way out of the energy crisis, especially since high-performance computers cannot easily be switched off and started up again. Policymakers and society should see both as part of the solution: at the high-performance computers of research institutions as well as at the three national supercomputing centres of the Gauss Centre for Supercomputing, scientists have been working successfully for years on reducing the power consumption of computers while increasing their performance. In close cooperation with technology providers, they develop tools and optimise technology that drive technical progress in IT.

If, as feared in many places, there are bottlenecks in electricity and energy supply this winter, it will be due to the Russian attack on Ukraine. This is a challenge which will help us prepare for greater tasks: the energy supply as a whole must be switched from fossil fuels to renewable energies in order to stop global warming and climate change. Gaps in the energy supply cannot be ruled out, but with the help of research, testing and the optimisation of innovative technology, this transition can succeed.

Automatically switching off what is not being used

At the Leibniz Supercomputing Centre (LRZ) in Garching, we were able to take precautions: SuperMUC-NG and all other HPC resources, data storage and networks have been running for more than ten years on 100 per cent electricity from renewable sources such as solar, water and wind power. The supplier compensates for any fluctuations caused by the weather. Until the end of 2024, 95 per cent of the electricity demand is contractually fixed; we are affected by rising prices only when we have to buy smaller amounts of electricity on the spot market. Several scenarios were calculated and contingency plans drawn up for possible major gaps in supply.

But it is not only since electricity became scarce and expensive that the focus has been on energy requirements in IT and data centres. For economic reasons – money saved on electricity can be invested in hardware and thus in more compute capacity – many practicable tools for reducing energy requirements have been developed in the energy-intensive supercomputing centres and further improved in close exchange. This race for the optimum has benefited the economy and society, because personal computers and mobile devices also delivered more and more performance as their power consumption fell.

The high-performance computers at LRZ do not consume the highest amounts of energy every day, but only when processors and computing nodes are running at full speed. SuperMUC-NG mostly works with a reduced clock frequency: instead of the possible 2.7 gigahertz, it operates at only 2.3 gigahertz. Many applications do not benefit from a higher clock frequency anyway, but without this measure the computers would consume up to 30 per cent more electricity on an annual average. A further reduction of the clock frequency would do little good in the current situation, because applications would then run longer and thus consume more energy overall.
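Why a lower clock frequency saves energy only up to a point can be illustrated with a toy model. The sketch below is not LRZ's model: it assumes that a node draws a constant base power plus a frequency-dependent part that grows roughly with the cube of the clock frequency, and that only part of an application's runtime scales with the clock. All wattages, the compute fraction and the function names are hypothetical values chosen for illustration.

```python
# Toy model: energy per job = power draw x runtime, both depending on clock frequency.
# All parameter values are hypothetical and are NOT measurements from SuperMUC-NG.

F_MAX_GHZ = 2.7  # maximum clock frequency mentioned in the article


def power_w(f_ghz, static_w=300.0, dynamic_max_w=200.0):
    """Node power: constant base load plus a part that scales with f^3 (assumption)."""
    return static_w + dynamic_max_w * (f_ghz / F_MAX_GHZ) ** 3


def runtime_s(f_ghz, runtime_at_max_s=3600.0, compute_fraction=0.7):
    """Only the compute-bound fraction of the job slows down at lower clock rates."""
    slowdown = compute_fraction * (F_MAX_GHZ / f_ghz) + (1.0 - compute_fraction)
    return runtime_at_max_s * slowdown


def energy_kwh(f_ghz):
    """Energy for one job in kilowatt-hours (1 kWh = 3.6 million joules)."""
    return power_w(f_ghz) * runtime_s(f_ghz) / 3.6e6


def best_frequency(candidates_ghz):
    """Return the candidate frequency with the lowest energy per job."""
    return min(candidates_ghz, key=energy_kwh)


if __name__ == "__main__":
    grid = [f / 10 for f in range(10, 28)]  # 1.0 GHz to 2.7 GHz in 0.1 GHz steps
    for f in (1.2, 2.3, 2.7):
        print(f"{f:.1f} GHz: {energy_kwh(f):.3f} kWh per job")
    print("lowest energy in this toy model at", best_frequency(grid), "GHz")
```

In this sketch, energy per job falls when the clock is lowered from 2.7 to 2.3 gigahertz but rises again at very low frequencies, because the constant base load has to be paid for over a longer runtime. Where the turning point lies depends entirely on the assumed parameters.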

Computers are sometimes referred to as "heaters with integrated logic functions". Up to 60 per cent of the energy consumed doesn't flow into computing. That's highly inefficient, and so today almost all supercomputing centres on the international TOP500 list, which ranks high-performance computers by performance, use water cooling. This makes fans and chillers – which require additional electricity – largely superfluous and the waste heat usable. Power usage effectiveness (PUE) has proven itself as a measure of energy efficiency in the data centre. It relates a data centre's total electricity consumption to the share that goes into computing. The optimum would be 1, meaning that 100 per cent of the energy flows into computing. According to the international evaluation centre Uptime Institute, high-performance computers worldwide achieved an average of just under 1.6 in 2021. The LRZ's supercomputer achieves a PUE of 1.06: for every kilowatt used for computing, only 0.06 kilowatts flow into infrastructure such as cooling. This higher efficiency was achieved in close cooperation with technology provider Lenovo using test runs and a gradual increase in water temperature. Today, water with a temperature of up to 50 degrees flows through the racks of SuperMUC-NG and is heated further by the waste heat. Stored in this way, the heat can be used to warm neighbouring offices or greenhouses, or for other purposes. For this reason, 30 data centres in Sweden are already integrated into district heating networks. The LRZ could also give off heat. But to absorb it, buildings in the vicinity would have to be converted. We observe that such solutions are now finally being discussed more often. Further opportunities to curb the energy demand of cooling and to make better use of waste heat lie in close collaboration between IT departments and building management.
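The PUE figures above can be made tangible with a short calculation. The following sketch uses the common definition of PUE as total facility power divided by the power that reaches the IT equipment. The function names are illustrative, and applying both PUE values to the 3400 kilowatts quoted for SuperMUC-NG at full speed is a simplifying assumption made only for comparison.

```python
def pue(total_facility_kw, it_kw):
    """Power usage effectiveness: total facility power divided by IT power."""
    if it_kw <= 0:
        raise ValueError("IT power must be positive")
    return total_facility_kw / it_kw


def overhead_kw(it_kw, pue_value):
    """Power spent on infrastructure (cooling, power distribution) for a given IT load."""
    return it_kw * (pue_value - 1.0)


if __name__ == "__main__":
    it_load_kw = 3400.0  # full-speed demand of SuperMUC-NG quoted in the article
    for label, value in (("PUE 1.06", 1.06), ("PUE 1.6", 1.6)):
        extra = overhead_kw(it_load_kw, value)
        print(f"{label}: about {extra:.0f} kW of overhead on top of {it_load_kw:.0f} kW")
```

Under these assumptions, the same computing load needs roughly 204 kilowatts for infrastructure at a PUE of 1.06, compared with roughly 2040 kilowatts at a PUE of 1.6.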

What has been developed, optimised and adapted in supercomputing is becoming more and more established in conventional data centres. In addition to water cooling and reduced clock frequencies, they also focus on the virtualisation of hardware and software. This helps to save space and energy, increasing the availability of IT services while reducing power requirements. If energy management tools are also used, the demand is reduced by up to 30 per cent: Distributed Power Management automatically powers down hardware when it is not needed. For the same reason, IT systems set up for testing purposes in computing and supercomputing centres like the LRZ are only activated when there are tasks for them.

Teamwork between stakeholders in industry and research

Better work scheduling also reduces the energy demand in data centres and in supercomputing: researchers and technology providers are focusing on what is known as Energy-Aware Scheduling and are developing more and more tools for this purpose, lately with increasing use of artificial intelligence methods. Computing jobs are combined in such a way that memory and processors are kept as busy as possible. This also reduces power consumption. At present, the focus is on data transfer within a system, where there are still many opportunities to reduce energy consumption and get more performance out of the hardware.
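The idea of combining jobs so that memory and processors are both kept busy can be sketched with a simple pairing heuristic. This is a deliberately minimal illustration and not a description of the schedulers used at LRZ or of any AI-based method: the `Job` fields, the utilisation figures and the greedy pairing rule are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    cpu_share: float  # fraction of a node's processor capacity the job uses (hypothetical)
    mem_share: float  # fraction of a node's memory bandwidth the job uses (hypothetical)


def pair_jobs(jobs):
    """Greedily pair the most processor-heavy job with the most memory-heavy one.

    Returns a list of (cpu_heavy, mem_heavy) pairs and a leftover job, or None.
    """
    # Sort from most memory-bound (negative score) to most compute-bound (positive score).
    ordered = sorted(jobs, key=lambda job: job.cpu_share - job.mem_share)
    pairs = []
    while len(ordered) >= 2:
        mem_heavy = ordered.pop(0)
        cpu_heavy = ordered.pop()
        pairs.append((cpu_heavy, mem_heavy))
    leftover = ordered[0] if ordered else None
    return pairs, leftover


if __name__ == "__main__":
    queue = [
        Job("A", 0.9, 0.1),
        Job("B", 0.2, 0.8),
        Job("C", 0.8, 0.2),
        Job("D", 0.1, 0.9),
    ]
    pairs, leftover = pair_jobs(queue)
    for cpu_heavy, mem_heavy in pairs:
        print(f"node shares {cpu_heavy.name} (compute-bound) and {mem_heavy.name} (memory-bound)")
```

A production scheduler would also have to respect capacity limits, job runtimes and priorities, which this sketch leaves out for brevity.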

For more efficiency, computing processes must be intensively monitored at all times. On the basis of this operating data, it is hoped that supercomputers can be controlled in a smarter way and that processes can be further automated. Here, too, supercomputers will set the pace for IT innovations: in the more than 6480 computing nodes of the SuperMUC-NG system, around 15 million sensors collect data on performance, temperature, component load and the handling of software and applications. To analyse this information, LRZ specialists have developed the open-source software Data Centre Data Base (DCDB) and a first systematic approach. Both are publicly accessible, shared and discussed with other supercomputing centres, and then improved and developed further. They could soon form the basis of smart computer control, which will be highly welcome in data centres as well as among computer manufacturers. Thanks to this operating data, it will be possible to better adapt software to the requirements of a computer: programming offers further opportunities to throttle the energy demand, even if the effects are difficult to assess.

The current energy crisis is compelling us all to use resources more efficiently. Targeted data management and the networking of science have turned out to be another area of research that can help curb the demand for electricity: if data is searchable, generally available and reusable, a lot more insights can be drawn from it. This is a model that might well prove its worth in industry. Researchers see crises primarily as a challenge. Electricity and its scarcity electrify them and give rise to flashes of inspiration for new, necessary solutions. This attitude can set an example for all of us: a high level of curiosity and the spirit of research can help us see the opportunities in this crisis and face the future with greater optimism. (Prof. Dr. Dieter Kranzlmüller)