
CoolMUC: New system, new rules

Technology: Supercomputing – Research area: Big Data & AI

The new CoolMUC system has nearly completed 100 successful working days: a report on initial practical experience with the new computing cluster at the LRZ and on what users can expect in the coming months.

A new computing cluster has arrived at the Leibniz Supercomputing Centre (LRZ) and went live on December 10, 2024. “The fourth generation of our CoolMUC is already running at high capacity,” says Dr. Gerald Mathias, head of the Computational X Support (CXS) team at the LRZ. “We’re now starting to analyse which levers we need to pull to improve its utilisation.” The high-performance cluster, primarily used by Munich and Bavarian universities, replaces its predecessors CoolMUC-2 and CoolMUC-3, which had proved reliable workhorses before being shut down in December 2024. Users of the current CoolMUC are already adapting to the new rules and practices that come with the modernised technology.

Powerful processors for modelling and simulation

After the first 100 days with CoolMUC, its administrators are drawing an initial conclusion: “The performance density is significantly higher than in previous generations, and the competition among users for computing resources is increasing,” observes Mathias. “It’s now harder to fully utilise an entire node on the new system.” This is due to the compact, high-performance technology: while processors in previous generations offered only 28 or 64 cores per compute node, the new Intel processors provide up to 112 cores per node. This calls for more targeted job management. In the past, a single application could occupy entire nodes of the cluster; now research groups share a node with other scientists. Even large jobs that run in parallel across many cores often utilise a node only partially.
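One practical consequence of node sharing is that a job should size itself to the cores it has actually been granted rather than to the full 112-core node. The following minimal Python sketch illustrates the idea, assuming a Slurm-style batch system that exports SLURM_CPUS_PER_TASK inside an allocation; the toy workload and the fallback to os.cpu_count() are illustrative assumptions, not a description of the CoolMUC setup.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def sum_of_squares(block):
    """Toy workload standing in for a real per-core task."""
    return sum(i * i for i in block)

if __name__ == "__main__":
    # Use the cores the batch system actually granted (Slurm exports
    # SLURM_CPUS_PER_TASK when --cpus-per-task is requested) rather than
    # assuming that the whole 112-core node belongs to this job.
    ncores = int(os.environ.get("SLURM_CPUS_PER_TASK", os.cpu_count() or 1))

    data = range(10_000_000)
    step = -(-len(data) // ncores)                 # ceiling division
    blocks = [data[i:i + step] for i in range(0, len(data), step)]

    with ProcessPoolExecutor(max_workers=ncores) as pool:
        total = sum(pool.map(sum_of_squares, blocks))

    print(f"workers: {ncores}, result: {total}")
```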

The cluster is built from three different Intel processor types: Ice Lake, Cooper Lake, and Sapphire Rapids. In total, 121 compute nodes are available. Two Ice Lake nodes, each with 80 cores and one terabyte of memory, are reserved exclusively for system operation and management, not for computation: they run the new operating system SUSE Linux Enterprise Server (SLES 15 SP6), execute control and planning tools, and serve as login nodes for compiling and testing programmes or for preparing simulation runs.

Processing cores for different tasks

Researchers from Munich and Bavarian universities can quickly gain access to the remaining 119 nodes after submitting a brief project description. The computing cluster offers various capabilities: 106 of the nodes are equipped with Sapphire Rapids processors, each providing 112 cores and 512 gigabytes of RAM. These are suited to jobs in which individual cores perform the same task, or in which many cores compute simultaneously but either do not need much memory or can distribute their data across several memory areas. Jobs that need to load or cache more data during execution are better served by the Ice Lake processors: these 12 nodes each offer 80 cores and one terabyte of main memory.
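As a rough guide to choosing between the node classes, a back-of-the-envelope memory estimate is often enough. The short Python sketch below uses the node figures quoted in this article; the selection function and the example job are hypothetical and purely illustrative.

```python
# Node classes as described in the article; everything else is illustrative.
NODE_TYPES = {
    "sapphire_rapids": {"cores": 112, "ram_gb": 512},
    "ice_lake":        {"cores": 80,  "ram_gb": 1024},
    "teramem":         {"cores": 96,  "ram_gb": 6144},
}

def pick_node(ram_gb_needed: float, cores_needed: int) -> str:
    """Return the smallest node class (by memory) that fits the job."""
    for name, spec in sorted(NODE_TYPES.items(), key=lambda kv: kv[1]["ram_gb"]):
        if spec["ram_gb"] >= ram_gb_needed and spec["cores"] >= cores_needed:
            return name
    raise ValueError("job does not fit on a single node; distribute the data")

# Example: a simulation holding 3e9 double-precision values in memory
ram_needed = 3e9 * 8 / 1e9           # roughly 24 GB
print(pick_node(ram_needed, 64))     # -> sapphire_rapids
```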

Finally, there is Teramem, a Cooper Lake node with 96 cores and around six terabytes of RAM: “Incidentally, this is the only system at the LRZ that provides up to six terabytes of memory for processing very large data volumes on a single node,” Mathias notes. “Typical Teramem jobs tend to use comparatively little CPU power, run as shared-memory applications, and do without distributed-memory parallelisation.” Accordingly, the LRZ specialists advise against running applications that use the Message Passing Interface (MPI) on Teramem.
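For illustration, here is a minimal sketch of the kind of single-node job Teramem targets: one large array is held exactly once in the node’s memory and read by several worker processes through POSIX shared memory, with no MPI and no distribution across nodes. Array size, worker count, and the simple reduction are assumptions chosen to keep the example small.

```python
import numpy as np
from multiprocessing import Pool, shared_memory

N = 50_000_000   # float64 -> about 400 MB here; on Teramem this could approach terabytes

def partial_sum(args):
    """Sum one slice of the shared array without copying the data."""
    shm_name, lo, hi = args
    shm = shared_memory.SharedMemory(name=shm_name)
    arr = np.ndarray((N,), dtype=np.float64, buffer=shm.buf)
    result = float(arr[lo:hi].sum())
    del arr          # drop the view before closing the segment
    shm.close()
    return result

if __name__ == "__main__":
    # The data lives exactly once in the memory of a single node.
    shm = shared_memory.SharedMemory(create=True, size=N * 8)
    arr = np.ndarray((N,), dtype=np.float64, buffer=shm.buf)
    arr[:] = 1.0
    del arr          # workers re-attach to the segment by name; no copy is made

    nworkers = 8
    edges = np.linspace(0, N, nworkers + 1, dtype=int)
    jobs = [(shm.name, int(edges[i]), int(edges[i + 1])) for i in range(nworkers)]

    with Pool(nworkers) as pool:
        total = sum(pool.map(partial_sum, jobs))
    print(f"total = {total:.0f}")    # expected: N, since every entry is 1.0

    shm.close()
    shm.unlink()
```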

Accelerated computing and artificial intelligence

For greater energy efficiency, the computing cluster will soon be equipped with GPU accelerators. In 2026, when more concrete data on usage and demand is available, CoolMUC will also be expanded with additional Central Processing Units (CPUs). “Thanks to BayernKI, there are sufficient resources for machine learning and other AI applications at the LRZ and at the Friedrich-Alexander University’s computing centre in Erlangen,” explains Mathias. CoolMUC and the systems of the National High Performance Computing (NHR) centre in Erlangen complement BayernKI, and vice versa. This differentiated concept responds to the growing demand for computing resources in Bavaria: high-performance computing is now essential to nearly every scientific discipline.

Beyond the traditional HPC fields such as physics and engineering, more and more disciplines are now using the computing cluster: biologists, geologists, and medical researchers are modelling and simulating, and researchers in business administration, psychology, and education are computing here as well. Even historians and artists use CoolMUC – for example, to build virtual scenes and spaces or to analyse social statistics. CoolMUC therefore faces a wide range of jobs. “We allocate computing power, not computing time – basically, researchers of all disciplines can use CoolMUC,” Mathias emphasises. “We see jobs that require only a few hundred core hours as well as projects that consume many millions.” The principle of “fair share” applies in scheduling: users who use the system frequently and intensively over a short period may have to wait for capacity later on, as new groups are granted access. “Due to its performance density, we will monitor resource use more strictly and plan more efficiently,” Mathias says. “Larger projects requiring more time or power may be redirected to the NHR centres or even the Gauss Centre for Supercomputing (GCS), which allocates time on SuperMUC-NG and Germany’s two other supercomputers.”

New technologies are making high-performance computing more complex. Although many scientific codes still need to be adapted to the current CoolMUC, using the system is not becoming more complicated. The CXS team is preparing a cheat sheet that will give users a quick and clear overview of job types, key commands, and usage rules. The LRZ and its partner institutes also regularly offer training courses and workshops on using the CoolMUC cluster. And finally, every week, specialists from the CXS team are available in the LRZ HPC lounge to answer practical questions, explain access, help with job planning, and share useful tips. (vs)