Testing the emergency reboot

"Power fluctuations have paralysed the energy supply throughout Pakistan," reported the Tagesschau on 23 January this year. On the same day, 24 IT specialists from all departments of the Leibniz Supercomputing Centre (LRZ) teamed up to simulate just such an emergency: a blackout in large parts of Europe, with everything at a standstill in Germany and its neighbouring countries. Everything is down at the LRZ, too: the computers and technical resources, the high-performance and storage systems, the network, the cooling technology and the communication links. But how does the LRZ team get the resources back up and running when the power is restored? "In the best case scenario, it would take two or three days for the first basic services to be up and running again," estimates Stephan Peinkofer, head of the Data Science Storage Infrastructures (DSI) group at the LRZ and organiser of the disaster recovery and business continuity simulation game. "It will certainly take weeks, if not months, until all systems are fully functional, because we will certainly also have to obtain spare parts, replace components and hard disks, and restore backups. After all, a lot of things break during a power outage."

Discussing and developing processes

Such simulation games are part of the requirements for companies and organisations that, like the LRZ, are certified in information security and service quality. They provide important insights into processes and the reassurance that even the biggest emergencies can be handled as a team. And such an exercise is fun: "We had lively discussions about exciting topics, we laughed a lot, and the motivation and concentration of all participants were very high," says Peinkofer. "For example, I had no idea how much depends on the facility management during the commissioning phase and how closely all the departments have to work together. Exciting."

The experts spent a day discussing what to do in an emergency. Often enough, questions arose that no one had asked before: How is it possible to access the computer building, which is electronically secured? During the simulation, the fire brigade would have had to be called to open the doors. "But it is much more important that we can restart the computer technology and infrastructure in a controlled manner after a power outage. This protects the technology. But we would have to be able to turn off the main switches before the power comes back on," says Peinkofer. This would be difficult to plan and would have to be done before the LRZ is shut down. In addition, if the power supply is initially limited after a major blackout, a decision would have to be made whether to disconnect the batteries that the LRZ normally uses to ensure an uninterrupted power supply and reliable IT services. After all, they would start charging immediately, consuming a lot of power and potentially delaying the startup of critical infrastructure. On the other hand, without the batteries the LRZ would be vulnerable to further power fluctuations during this phase. And if it were also cold, the computer rooms would have to be warmed up with fan heaters before the computers could be switched on.

The order is important

Act, think, discuss, find solutions: step by step, the LRZ experts worked out the order in which each service would be restarted on each working day. Networks come first, followed by power management equipment, then the computers for basic IT services, which mainly include communication tools, the Internet and the LRZ's cloud storage. "In order to set up the right sequence, it is important that people from building management, different departments and working groups take part in the simulation game. They ask the right questions," explains organiser Peinkofer. "At first I was afraid that things would get mixed up, but when we got together it was very constructive, structured and focused." A list of around 20 tasks or open to-dos, such as access to the computer building, was drawn up and ranked in order of urgency.
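For readers who want to see how such a restart sequence could be written down, here is a minimal sketch in Python. It is not the LRZ's actual plan or tooling: the service names and the dependency map are hypothetical and only mirror the rough order described above (networks, then power management equipment, then basic IT services). It uses the standard library's graphlib module (Python 3.9 or later) to derive an order in which every service starts only after the things it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map (an assumption for illustration, not LRZ data):
# each service lists what must already be running before it can be restarted.
DEPENDS_ON = {
    "network": set(),
    "power_management": {"network"},
    "communication_tools": {"network", "power_management"},
    "internet_access": {"network", "power_management"},
    "cloud_storage": {"network", "power_management"},
}


def restart_order(depends_on):
    """Return the services in an order where every dependency comes first.

    Raises graphlib.CycleError if the dependencies contradict each other.
    """
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, service in enumerate(restart_order(DEPENDS_ON), start=1):
        print(f"{step}. {service}")
```

Writing the dependencies down rather than the final order has one practical advantage in this sketch: when the team learns that some measure has to move, only one entry changes and the sequence can be derived again.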

Another lesson the working group learned is that simulation games are not complex at all: paper and pens help, but above all the exercise needs a facilitator and someone taking minutes. With the help of the minutes, the facilitator can bring the participants back to the original topic if the discussion veers off course or gets lost in detailed questions. "We were all too optimistic about the time frame, but the main thing is that we've gone through it," says Peinkofer. "The simulation game worked well. We have to live with the risk of a blackout, but we are prepared, we know the processes and the key drivers for restoring our services." A nice side effect: the restart plan developed for the LRZ certification was confirmed in the game. Only the order of some measures has changed. (vs)