All Data In One Place
Data storage often consists of isolated solutions, between which data can only be transferred with considerable effort. Centralised storage is more efficient, especially in HPC, where very large amounts of data are generated. The LRZ has implemented this concept with the Data Science Storage.
Research projects generate considerable amounts of data. This raw data must be processed, visualised, shared, and archived on different systems. System-bound, isolated storage solutions, which were the standard until a few years ago, are a bottleneck: copying data to another target system is very time-consuming. The better way is a data pool that is accessible to all computer systems and thus enables a data-centric approach. At the LRZ, this has already been a reality for some time in the form of the Data Science Storage (DSS).
The DSS is explicitly intended for large data volumes, with 25 PB available at the LRZ; in theory, much more is possible. Future-proofing was built in from the start: the basis of the DSS is IBM Spectrum Scale, a highly scalable object and file storage system for unstructured data. Data stored on it is grouped into so-called DSS containers, which hold not only the actual user data but also metadata such as the container's storage capacity and access rights. The owners of the data, usually the principal investigator or a data curator, can assign read and write rights independently.
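The container concept described above can be pictured as a small data structure: user data bundled with management metadata such as capacity and access rights. The following Python sketch is purely illustrative; the class, field names, and example users are assumptions, not the actual DSS schema or tooling.

```python
from dataclasses import dataclass, field

@dataclass
class DSSContainer:
    """Illustrative model of a DSS container: user data plus metadata
    such as storage capacity and access rights (all names hypothetical)."""
    name: str
    quota_tb: int                           # storage capacity of the container
    curator: str                            # owner who manages access rights
    readers: set = field(default_factory=set)
    writers: set = field(default_factory=set)

    def grant_read(self, user: str) -> None:
        self.readers.add(user)

    def grant_write(self, user: str) -> None:
        # Write access implies read access.
        self.writers.add(user)
        self.readers.add(user)

# A curator granting rights independently of who generated the data:
container = DSSContainer(name="climate-sim-raw", quota_tb=500, curator="pi_mueller")
container.grant_read("visualisation_team")
container.grant_write("data_engineer")
print(sorted(container.readers))  # ['data_engineer', 'visualisation_team']
```

The point of the sketch is only that rights management lives with the container itself, so access is decoupled from the system on which the data was produced.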
Connection to the outside world
The DSS usually offers two methods of accessing the data. Primarily, the NSD (Network Shared Disk) protocol from IBM Spectrum Scale is used, which provides access to the containers from SuperMUC-NG and the Linux Cluster. In addition, an NFS (Network File System) gateway is available to transfer data from the DSS to other LRZ systems such as the Compute Cloud, the VMware cluster, or bare-metal servers. Data transfer to locations outside the LRZ is possible via the transfer service Globus Online.
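For the NFS path, access from a client system is an ordinary NFS mount. The `/etc/fstab` entry below is only a sketch: the export host and paths are hypothetical placeholders, since the actual export names are assigned per container by the LRZ.

```
# Hypothetical mount of a DSS container via the NFS gateway
# (host and paths are placeholders, not real LRZ export names)
dssnfs.example.lrz.de:/dss/container01  /mnt/dss/container01  nfs  vers=4,ro,noatime  0 0
```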
Because the DSS is designed for optimal data throughput at extreme data volumes, minor restrictions have to be accepted in other areas. For example, the data is not replicated to a secondary system; given the sheer volume, this would hardly be economically justifiable. The integrity of the data is instead ensured via RAID. A small residual risk remains: in the worst case, if a system-wide restore becomes necessary, it can take several weeks when multiple PB are involved. A further limitation of the concept: the system is not designed for extremely high availability; the target is 98 to 99 per cent. The focus on transmission capacity, however, allows for considerable transfers. At the beginning of 2020, for example, astrophysicists from Potsdam (Germany) were able to send half a petabyte via Garching to NERSC (the National Energy Research Scientific Computing Center) in Berkeley (USA) at up to 4.5 GB/s.
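The figures in this paragraph translate into concrete numbers. A quick back-of-the-envelope sketch (decimal units assumed, i.e. 1 PB = 10^6 GB, and treating the quoted 4.5 GB/s peak as a sustained rate, so the result is a lower bound):

```python
def transfer_time_hours(volume_pb: float, rate_gb_per_s: float) -> float:
    """Minimum time to move a data volume at a sustained transfer rate."""
    volume_gb = volume_pb * 1_000_000  # 1 PB = 10^6 GB (decimal units)
    return volume_gb / rate_gb_per_s / 3600

def downtime_days_per_year(availability_percent: float) -> float:
    """Annual downtime implied by an availability target."""
    return 365 * (1 - availability_percent / 100)

# The Potsdam-to-NERSC transfer: 0.5 PB at up to 4.5 GB/s
print(f"{transfer_time_hours(0.5, 4.5):.1f} h")    # ≈ 30.9 hours

# The lower end of the availability target, 98 per cent
print(f"{downtime_days_per_year(98):.1f} days")    # ≈ 7.3 days per year
```

In other words, even at peak rate the half-petabyte transfer took well over a day, and a 98 per cent availability target still permits roughly a week of downtime per year.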
Capacity per application
There are two operating models available for using the DSS. For smaller projects starting at 20 TB, contingents are available on LRZ's DSS on Demand (DSSOND). For data volumes in the four-digit terabyte range and above, a DSS is set up individually as a joint project: the LRZ analyses the customer's requirements, procures the necessary hardware on their behalf, integrates it into the existing infrastructure, and provides it on a dedicated basis. This service is designed for users such as professors who carry out long-term projects on the computers of the LRZ. An active project is required to use the DSS. For users of SuperMUC-NG, the Gauss Centre for Supercomputing provides a dedicated DSS; they can receive a free quota, provided they have an accepted data management plan.
"With DSS, the focus is on data," says Stephan Peinkofer from the LRZ. "All relevant systems can access it. Another important improvement is that not only those who generated the data have access to it. With DSS, the data can easily be shared with all those who want to use it in other ways." DSS thus fulfils some of the central requirements of modern research, which has long been global, organised in variable teams, and generating huge amounts of data. This dimension is demonstrated, for example, by Terra_Byte, a cooperation between the LRZ and the German Aerospace Center (Deutsches Zentrum für Luft- und Raumfahrt, DLR) for the evaluation of data from Earth observation satellites: data from the European Earth observation programme Copernicus alone has exceeded the 10 PB threshold, and by 2024 the Sentinel satellites of the Copernicus programme will have generated more than 40 PB of data.
Technical specifications of the LRZ Data Science Storage