
Attended Cluster Node Housing

Attended housing of cluster nodes comprises the physical installation, network connection, operation and administration of customer-owned computational hardware in 19" racks, and its integration with the LRZ Linux cluster. The service makes use of the available infrastructure, including UPS (20 s autonomy time), climatization and access controls. Procurement of supported hardware must follow LRZ guidelines.

Remark: LRZ does not support "unattended housing" of cluster hardware.

Scope of Services and Service Specifics

Housing of compute systems in the Centre

  • 19" Racks in the computer rooms
  • Climatization
  • Doubly redundant power supply (230V), non-exclusive
  • Connection to UPS with an autonomy interval of at least 20 seconds

Monitored operation in the centre

Remote management (access to systems through remote control)

Network connection

  • IP addresses from the Linux Cluster are assigned to housed nodes (public IPv4 subnet: 129.187.20.0/24, public IPv6 subnet: 2001:4ca0:0:200::, private subnets in VLAN 67). In case of operational constraints, a housed system might be assigned a different subnet.
  • Bandwidth of the connection to the outside world may be 10 GBit/s or 40 GBit/s. 100 GBit/s is only available in specific parts of the LRZ infrastructure.
  • Internal connection of nodes with each other and to integrated storage is through a high-performance network with at least 100 GBit/s per node.
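
As a small illustration of the address assignment described above, the following sketch checks whether an IPv4 address falls inside the public subnet 129.187.20.0/24 quoted in this section. It uses only Python's standard ipaddress module. The function name and the sample addresses are hypothetical, and the IPv6 and private subnets are left out because their prefix lengths are not given here.

```python
import ipaddress

# Public IPv4 subnet quoted in the service description.
CLUSTER_IPV4 = ipaddress.ip_network("129.187.20.0/24")


def in_cluster_subnet(address: str) -> bool:
    """Return True if the given address lies in the quoted public IPv4 subnet."""
    return ipaddress.ip_address(address) in CLUSTER_IPV4


if __name__ == "__main__":
    # Sample addresses are made up for illustration only.
    for candidate in ("129.187.20.17", "129.187.21.1"):
        print(candidate, in_cluster_subnet(candidate))
```

A housed system may be assigned a different subnet in case of operational constraints, so a negative result from such a check does not by itself indicate a misconfiguration.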

Operational management of cluster nodes

  • Installation and Administration of the operating environment
  • Maintenance of hardware and system software
  • Integration into the Cluster's batch scheduler
  • Connection to the Cluster file systems as well as the central software repository
  • System monitoring
  • Professional data backup and restore procedures through use of the central tape facilities (IBM Spectrum Protect)
  • Support (includes incident resolution) via LRZ's Service Desk

Optional Services

  • Procurement of hardware
  • Provisioning and Maintenance of specific software packages, if know-how is available at LRZ

Service Parameter

Technical Requirements for the procurement of cluster hardware

  • Requirements are determined jointly by LRZ and the customer. If the desired hardware can be acquired via an existing framework contract or a framework agreement that includes the customer's institution in its ambit, no procurement process is needed. Otherwise LRZ supports the customer, as far as necessary, in preparing procurement documents. Particular weight is placed on easy integration of the target hardware into LRZ's operational concept. In order to keep the hardware profile as uniform as possible, LRZ usually prescribes vendor and type of the target hardware. Exceptions require special justification and involve negotiation of special housing rates.
  • Inasmuch as a vendor delivers and installs a customized software stack or system software (usually for non-standard systems), the customer is obliged to contract a maintenance agreement for all software components with the vendor as a prerequisite for operating such a system in LRZ's facilities.
  • The deployed hardware must support a facility for remote reset of all systems, as well as extraction of CPU temperatures and fan speeds for all systems, running the Linux operating environment.
  • A sufficient number of power distribution units (PDUs), capable of reading out current power consumption and permitting remote management, must be part of the delivery.
  • The hardware acquisition shall include all necessary management components needed for installation, operation and surveillance (e.g. management switches, network cables).
  • All hardware shall be delivered with a warranty period - including on-site service - of 3 years. After expiration of the warranty, the customer is obliged to facilitate necessary repair measures by contracting extended hardware maintenance. Procurement of replacement parts for hardware that is out of warranty is not LRZ's responsibility and therefore must be done by the customer.
  • The following alternatives are available for cooling the housed systems: either cold-water cooling at rack level or direct (system-level) warm-water cooling. The latter requires modification of the system boards and the possibility to connect the system to LRZ's cooling circuits; acquisition of such systems therefore needs to be done by LRZ. Furthermore, the customer will be charged pro rata for usage of rack and chassis space as part of the investment cost.

Supported Operating Environments

  • SLES for x86_64 architecture
  • The deployed release level depends on the availability of support by SuSE; it may occasionally change. LRZ will announce any changes to the customer in good time, because the customer will usually need to perform maintenance of any applications implemented on the housed systems.

Security

  • Configuration of the cluster firewall is agreed upon with the customer's institution. Deployment of a firewall is obligatory.
  • On login nodes, necessary updates will be continually deployed, possibly with short operational interruptions. Compute nodes that are not directly accessible from the outside world will be updated during scheduled maintenance periods.

Operational Concept

  • Operation of housed cluster nodes is aligned with that of the LRZ cluster with respect to user management, batch scheduling and user access.
  • Housed systems shall be integrated with the existing clusters.

Incidents

  • Operational incidents shall be reported via LRZ's Service Desk. They shall be assigned the service classification "High Performance Computing - Attended Cluster Node Housing".
  • For incidents that provably result from issues in domain-specific application programs installed into the operating environment, LRZ reserves the right to refer analysis and resolution back to the customer. The same applies to incidents pertaining to non-commercial applications from the LRZ software stack, if LRZ sees no chance of successful resolution due to lack of know-how or a disproportionate amount of time needed.
  • Incidents raised against systems without a valid hardware maintenance contract will only be processed by LRZ if the effort needed for resolution is considered adequate. Evaluation of this is done at LRZ's discretion.

Retirement from operation

  • LRZ reserves the right to retire from operation non-standard systems whose maintenance contract has expired.
  • Retired systems must be collected from the LRZ facilities by the customer within 8 weeks; professional disposal is incumbent on the customer, unless an appropriate arrangement has been contractually agreed with the system vendor.

Maintenance interruptions lasting 2-5 days are scheduled once or twice per year. These are announced at least 14 calendar days in advance.

The availability target for the service is 95%. Scheduled maintenance intervals are not counted.
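
As a rough illustration of what a 95% target implies, the following sketch converts it into an annual budget of unscheduled downtime. It assumes a 365-day year, and it assumes that scheduled maintenance days are removed from the reference period before the percentage is applied; the exact accounting is not spelled out in this description, and the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def downtime_budget_hours(target_percent: int,
                          scheduled_maintenance_days: int = 0) -> float:
    """Hours of unscheduled downtime compatible with the availability target.

    Assumption: scheduled maintenance is subtracted from the reference
    period and is not counted as downtime.
    """
    reference_hours = HOURS_PER_YEAR - scheduled_maintenance_days * 24
    return reference_hours * (100 - target_percent) / 100


if __name__ == "__main__":
    print(downtime_budget_hours(95))      # full year as reference period
    print(downtime_budget_hours(95, 10))  # two 5-day maintenance windows excluded
```

Under these assumptions the budget is 438.0 hours for a full year, and 426.0 hours when two five-day maintenance windows are excluded from the reference period.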

Requirements

A housing contract with LRZ must be established. 

User Guidelines

The service-specific guidelines for the use of the MWN (https://www.lrz.de/wir/regelwerk/ - therein guidelines in the network area), the archive and backup system (ABS, https://doku.lrz.de/benutzungsrichtlinien-11475999.html) and the online (https://doku.lrz.de/cloud-storage-richtlinien-zur-nutzung-11476144.html) or DSS storage (https://doku.lrz.de/dss-terms-and-conditions-11476130.html) must be observed.

The policies for usage of file systems and tape archive are documented at https://www.lrz.de/wir/regelwerk/richtlinien_filesysteme_HPC/

Further details or deviations from the standard offering are, where necessary, covered by an explicit service level agreement (SLA).

Liability regulations

The following liability regulations apply automatically to all contracts in the area of "Attended Cluster Housing":

  1. The contracting parties shall be liable for intent and negligence in the event of a breach of material contractual obligations, i.e. obligations that make the proper execution of the contract possible in the first place and on the observance of which the other contracting party may regularly rely, but in the case of simple negligence limited to the foreseeable damage typical for the contract.
  2. Otherwise, liability shall be limited to intent and gross negligence.
  3. Liability for consequential and financial losses shall be excluded.
  4. Limitations and exclusions of liability shall not apply to damages resulting from injury to life, limb or health or to claims under the Product Liability Act.

Procurement guidelines

Please note the following deadlines for your enquiries and procurement:

  • 31.03. of the current year
  • 30.06. of the current year
  • 15.09. of the current year

Housing enquiries for the current year can no longer be accepted after the 3rd deadline. The coordination, quotation and procurement cycle takes time, so requests after this deadline can only be processed in the following year.
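
The deadline rule can be sketched as a small lookup. The example below returns the next deadline that an enquiry submitted on a given date can still meet, or None if the enquiry falls after the third deadline and can only be processed in the following year. The function name is hypothetical, and treating an enquiry submitted on the deadline day itself as still in time is an assumption.

```python
from datetime import date
from typing import Optional

# Deadlines from the list above, as (month, day) within the current year.
DEADLINES = [(3, 31), (6, 30), (9, 15)]


def next_deadline(enquiry: date) -> Optional[date]:
    """Return the next deadline on or after the enquiry date, or None.

    Assumption: an enquiry on the deadline day itself is still in time.
    """
    for month, day in DEADLINES:
        deadline = date(enquiry.year, month, day)
        if enquiry <= deadline:
            return deadline
    return None  # after the third deadline: following year


if __name__ == "__main__":
    print(next_deadline(date(2025, 2, 1)))
    print(next_deadline(date(2025, 9, 16)))
```

For example, an enquiry dated 1 February is matched to the 31.03. deadline, while one dated 16 September yields None.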

User / Customers

This service is made available to the following user classes. The following fees are to be paid by the individual user classes:

User Class | Cost Rate
1 | Own Costs (operating + investment costs)
2 | Own Costs (operating + investment costs)
3 | Own Costs (operating + investment costs)
4 | Not Available
5 | Not Available
6 | Not Available

Fees

The table provides an overview of the cost rates incurred for housing cluster nodes at the LRZ. The categories are defined as follows:

  • Procurement: one-off cost rates for the purchase of new components. A three-year warranty is included.
  • Energy: annual cost rates for electricity and cooling. Based on consumption measurements. The climate factor (cooling surcharge) is specified depending on the cooling technology used.
  • Operation: annual cost rates for operation and other infrastructure.

Where necessary, all cost rates include the statutory VAT rate of 19%.

Notes

  • As the framework agreements for the procurement of hardware do not specify fixed prices for their entire duration (mainly due to exchange rate fluctuations against the US dollar), maximum prices are entered for the third quarter of 2024. An offer must be obtained for specific procurements.
  • GPFS node licences are only counted for existing systems (installed before 2021). For newer systems, a disc-based licence model is used, which is mapped in the cost model for DSS usage where necessary.

Category | Description | Unit | Cost Rate | Remarks
Procurement | Integration in Net and Management | per GPU Node | On Request | Pro rata costs for switches, uplinks, management switches, installation and acceptance
Procurement | Integration in Net and Management | per CPU Node | On Request | Pro rata costs for switches, uplinks, management switches, installation and acceptance
Procurement | General Purpose Computing Systems | one compute node (2 sockets, HDR 100 GBit/s Infiniband, direct hot-water cooled). Minimum equipment: 56 cores, 256 GByte main memory per blade. Even numbers must be procured. | On Request | Depending on the memory configuration, processor type and other hardware equipment
Procurement | Accelerated Computing System | one accelerated node (2 sockets, 4 accelerators NVidia H100 or Intel Ponte Vecchio, HDR 200 GBit/s Infiniband, direct hot-water cooled) | - | There is currently no framework agreement.
Procurement | Rack Share | per Accelerated Compute Node | On Request | The racks for water-cooled systems are procured in advance by the LRZ and therefore invoiced separately.
Procurement | Rack Share | per CPU Compute Node | On Request | The racks for water-cooled systems are procured in advance by the LRZ and therefore invoiced separately.
Procurement | Initial Installation | Compute Node | On Request | Work on physical and logical integration into the operating environment
Procurement | Initial Installation | Accelerated Node | On Request | Work on physical and logical integration into the operating environment
Energy | Electricity and Cooling | per kW average output and year | On Request | Cost rate for air-cooled systems that are installed in cold water-cooled racks. Climate factor: 1:3
Energy | Electricity and Cooling | per kW average output and year | On Request | Cost rate for direct water-cooled systems. Climate factor: 1:1
Operation | Administration Fees (1) | per year and CPU Compute Node | On Request | Standard set
Operation | Administration Fees (2) | per year and GPU Compute Node | On Request | Also applies to special systems that cannot be integrated into the regular cluster infrastructure.
Operation | Operating System | SLES licence for 2-Socket Node | On Request | Significant price increase due to SuSE's changed licence model

 

Note: Please refer to the ‘HPC software and programming support service option’ for the fee rate for exceptional application support.