EE-HPC

Energy efficient High Performance Computing

EE-HPC is testing an approach for improving energy efficiency in HPC systems by automatically regulating system parameters and settings based on current job requirements.

Energy usage by high-performance computing (HPC) centers is a deciding factor in the procurement and operation of HPC systems. Indeed, the cost of energy over the life cycle of an HPC system constitutes a substantial part of its overall cost. Even within a comprehensive analysis of resource consumption, energy usage is the dominant factor.

One strategy used in some large Tier 0/1 HPC centers to regulate the energy consumption of the complete system involves limiting the energy usage of applications. This approach focuses mainly on taking relatively simple measures, such as limiting the CPU frequency or turning off whole compute nodes.

Modern systems, however, offer a growing number of options that hold high energy savings potential. For example, adjusting system parameters and settings in the runtime environments of OpenMP and MPI can achieve performance improvements that lead to more efficient energy usage. The range of possibilities for optimization extends further to include comparing global load balances and optimizing collective operations in MPI. Nevertheless, determining the optimal settings can be difficult, particularly with respect to HPC systems that run highly diverse applications, where setting global parameters is often not desirable.
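As a rough illustration of why per-job rather than global settings matter, the sketch below builds a separate set of OpenMP runtime settings for each job. `OMP_NUM_THREADS`, `OMP_PROC_BIND`, and `OMP_WAIT_POLICY` are standard OpenMP environment variables. The profile names, the values chosen, and the helper `job_environment` are assumptions made for this example and are not part of the EE-HPC software.

```python
import os

# Hypothetical per-job profiles. Names and values are illustrative only;
# suitable settings depend on the application and the hardware.
JOB_PROFILES = {
    "compute_bound": {
        "OMP_NUM_THREADS": "64",
        "OMP_PROC_BIND": "close",
        "OMP_WAIT_POLICY": "active",   # waiting threads spin
    },
    "communication_bound": {
        "OMP_NUM_THREADS": "16",
        "OMP_PROC_BIND": "spread",
        "OMP_WAIT_POLICY": "passive",  # waiting threads yield instead of spinning
    },
}


def job_environment(profile_name, base_env=None):
    """Return a copy of the base environment with one job's settings applied.

    The base environment is left untouched, so no setting leaks into
    other jobs that share the same system.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update(JOB_PROFILES[profile_name])
    return env


env = job_environment("communication_bound", base_env={"PATH": "/usr/bin"})
print(env["OMP_WAIT_POLICY"])
```

The resulting dictionary could be passed to a job launcher, for example through the `env` argument of `subprocess.run`, so that each job receives its own settings.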

The goal of EE-HPC is to improve the overall energy efficiency of HPC centers by optimally adjusting system parameters (not only regarding CPUs but also memory, input/output (I/O), and network parameters) that influence energy usage, based on the jobs and job phases that are running at any particular time. This approach involves regulating and optimizing such parameters in a comprehensive and transparent manner. The project will deliver an open source production environment for job-specific performance and energy modeling, including a method for optimizing and controlling runtime and system parameters.

The composition of the consortium (Tier 0/1, Tier 2, and the DKRZ as a central national service provider), as well as its networking with project partners in the Gauss Centre for Supercomputing (GCS), the NHR Alliance, and Tier 3 centers (HPC.NRW, KONWIHR, bwHPC), will ensure that the project results are used widely over the long term.

Runtime

1 September 2022 to 30 November 2025

Project achievements

The project has demonstrated that adjusting system parameters during job execution can lead to significant energy savings. In particular, adjusting the power cap can often substantially reduce energy-to-solution without a significant loss of application performance. However, the project has also shown that measuring application performance is not trivial. Simply counting instructions is often not sufficient, because many of those instructions do no meaningful work, for instance when a process busy-waits for MPI operations to complete. At HLRS, we have investigated using recurring events related to MPI usage as a more reliable measure for estimating application performance.
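The trade-off between power cap and runtime can be made concrete with a small calculation. The sketch below takes energy-to-solution as average power multiplied by runtime and selects the power cap with the lowest energy whose runtime stays within a tolerated slowdown. All numbers are illustrative and are not measurements from the project. The class `CapMeasurement`, the function `best_cap`, and the 5 % slowdown limit are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class CapMeasurement:
    """One hypothetical run of the same job under a given node power cap."""
    power_cap_w: float  # configured power cap per node, in watts
    avg_power_w: float  # measured average node power, in watts
    runtime_s: float    # wall-clock time to solution, in seconds

    @property
    def energy_j(self) -> float:
        # Energy-to-solution = average power x runtime
        return self.avg_power_w * self.runtime_s


def best_cap(runs, baseline, max_slowdown=0.05):
    """Return the run with the lowest energy-to-solution whose runtime is
    at most (1 + max_slowdown) times the baseline runtime.

    Assumes the baseline itself is included in `runs`, so at least one
    run always satisfies the limit.
    """
    limit = baseline.runtime_s * (1.0 + max_slowdown)
    allowed = [r for r in runs if r.runtime_s <= limit]
    return min(allowed, key=lambda r: r.energy_j)


# Illustrative numbers only, not measurements from the project.
baseline = CapMeasurement(power_cap_w=500, avg_power_w=480, runtime_s=1000)
runs = [
    baseline,
    CapMeasurement(power_cap_w=400, avg_power_w=390, runtime_s=1020),
    CapMeasurement(power_cap_w=300, avg_power_w=295, runtime_s=1150),
]

choice = best_cap(runs, baseline)
saving = 1.0 - choice.energy_j / baseline.energy_j
print(f"cap={choice.power_cap_w} W, energy saving={saving:.1%}")
```

In this made-up data set the lowest cap uses the least energy but exceeds the slowdown limit, so the intermediate cap is selected. Runtime stands in for application performance here; as noted above, a reliable performance measure during execution is harder to obtain.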

Future objectives

EE-HPC has treated all compute nodes in a job as behaving the same with regard to their energy usage. While this is a good starting point, and sufficient for many applications, in a future project we would like to explore scenarios where nodes behave differently, for instance in the presence of dynamic load shifts or heterogeneous applications. In particular, we wish to investigate the potential of tuning system parameters on each node independently under the constraint of achieving overall energy reductions.

Project partners

Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
HLRS, Universität Stuttgart (HLRS)
RWTH Aachen University (RWTH)
Deutsches Klimarechenzentrum (DKRZ)
Hewlett Packard Enterprise (HPE)

Funding

Contact

Jose Gracia

Head, Scalable Programming Models and Tools

+49 711 685-87208 jose.gracia(at)hlrs.de