Schuchart, J., Kowalewski, R., Fuerlinger, K.: Recent Experiences in Using MPI-3 RMA in the DASH PGAS Runtime. In: Proceedings of Workshops of HPC Asia, pp. 21--30, Chiyoda, Tokyo (2018).
Schuchart, J., Nachtmann, M., Gracia, J.: Patterns for OpenMP Task Data Dependency Overhead Measurements. In: de Supinski, B.R., Olivier, S.L., Terboven, C., Chapman, B.M., and Müller, M.S. (eds.) OpenMP: Scaling OpenMP for Exascale Performance and Portability -- 13th International Workshop on OpenMP, IWOMP 2017, pp. 156--168 (2017).
Starting with version 4.0, the OpenMP standard introduces data dependencies as a way to synchronize the concurrent execution of tasks based on dataflow information. This indirect approach to fine-grained synchronization offers a convenient way to create a task graph without explicitly synchronizing individual tasks, and can be used to parallelize both regular and irregular applications to expose a higher level of concurrency to the runtime system. However, the cost associated with task creation and management, including the matching of input and output dependencies, is a crucial factor in choosing the granularity of individual tasks, i.e., the amount of work to encapsulate in a task. In this work, we present a set of benchmarks designed to determine the overhead associated with dependency management and give an overview of the performance characteristics of a set of compilers widely used in parallel computing. We hope to provide application developers with a way to make informed decisions on the granularity of their tasks given the dependency patterns dictated by the algorithm. Our benchmark results show varying performance characteristics across implementations that are both interesting and important to keep in mind throughout the task design process.
Schuchart, J., Gerndt, M., Kjeldsberg, P.G., Lysaght, M., Horak, D., Riha, L., Gocht, A., Sourouri, M., Kumaraswamy, M., Chowdhury, A., Jahre, M., Diethelm, K., Bouizi, O., Mian, U.S., Kruzik, J., Sojka, R., Beseda, M., Kannan, V., Bendifallah, Z., Hackenberg, D., Nagel, W.E.: The READEX formalism for automatic tuning for energy efficiency. Computing, pp. 1--19 (2017).
Energy efficiency is an important aspect of future exascale systems, mainly due to rising energy costs. Although high-performance computing (HPC) applications are compute-centric, they still exhibit varying computational characteristics in different regions of the program, such as compute-, memory-, and I/O-bound code regions. Some of today's clusters already offer mechanisms to adjust the system to the resource requirements of an application, e.g., by controlling the CPU frequency. However, manually tuning for improved energy efficiency is a tedious and painstaking task that is often neglected by application developers. The European Union's Horizon 2020 project READEX (Runtime Exploitation of Application Dynamism for Energy-efficient eXascale computing) aims at developing a tools-aided approach to improving the energy efficiency of current and future HPC applications. To reach this goal, the READEX project combines technologies from two ends of the computing spectrum, embedded systems and HPC, constituting a split design-time/runtime methodology. From the HPC domain, the Periscope Tuning Framework (PTF) is extended to perform dynamic auto-tuning of fine-grained application regions using the systems scenario methodology, which was originally developed for improving energy efficiency in embedded systems. This paper introduces the concepts of the READEX project, its envisioned implementation, and preliminary results that demonstrate the feasibility of the approach.
Schuchart, J., Hackenberg, D., Schöne, R., Ilsche, T., Nagappan, R., Patterson, M.K.: The Shift from Processor Power Consumption to Performance Variations: Fundamental Implications at Scale. In: Proceedings of the 1st Workshop on Energy-Aware HPC (Ena-HPC), pp. 197--205 (2016).
Hackenberg, D., Schöne, R., Ilsche, T., Molka, D., Schuchart, J., Geyer, R.: An Energy Efficiency Feature Survey of the Intel Haswell Processor. In: International Parallel and Distributed Processing Symposium Workshop (IPDPSW), pp. 896--904 (2015).
Schuchart, J., Waurich, V., Flehmig, M., Walther, M., Nagel, W.E., Gubsch, I.: Exploiting Repeated Structures and Vectorization in Modelica. In: 11th International Modelica Conference, pp. 265--272 (2015).
Ilsche, T., Hackenberg, D., Graul, S., Schöne, R., Schuchart, J.: Power Measurements for Compute Nodes: Improving Sampling Rates, Granularity and Accuracy. In: Sixth International Green and Sustainable Computing Conference (IGSC), pp. 1--8 (2015).
Oleynik, Y., Gerndt, M., Schuchart, J., Kjeldsberg, P.G., Nagel, W.E.: Run-Time Exploitation of Application Dynamism for Energy-Efficient Exascale Computing (READEX). In: Plessl, C., Baz, D.E., Cong, G., Cardoso, J.M.P., Veiga, L., and Rauber, T. (eds.) 2015 IEEE 18th International Conference on Computational Science and Engineering (CSE), pp. 347--350 (2015).
Wang, D., Xu, Y., Thornton, P., King, A., Steed, C., Gu, L., Schuchart, J.: A functional test platform for the Community Land Model. Environmental Modelling & Software (2014).
Jana, S., Schuchart, J., Chapman, B.: Analysis of Energy and Performance of PGAS-based Data Access Patterns. In: Malony, A.D. and Hammond, J.R. (eds.) Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, pp. 15:1--15:10, New York, NY, USA (2014).
Ilsche, T., Schuchart, J., Schöne, R., Hackenberg, D.: Combining instrumentation and sampling for trace-based application performance analysis. In: Tools for High Performance Computing 2014, pp. 123--136 (2014).
Hackenberg, D., Ilsche, T., Schuchart, J., Schöne, R., Nagel, W.E., Simon, M., Georgiou, Y.: HDEEM: High Definition Energy Efficiency Monitoring. In: Proceedings of the 2nd International Workshop on Energy Efficient Supercomputing, pp. 1--10, Piscataway, NJ, USA (2014).
Ilsche, T., Schuchart, J., Cope, J., Kimpe, D., Jones, T., Knüpfer, A., Iskra, K., Ross, R., Nagel, W.E., Poole, S.: Enabling event tracing at leadership-class scale through I/O forwarding middleware. In: Epema, D.H.J., Kielmann, T., and Ripeanu, M. (eds.) Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, pp. 49--60 (2012).
We present a practical lock-free shared data structure that efficiently implements the operations of a concurrent deque as well as a general doubly linked list. The implementation supports parallelism for disjoint accesses and uses atomic primitives that are available in modern computer systems. Previously known lock-free algorithms for doubly linked lists are either based on unavailable atomic synchronization primitives, implement only a subset of the functionality, or are not designed for disjoint accesses. Our algorithm requires only single-word compare-and-swap atomic primitives, supports fully dynamic list sizes, and also allows traversal through deleted nodes, thus avoiding unnecessary operation retries. We have performed an empirical study of our new algorithm on two different multiprocessor platforms. Results of the experiments performed under high contention show that the performance of our implementation scales linearly with an increasing number of processors. Considering deque implementations and systems with low concurrency, the algorithm by Michael shows the best performance. However, as our algorithm is designed for disjoint accesses, it performs significantly better on systems with high concurrency and non-uniform memory architectures.
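The abstract above builds its deque on single-word compare-and-swap. As a point of reference for readers unfamiliar with that primitive, the sketch below shows it in the much simpler setting of a Treiber-style lock-free stack using C11 `stdatomic.h`; this is only an illustration of the CAS retry pattern, assuming a C11 compiler, and is not the doubly linked list algorithm described in the abstract.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

/* Shared stack top, updated only through single-word CAS. */
static _Atomic(struct node *) head = NULL;

/* Lock-free push: retry the CAS until the head has not changed
 * between reading it and swinging it to the new node. */
static void push(int value) {
    struct node *n = malloc(sizeof *n);
    n->value = value;
    n->next = atomic_load(&head);
    while (!atomic_compare_exchange_weak(&head, &n->next, n)) {
        /* On failure, n->next was refreshed to the current head; retry. */
    }
}

/* Lock-free pop: returns 1 and stores the value, or 0 if the stack
 * is empty. On CAS failure, old is refreshed to the current head. */
static int pop(int *out) {
    struct node *old = atomic_load(&head);
    while (old && !atomic_compare_exchange_weak(&head, &old, old->next))
        ;
    if (!old)
        return 0;
    *out = old->value;
    free(old);
    return 1;
}
```

The deque and doubly linked list of the paper are far more involved (two link words per node, deletion marks, helping), but every update ultimately reduces to the same single-word CAS retry loop shown here, which is what makes the algorithm portable to commodity hardware.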