Recent years have seen the emergence of two independent programming models challenging the traditional two-tier combination of message passing and thread-level work-sharing: partitioned global address space (PGAS) and task-based concurrency. In the PGAS programming model, synchronization and communication between processes are decoupled, providing significant potential for reducing communication overhead. At the same time, task-based programming makes it possible to exploit a large degree of shared-memory concurrency. The inherent lack of fine-grained synchronization in PGAS can be addressed through fine-grained task synchronization across process boundaries. In this work, we propose the use of task data dependencies describing the data-flow in the global address space to synchronize the execution of tasks created in parallel on multiple processes. We present a description of the global data dependencies, describe the necessary interactions between the distributed scheduler instances required to handle them, and discuss our implementation in the context of the DASH PGAS framework. We evaluate our approach using the Blocked Cholesky Factorization and the LULESH proxy app, demonstrating its feasibility and scalability.
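The shared-memory building block that this work extends to the global address space is data-flow task synchronization as found in OpenMP's depend clause. The sketch below illustrates only that local analogue in plain OpenMP, not the DASH tasking API itself: one producer task and two consumer tasks are ordered by the runtime purely through their declared dependencies, with no explicit synchronization between the tasks.

```c
/* Local task data dependencies: tasks B and C each read the block
 * written by task A, so the runtime orders A before B and C without
 * any explicit task-to-task synchronization. Plain OpenMP, shown as
 * the shared-memory analogue of the global dependencies described
 * above; without OpenMP the pragmas are ignored and the tasks run
 * sequentially in program order, which preserves the data flow. */
#include <stddef.h>

void dataflow_step(double *block, size_t n, double *out1, double *out2) {
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: block[0])   /* A: produce the block */
        for (size_t i = 0; i < n; ++i)
            block[i] = (double)i;

        #pragma omp task depend(in: block[0])    /* B: consume it */
        { *out1 = block[0] + block[n - 1]; }

        #pragma omp task depend(in: block[0])    /* C: consume it, too */
        { *out2 = block[n - 1] - block[0]; }

        #pragma omp taskwait
    }
}
```

The contribution of the paper is to let such `in`/`out` annotations refer to memory owned by a different process, which requires the distributed scheduler instances to exchange dependency information across process boundaries.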
Schuchart, J., Kowalewski, R., Fuerlinger, K.: Recent Experiences in Using MPI-3 RMA in the DASH PGAS Runtime. In: Proceedings of Workshops of HPC Asia, pp. 21-30. Chiyoda, Tokyo (2018).
Schuchart, J., Tsugane, K., Gracia, J., Sato, M.: The Impact of Taskyield on the Design of Tasks Communicating Through MPI. In: de Supinski, B.R., Valero-Lara, P., Martorell, X., Mateo Bellido, S., and Labarta, J. (eds.) Evolving OpenMP for Evolving Architectures - 14th International Workshop on OpenMP, IWOMP 2018, Barcelona, Spain, September 26-28, 2018, Proceedings, pp. 3-17 (2018).
The OpenMP tasking directives promise to expose a higher degree of concurrency to the runtime than traditional worksharing constructs, which is especially useful for irregular applications. In combination with process-based parallelization such as MPI, the taskyield construct in OpenMP can become a crucial aspect as it helps to hide communication latencies by allowing a thread to execute other tasks while completion of the communication operation is pending. Unfortunately, the OpenMP standard provides only few guarantees on the characteristics of the taskyield operation. In this paper, we explore different potential implementations of taskyield and present a portable black-box tool for detecting the actual implementation used in existing OpenMP compilers/runtimes. Furthermore, we discuss the impact of the different taskyield implementations on the task design of the communication-heavy Blocked Cholesky Factorization and the difference in performance that can be observed, which we found to be over 20%.
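The latency-hiding pattern at issue can be sketched as follows. This is a minimal, self-contained sketch: the hypothetical `test_completion()` and the `op_done` flag stand in for `MPI_Test` polling a real `MPI_Request` from an `MPI_Isend`/`MPI_Irecv`, so the code runs without an MPI installation.

```c
/* Sketch of the taskyield pattern discussed above: a task waiting for
 * a pending (here: simulated) nonblocking operation yields to the
 * scheduler so its thread can execute other ready tasks meanwhile.
 * test_completion() is a stand-in for MPI_Test on a real request. */
#include <stdatomic.h>

static atomic_int op_done = 1;  /* simulated completion flag; real code
                                   would poll an MPI_Request instead */

static int test_completion(void) {
    return atomic_load(&op_done);
}

int communicate_and_yield(void) {
    int polls = 0;
    /* ... an MPI_Isend/MPI_Irecv would be issued here ... */
    while (!test_completion()) {
        ++polls;
        /* Let the thread pick up other tasks instead of busy-waiting.
         * What actually happens here (no-op, stack-based descent,
         * unrestricted switch, ...) differs between runtimes, which is
         * exactly the portability problem the paper examines. */
        #pragma omp taskyield
    }
    return polls;  /* 0 when the operation completed immediately */
}
```

Because the standard permits taskyield to be a no-op, a runtime may leave the thread spinning in this loop, while another runtime schedules useful work; the task granularity that performs best therefore depends on which behavior the black-box tool detects.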
Schuchart, J., Nachtmann, M., Gracia, J.: Patterns for OpenMP Task Data Dependency Overhead Measurements. In: de Supinski, B.R., Olivier, S.L., Terboven, C., Chapman, B.M., and Müller, M.S. (eds.) OpenMP: Scaling OpenMP for Exascale Performance and Portability - 13th International Workshop on OpenMP, IWOMP 2017, pp. 156-168 (2017).
Starting with version 4.0, the OpenMP standard has introduced data dependencies to provide a way for synchronizing the concurrent execution of tasks based on data-flow information. This indirect approach to fine-grained synchronization offers a convenient way of creating a task graph without having to explicitly synchronize individual tasks and can be used to parallelize both regular and irregular applications to expose a higher level of concurrency to the runtime system. However, the cost associated with task creation and management, including matching input and output dependencies, is a crucial factor in choosing the granularity of individual tasks, i.e., the amount of work to encapsulate in a task. In this work, we present a set of benchmarks designed to determine the overhead associated with dependency management and give an overview of the performance characteristics of a set of compilers widely used in parallel computing. We hope to provide application developers with a way to make informed decisions on the granularity of their tasks given the dependency patterns dictated by the algorithm. Our benchmark results show varying performance characteristics of different implementations that are both interesting and important to have in mind throughout the task design process.
Schuchart, J., Gerndt, M., Kjeldsberg, P.G., Lysaght, M., Horak, D., Riha, L., Gocht, A., Sourouri, M., Kumaraswamy, M., Chowdhury, A., Jahre, M., Diethelm, K., Bouizi, O., Mian, U.S., Kruzik, J., Sojka, R., Beseda, M., Kannan, V., Bendifallah, Z., Hackenberg, D., Nagel, W.E.: The READEX formalism for automatic tuning for energy efficiency. Computing, 1-19 (2017).
Energy efficiency is an important aspect of future exascale systems, mainly due to rising energy costs. Although high-performance computing (HPC) applications are compute centric, they still exhibit varying computational characteristics in different regions of the program, such as compute-, memory-, and I/O-bound code regions. Some of today's clusters already offer mechanisms to adjust the system to the resource requirements of an application, e.g., by controlling the CPU frequency. However, manually tuning for improved energy efficiency is a tedious and painstaking task that is often neglected by application developers. The European Union's Horizon 2020 project READEX (Runtime Exploitation of Application Dynamism for Energy-efficient eXascale computing) aims at developing a tools-aided approach for improved energy efficiency of current and future HPC applications. To reach this goal, the READEX project combines technologies from two ends of the compute spectrum, embedded systems and HPC, constituting a split design-time/runtime methodology. From the HPC domain, the Periscope Tuning Framework (PTF) is extended to perform dynamic auto-tuning of fine-grained application regions using the systems scenario methodology, which was originally developed for improving the energy efficiency of embedded systems. This paper introduces the concepts of the READEX project, its envisioned implementation, and preliminary results that demonstrate the feasibility of this approach.
Schuchart, J., Hackenberg, D., Schöne, R., Ilsche, T., Nagappan, R., Patterson, M.K.: The Shift from Processor Power Consumption to Performance Variations: Fundamental Implications at Scale. In: Proceedings of the 1st Workshop on Energy-Aware HPC (Ena-HPC), pp. 197-205 (2016).
Hackenberg, D., Schöne, R., Ilsche, T., Molka, D., Schuchart, J., Geyer, R.: An Energy Efficiency Feature Survey of the Intel Haswell Processor. In: International Parallel and Distributed Processing Symposium Workshop (IPDPSW), pp. 896-904 (2015).
Schuchart, J., Waurich, V., Flehmig, M., Walther, M., Nagel, W.E., Gubsch, I.: Exploiting Repeated Structures and Vectorization in Modelica. In: 11th International Modelica Conference, pp. 265-272 (2015).
Ilsche, T., Hackenberg, D., Graul, S., Schöne, R., Schuchart, J.: Power Measurements for Compute Nodes: Improving Sampling Rates, Granularity and Accuracy. In: Sixth International Green and Sustainable Computing Conference (IGSC), pp. 1-8 (2015).
Oleynik, Y., Gerndt, M., Schuchart, J., Kjeldsberg, P.G., Nagel, W.E.: Run-Time Exploitation of Application Dynamism for Energy-Efficient Exascale Computing (READEX). In: Plessl, C., Baz, D.E., Cong, G., Cardoso, J.M.P., Veiga, L., and Rauber, T. (eds.) Computational Science and Engineering (CSE), 2015 IEEE 18th International Conference on, pp. 347-350 (2015).
Wang, D., Xu, Y., Thornton, P., King, A., Steed, C., Gu, L., Schuchart, J.: A functional test platform for the Community Land Model. Environmental Modelling &amp; Software (2014).
Jana, S., Schuchart, J., Chapman, B.: Analysis of Energy and Performance of PGAS-based Data Access Patterns. In: Malony, A.D. and Hammond, J.R. (eds.) Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, pp. 15:1-15:10. New York, NY, USA (2014).
Ilsche, T., Schuchart, J., Schöne, R., Hackenberg, D.: Combining instrumentation and sampling for trace-based application performance analysis. In: Tools for High Performance Computing 2014, pp. 123-136 (2014).
Hackenberg, D., Ilsche, T., Schuchart, J., Schöne, R., Nagel, W.E., Simon, M., Georgiou, Y.: HDEEM: High Definition Energy Efficiency Monitoring. In: Proceedings of the 2nd International Workshop on Energy Efficient Supercomputing, pp. 1-10. Piscataway, NJ, USA (2014).
Ilsche, T., Schuchart, J., Cope, J., Kimpe, D., Jones, T., Knüpfer, A., Iskra, K., Ross, R., Nagel, W.E., Poole, S.: Enabling event tracing at leadership-class scale through I/O forwarding middleware. In: Epema, D.H.J., Kielmann, T., and Ripeanu, M. (eds.) Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing, pp. 49-60 (2012).
We present a practical lock-free shared data structure that efficiently implements the operations of a concurrent deque as well as a general doubly linked list. The implementation supports parallelism for disjoint accesses and uses atomic primitives which are available in modern computer systems. Previously known lock-free algorithms for doubly linked lists are either based on unavailable atomic synchronization primitives, implement only a subset of the functionality, or are not designed for disjoint accesses. Our algorithm requires only single-word compare-and-swap atomic primitives, supports fully dynamic list sizes, and also allows traversal through deleted nodes, thus avoiding unnecessary operation retries. We have performed an empirical study of our new algorithm on two different multiprocessor platforms. Results of the experiments performed under high contention show that the performance of our implementation scales linearly with the number of processors. Considering deque implementations and systems with low concurrency, the algorithm by Michael shows the best performance. However, as our algorithm is designed for disjoint accesses, it performs significantly better on systems with high concurrency and a non-uniform memory architecture.
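The single atomic primitive the algorithm relies on, single-word compare-and-swap (CAS), is exposed portably by C11 atomics. As a hedged illustration of how CAS drives a lock-free update, the sketch below shows a simple Treiber-style stack, the most basic structure built from that primitive; it is not the doubly linked list algorithm itself.

```c
/* Single-word CAS in C11, shown on a Treiber stack: the simplest
 * lock-free structure built from the same primitive the deque/list
 * algorithm above requires. Just the building block, not the paper's
 * algorithm. Note that a production-quality pop must also solve safe
 * memory reclamation (the ABA problem), which is one of the issues
 * the list algorithm's traversal through deleted nodes addresses. */
#include <stdatomic.h>
#include <stdlib.h>

struct node { int value; struct node *next; };

static _Atomic(struct node *) top = NULL;

void push(int v) {
    struct node *n = malloc(sizeof *n);
    n->value = v;
    struct node *old = atomic_load(&top);
    do {
        n->next = old;
        /* CAS fails (and reloads old) if another thread moved top */
    } while (!atomic_compare_exchange_weak(&top, &old, n));
}

int pop(int *out) {   /* returns 0 when the stack is empty */
    struct node *old = atomic_load(&top);
    while (old) {
        if (atomic_compare_exchange_weak(&top, &old, old->next)) {
            *out = old->value;
            free(old);   /* unsafe under real concurrency; see above */
            return 1;
        }
        /* on failure, old was reloaded with the current top; retry */
    }
    return 0;
}
```

The retry loop is the essence of lock-free progress: a failed CAS means some other thread's operation succeeded, so system-wide progress is guaranteed even though an individual thread may retry.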