Schliephake, Michael
ORCID iD: orcid.org/0000-0002-5415-1248
Publications (10 of 13)
Gong, J., Markidis, S., Schliephake, M., Laure, E., Henningson, D., Schlatter, P., . . . Fischer, P. (2015). Nek5000 with OpenACC. In: Solving software challenges for exascale. Paper presented at 2nd International Conference on Exascale Applications and Software (EASC), APR 02-03, 2014, Stockholm, SWEDEN (pp. 57-68).
2015 (English). In: Solving software challenges for exascale, 2015, p. 57-68. Conference paper, Published paper (Refereed)
Abstract [en]

Nek5000 is a computational fluid dynamics code based on the spectral element method used for the simulation of incompressible flows. We follow up on an earlier study that ported a simplified version of Nek5000 to a GPU-accelerated system by presenting the hybrid CPU/GPU implementation of the full Nek5000 code using OpenACC. The matrix-matrix multiplication, the Nek5000 gather-scatter operator and a preconditioned Conjugate Gradient solver have been implemented using OpenACC for multi-GPU systems. We report a speed-up of 1.3 on a single node of a Cray XK6 when using OpenACC directives in Nek5000. On 512 nodes of the Titan supercomputer, the speed-up approaches 1.4. A performance analysis of the Nek5000 code using the Score-P and Vampir performance monitoring tools shows that overlapping GPU kernels with host-accelerator memory transfers would considerably increase the performance of the OpenACC version of the Nek5000 code.

Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 8759
Keywords
GPU programming, Nek5000, OpenACC, Spectral element method
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-170716 (URN); 10.1007/978-3-319-15976-8_4 (DOI); 000355749700004 (); 2-s2.0-84928882903 (Scopus ID); 978-3-319-15975-1 (ISBN); 978-3-319-15976-8 (ISBN)
Conference
2nd International Conference on Exascale Applications and Software (EASC), APR 02-03, 2014, Stockholm, SWEDEN
Note

QC 20150706

Available from: 2015-07-06 Created: 2015-07-03 Last updated: 2018-01-11. Bibliographically approved
Markidis, S., Gong, J., Schliephake, M., Laure, E., Hart, A., Henty, D., . . . Fischer, P. (2015). OpenACC acceleration of the Nek5000 spectral element code. The international journal of high performance computing applications, 29(3), 311-319
2015 (English). In: The international journal of high performance computing applications, ISSN 1094-3420, E-ISSN 1741-2846, Vol. 29, no 3, p. 311-319. Article in journal (Refereed) Published
Abstract [en]

We present a case study of porting NekBone, a skeleton version of the Nek5000 code, to a parallel GPU-accelerated system. Nek5000 is a computational fluid dynamics code based on the spectral element method used for the simulation of incompressible flow. The original NekBone Fortran source code has been used as the base and enhanced by OpenACC directives. The profiling of NekBone provided an assessment of the suitability of the code for GPU systems, and indicated possible kernel optimizations. Porting NekBone to GPU systems required little effort and a small number of additional lines of code (approximately one OpenACC directive per 1000 lines of code). The naïve implementation using OpenACC leads to little performance improvement: on a single node, from 16 Gflops obtained with the version without OpenACC, we reached 20 Gflops with the naïve OpenACC implementation. An optimized NekBone version reaches 43 Gflops on a single node. In addition, we ported NekBone to parallel GPU systems and optimized it, reaching a parallel efficiency of 79.9% on 1024 GPUs of the Titan XK7 supercomputer at the Oak Ridge National Laboratory.

Place, publisher, year, edition, pages
Sage Publications, 2015
National Category
Computer Sciences Computational Mathematics
Identifiers
urn:nbn:se:kth:diva-171357 (URN); 10.1177/1094342015576846 (DOI); 000358414200006 (); 2-s2.0-84938095938 (Scopus ID)
Funder
Swedish e‐Science Research Center
Note

QC 20150804

Available from: 2015-07-27 Created: 2015-07-27 Last updated: 2018-01-11. Bibliographically approved
Schliephake, M. & Laure, E. (2015). Performance Analysis of Irregular Collective Communication with the Crystal Router Algorithm. In: Solving software challenges for exascale. Paper presented at 2nd International Conference on Exascale Applications and Software (EASC), APR 02-03, 2014, Stockholm, SWEDEN (pp. 130-140).
2015 (English). In: Solving software challenges for exascale, 2015, p. 130-140. Conference paper, Published paper (Refereed)
Abstract [en]

In order to achieve exascale performance it is important to detect potential bottlenecks and identify strategies to overcome them. For this, both applications and system software must be analysed and potentially improved. The EU FP7 project Collaborative Research into Exascale Systemware, Tools & Applications (CRESTA) chose to co-design advanced simulation applications and system software as well as development tools. In this paper, we present the results of a co-design activity focused on the simulation code NEK5000 that aims at performance improvements of collective communication operations. We have analysed the algorithms that form the core of NEK5000's communication module in order to assess its viability on recent computer architectures before starting to improve its performance. Our results show that the crystal router algorithm performs well in sparse, irregular collective operations for medium and large processor counts, but improvements will be needed for the even larger systems of the future. We sketch the needed improvements, which will also make the communication algorithms beneficial for other applications that need to implement latency-dominated communication schemes with short messages. The latency-optimised communication operations will also be used in a runtime system providing dynamic load balancing, under development within CRESTA.

Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 8759
Keywords
Collective operations, MPI, Performance tuning
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-170717 (URN); 10.1007/978-3-319-15976-8_10 (DOI); 000355749700010 (); 2-s2.0-84928920465 (Scopus ID); 978-3-319-15975-1 (ISBN); 978-3-319-15976-8 (ISBN)
Conference
2nd International Conference on Exascale Applications and Software (EASC), APR 02-03, 2014, Stockholm, SWEDEN
Note

QC 20150706

Available from: 2015-07-06 Created: 2015-07-03 Last updated: 2018-01-11. Bibliographically approved
Schliephake, M., Laure, E., Heisey, K. & Fischer, P. (2013). Design, implementation and use of mampicl, the multi-algorithm MPI collective library. Paper presented at Exascale Applications and Software Conference EASC 2013; Edinburgh, Scotland, UK.
2013 (English). Conference paper, Oral presentation with published abstract (Other academic)
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-139549 (URN)
Conference
Exascale Applications and Software Conference EASC 2013; Edinburgh, Scotland, UK
Note

QC 20140618

Available from: 2014-01-15 Created: 2014-01-15 Last updated: 2018-01-11. Bibliographically approved
Gong, J., Hart, A., Henty, D., Markidis, S., Schliephake, M., Fischer, P. & Heisey, K. (2013). OpenACC Acceleration of Nek5000: a Spectral Element Code. Paper presented at Exascale Applications and Software Conference; Edinburgh, Scotland, UK, 9-11 April 2013.
2013 (English). Conference paper, Oral presentation with published abstract (Other academic)
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-139546 (URN)
Conference
Exascale Applications and Software Conference; Edinburgh, Scotland, UK, 9-11 April 2013
Note

QC 20140130

Available from: 2014-01-15 Created: 2014-01-15 Last updated: 2018-01-11. Bibliographically approved
Markidis, S., Schliephake, M., Aguilar, X., Henty, D., Richardson, H., Hart, A., . . . Laure, E. (2013). Paving the path to exascale computing with CRESTA development environment. Paper presented at Exascale Software and Applications Conference.
2013 (English). Conference paper, Oral presentation with published abstract (Other academic)
Abstract [en]

The development and implementation of efficient computer codes for exascale supercomputers will require combined advancement of all development environment components: compilers, automatic tuning frameworks, run-time systems, debuggers and performance monitoring and analysis tools. The exascale era poses unprecedented challenges. Because accelerators are increasingly common among the fastest supercomputers and will play a role in exascale computing, compilers will need to support hybrid computer architectures and generate efficient code that hides the complexity of programming accelerators. Hand optimization of the code will be very difficult on exascale machines and will be increasingly assisted by automatic tuners. Application tuning will focus more on the parallel aspects of the computation because of the large amount of available parallelism. The application workload will be distributed over millions of processes, and implementing ad hoc strategies directly in the application will probably be infeasible, whereas an adaptive run-time system will provide automatic load balancing. Debuggers and performance monitoring tools will deal with millions of processes and with huge amounts of data from application and hardware counters, but they will still be required to minimize the overhead and retain scalability. In this talk, we present how the development environment of the CRESTA exascale EC project meets all these challenges by advancing the state of the art in the field.

An investigation of compiler support for hybrid GPU programming, the design concepts, and the main characteristics of the alpha prototype implementation of the CRESTA development environment components for exascale computing are presented. A performance study of OpenACC compiler directives has been carried out, showing very promising results and indicating OpenACC as a viable approach for programming hybrid exascale supercomputers. A new Domain-Specific Language (DSL) has been defined for the expression of parallel auto-tuning at very large scale. The focus is on extending the auto-tuning approach into the parallel domain to enable tuning of communication-related aspects of applications. A new adaptive run-time system has been designed to schedule processes depending on the resource availability, on the workload, and on the run-time analysis of the application performance. The Allinea DDT debugger and the Dresden University of Technology MUST MPI correctness checker are being extended to provide a unified interface, to improve scalability, and to include new disruptive technology based on statistical analysis of the run-time behavior of the application for anomaly detection. The new exascale prototypes of the Dresden University of Technology Vampir, VampirTrace and Score-P performance monitoring and analysis tools have been released. The new features include the possibility of applying filtering techniques before loading performance data to drastically reduce memory needs during the performance analysis. The initial evaluation study of the development environment targets the CRESTA project applications to determine how the development environment could be coupled into a production suite for exascale computing.

National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-139548 (URN)
Conference
Exascale Software and Applications Conference
Note

QC 20140624

Available from: 2014-01-15 Created: 2014-01-15 Last updated: 2018-01-11. Bibliographically approved
Aguilar, X., Schliephake, M., Vahtras, O., Gimenez, J. & Laure, E. (2013). Scalability analysis of Dalton, a molecular structure program. Future generations computer systems, 29(8), 2197-2204
2013 (English). In: Future generations computer systems, ISSN 0167-739X, E-ISSN 1872-7115, Vol. 29, no 8, p. 2197-2204. Article in journal (Refereed) Published
Abstract [en]

Dalton is a molecular electronic structure program featuring common methods of computational chemistry that are based on pure quantum mechanics (QM) as well as hybrid quantum mechanics/molecular mechanics (QM/MM). It is specialized and has a leading position in the calculation of molecular properties, with a large world-wide user community (over 2000 licenses issued). In this paper, we present a performance characterization and optimization of Dalton. We also propose a solution to prevent the master/worker design of Dalton from becoming a performance bottleneck for larger process counts. With these improvements we obtain speedups of 4x, increasing the parallel efficiency of the code and making it possible to run it on a much larger number of cores.

Keywords
Performance analysis, Optimization, Scalability
National Category
Computer Systems
Research subject
SRA - E-Science (SeRC)
Identifiers
urn:nbn:se:kth:diva-136200 (URN); 10.1016/j.future.2013.04.013 (DOI); 000326613400028 (); 2-s2.0-84886093468 (Scopus ID)
Funder
Swedish e‐Science Research Center
Note

QC 20131216

Available from: 2013-12-04 Created: 2013-12-04 Last updated: 2017-12-06. Bibliographically approved
Schliephake, M. & Laure, E. (2012). Communication Performance Analysis of CRESTA’s Co-Design Application NEK5000. In: Workshop Preparing Applications for Exascale Through Co-design in International Conference on High Performance Computing. Paper presented at International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12.
2012 (English). In: Workshop Preparing Applications for Exascale Through Co-design in International Conference on High Performance Computing, 2012. Conference paper, Published paper (Refereed)
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-116369 (URN)
Conference
International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12
Funder
EU, FP7, Seventh Framework Programme; Swedish e‐Science Research Center
Note

QC 20130118

Available from: 2013-01-17 Created: 2013-01-17 Last updated: 2018-01-11. Bibliographically approved
Schliephake, M. & Laure, E. (2012). Towards improving the communication performance of CRESTA's co-design application NEK5000. In: Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012. Paper presented at 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012; Salt Lake City, UT; United States; 10 November 2012 through 16 November 2012 (pp. 669-674). IEEE
2012 (English). In: Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012, IEEE, 2012, p. 669-674. Conference paper, Published paper (Refereed)
Abstract [en]

In order to achieve exascale performance, all aspects of applications and system software need to be analysed and potentially improved. The EU FP7 project 'Collaborative Research into Exascale Systemware, Tools & Applications' (CRESTA) uses co-design of advanced simulation applications and system software as well as related development tools as a key element in its approach towards exascale. In this paper we present the first results of a co-design activity using the highly scalable application NEK5000. We have analysed the communication structure of NEK5000 and propose new, optimised collective communication operations that will improve the performance of NEK5000 and prepare it for use on the several million cores available in future HPC systems. The latency-optimised communication operations can also be beneficial in other contexts; for instance, we expect them to become an important building block for a runtime system providing dynamic load balancing, also under development within CRESTA.

Place, publisher, year, edition, pages
IEEE, 2012
Keywords
collective communication operations, CRESTA, exascale, MPI, NEK5000
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-128977 (URN); 10.1109/SC.Companion.2012.92 (DOI); 000320824300082 (); 2-s2.0-84876517304 (Scopus ID); 978-076954956-9 (ISBN)
Conference
2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012; Salt Lake City, UT; United States; 10 November 2012 through 16 November 2012
Funder
Swedish e‐Science Research Center
Note

QC 20130918

Available from: 2013-09-18 Created: 2013-09-17 Last updated: 2018-01-11. Bibliographically approved
Schliephake, M., Aguilar, X. & Laure, E. (2011). Design and Implementation of a Runtime System for Parallel Numerical Simulations on Large-Scale Clusters. In: Sato, M; Matsuoka, S; Sloot, PMA; VanAlbada, GD; Dongarra, J (Ed.), Proceedings Of The International Conference On Computational Science (ICCS). Paper presented at 11th International Conference on Computational Science, ICCS 2011. Singapore. 1 June 2011 - 3 June 2011 (pp. 2105-2114). Elsevier, 4
2011 (English). In: Proceedings Of The International Conference On Computational Science (ICCS) / [ed] Sato, M; Matsuoka, S; Sloot, PMA; VanAlbada, GD; Dongarra, J, Elsevier, 2011, Vol. 4, p. 2105-2114. Conference paper, Published paper (Refereed)
Abstract [en]

The execution of scientific codes will introduce a number of new challenges and intensify some old ones on new high-performance computing infrastructures. Petascale computers are large systems with complex designs using heterogeneous technologies that make the programming and porting of applications difficult, particularly if one wants to use the maximum peak performance of the system. In this paper we present the design and first prototype of a runtime system for parallel numerical simulations on large-scale systems. The proposed runtime system addresses the challenges of performance, scalability, and programmability of large-scale HPC systems. We also present initial results of our prototype implementation using a molecular dynamics application kernel.

Place, publisher, year, edition, pages
Elsevier, 2011
Series
Procedia Computer Science, ISSN 1877-0509 ; 4
Keywords
Hybrid computational methods, Parallel computing, Advanced computing architectures, Runtime systems
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-38886 (URN); 10.1016/j.procs.2011.04.230 (DOI); 000299165200229 (); 2-s2.0-79958278307 (Scopus ID)
Conference
11th International Conference on Computational Science, ICCS 2011. Singapore. 1 June 2011 - 3 June 2011
Funder
Swedish e‐Science Research Center
Note

QC 20120110

Available from: 2011-09-02 Created: 2011-09-02 Last updated: 2018-01-12. Bibliographically approved