Publications (10 of 88)
Rivas Gomez, S., Markidis, S., Laure, E., Brabazon, K., Perks, O. & Narasimhamurthy, S. (2019). Decoupled Strategy for Imbalanced Workloads in MapReduce Frameworks. In: Proceedings - 20th International Conference on High Performance Computing and Communications, 16th International Conference on Smart City and 4th International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018. Paper presented at 20th International Conference on High Performance Computing and Communications, 16th IEEE International Conference on Smart City and 4th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018, 28 June 2018 through 30 June 2018 (pp. 921-927). Institute of Electrical and Electronics Engineers (IEEE)
2019 (English). In: Proceedings - 20th International Conference on High Performance Computing and Communications, 16th International Conference on Smart City and 4th International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018, Institute of Electrical and Electronics Engineers (IEEE), 2019, p. 921-927. Conference paper, Published paper (Refereed)
Abstract [en]

In this work, we consider the integration of MPI one-sided communication and non-blocking I/O in HPC-centric MapReduce frameworks. Using a decoupled strategy, we aim to overlap the Map and Reduce phases of the algorithm by allowing processes to communicate and synchronize using solely one-sided operations. Hence, we effectively increase the performance in situations where the workload per process becomes unexpectedly unbalanced. Using a Word-Count implementation and a large dataset from the Purdue MapReduce Benchmarks Suite (PUMA), we demonstrate that our approach can provide up to 23% performance improvement on average compared to a reference MapReduce implementation that uses state-of-the-art MPI collective communication and I/O.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2019
Keywords
High Performance Computing, MapReduce, MPI One Sided Communication
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-246358 (URN); 10.1109/HPCC/SmartCity/DSS.2018.00153 (DOI); 000468511200121 (); 2-s2.0-85062487109 (Scopus ID); 9781538666142 (ISBN)
Conference
20th International Conference on High Performance Computing and Communications, 16th IEEE International Conference on Smart City and 4th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018, 28 June 2018 through 30 June 2018
Note

QC 20190319

Available from: 2019-03-19. Created: 2019-03-19. Last updated: 2019-06-26. Bibliographically approved
Simmendinger, C., Iakymchuk, R., Cebamanos, L., Akhmetova, D., Bartsch, V., Rotaru, T., . . . Markidis, S. (2019). Interoperability strategies for GASPI and MPI in large-scale scientific applications. The international journal of high performance computing applications, 33(3), 554-568
2019 (English). In: The international journal of high performance computing applications, ISSN 1094-3420, E-ISSN 1741-2846, Vol. 33, no 3, p. 554-568. Article in journal (Refereed) Published
Abstract [en]

One of the main hurdles of partitioned global address space (PGAS) approaches is the dominance of message passing interface (MPI), which as a de facto standard appears in the code basis of many applications. To take advantage of the PGAS APIs like global address space programming interface (GASPI) without a major change in the code basis, interoperability between MPI and PGAS approaches needs to be ensured. In this article, we consider an interoperable GASPI/MPI implementation for the communication/performance crucial parts of the Ludwig and iPIC3D applications. To address the discovered performance limitations, we develop a novel strategy for significantly improved performance and interoperability between both APIs by leveraging GASPI shared windows and shared notifications. First results with a corresponding implementation in the MiniGhost proxy application and the Allreduce collective operation demonstrate the viability of this approach.

Place, publisher, year, edition, pages
SAGE PUBLICATIONS LTD, 2019
Keywords
Interoperability, GASPI, MPI, iPIC3D, Ludwig, MiniGhost, halo exchange, Allreduce
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-254034 (URN); 10.1177/1094342018808359 (DOI); 000468919900011 (); 2-s2.0-85059353725 (Scopus ID)
Note

QC 20190814

Available from: 2019-08-14. Created: 2019-08-14. Last updated: 2019-08-14. Bibliographically approved
Otero, E., Gong, J., Min, M., Fischer, P., Schlatter, P. & Laure, E. (2019). OpenACC acceleration for the PN-PN-2 algorithm in Nek5000. Journal of Parallel and Distributed Computing, 132, 69-78
2019 (English). In: Journal of Parallel and Distributed Computing, ISSN 0743-7315, E-ISSN 1096-0848, Vol. 132, p. 69-78. Article in journal (Refereed) Published
Abstract [en]

Due to its high performance and throughput capabilities, GPU-accelerated computing is becoming a popular technology in scientific computing, in particular using programming models such as CUDA and OpenACC. The main advantage with OpenACC is that it enables to simply port codes in their "original" form to GPU systems through compiler directives, thus allowing an incremental approach. An OpenACC implementation is applied to the CFD code Nek5000 for simulation of incompressible flows, based on the spectral-element method. The work follows up previous implementations and focuses now on the P-N-PN-2 method for the spatial discretization of the Navier-Stokes equations. Performance results of the ported code show a speed-up of up to 3.1 on multi-GPU for a polynomial order N > 11.

Place, publisher, year, edition, pages
Academic Press, 2019
Keywords
Nek5000; OpenACC; GPU programming; Spectral element method; High performance computing
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-253811 (URN); 10.1016/j.jpdc.2019.05.010 (DOI); 000476580400006 (); 2-s2.0-85066835225 (Scopus ID)
Funder
EU, Horizon 2020; Swedish e-Science Research Center; Swedish Foundation for Strategic Research
Note

QC 20190625

Available from: 2019-06-18. Created: 2019-06-18. Last updated: 2019-08-16. Bibliographically approved
Narasimhamurthy, S., Danilov, N., Wu, S., Umanesan, G., Markidis, S., Rivas-Gomez, S., . . . de Witt, S. (2019). SAGE: Percipient Storage for Exascale Data Centric Computing. Parallel Computing, 83, 22-33
2019 (English). In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 83, p. 22-33. Article in journal (Refereed) Published
Abstract [en]

We aim to implement a Big Data/Extreme Computing (BDEC) capable system infrastructure as we head towards the era of Exascale computing - termed SAGE (Percipient StorAGe for Exascale Data Centric Computing). The SAGE system will be capable of storing and processing immense volumes of data at the Exascale regime, and provide the capability for Exascale class applications to use such a storage infrastructure. SAGE addresses the increasing overlaps between Big Data Analysis and HPC in an era of next-generation data centric computing that has developed due to the proliferation of massive data sources, such as large, dispersed scientific instruments and sensors, whose data needs to be processed, analysed and integrated into simulations to derive scientific and innovative insights. Indeed, Exascale I/O, as a problem that has not been sufficiently dealt with for simulation codes, is appropriately addressed by the SAGE platform. The objective of this paper is to discuss the software architecture of the SAGE system and look at early results we have obtained employing some of its key methodologies, as the system continues to evolve.

Place, publisher, year, edition, pages
Elsevier, 2019
Keywords
SAGE architecture, Object storage, Mero, Clovis, PGAS I/O, MPI I/O, MPI streams
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-254119 (URN); 10.1016/j.parco.2018.03.002 (DOI); 000469898400003 (); 2-s2.0-85044917976 (Scopus ID)
Note

QC 20190624

Available from: 2019-06-24. Created: 2019-06-24. Last updated: 2019-06-24. Bibliographically approved
Thoman, P., Dichev, K., Heller, T., Iakymchuk, R., Aguilar, X., Hasanov, K., . . . Nikolopoulos, D. S. (2018). A taxonomy of task-based parallel programming technologies for high-performance computing. Journal of Supercomputing, 74(4), 1422-1434
2018 (English). In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 74, no 4, p. 1422-1434. Article in journal (Refereed) Published
Abstract [en]

Task-based programming models for shared memory, such as Cilk Plus and OpenMP 3, are well established and documented. However, with the increase in parallel, many-core, and heterogeneous systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, despite the fact that dozens of different task-based systems exist today and are actively used for parallel and high-performance computing (HPC), no comprehensive overview or classification of task-based technologies for HPC exists. In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms. We demonstrate the usefulness of our taxonomy by classifying state-of-the-art task-based environments in use today.

Place, publisher, year, edition, pages
SPRINGER, 2018
Keywords
High-performance computing, Task-based parallelism, Taxonomy, API, Runtime system, Scheduler, Monitoring framework, Fault tolerance
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-226199 (URN); 10.1007/s11227-018-2238-4 (DOI); 000428284000002 (); 2-s2.0-85041817729 (Scopus ID)
Note

QC 20180518

Available from: 2018-05-18. Created: 2018-05-18. Last updated: 2019-08-20. Bibliographically approved
Thoman, P., Hasanov, K., Dichev, K., Iakymchuk, R., Aguilar, X., Gschwandtner, P., . . . Fahringer, T. (2018). A Taxonomy of Task-Based Technologies for High-Performance Computing. In: Wyrzykowski, R., Dongarra, J., Deelman, E. & Karczewski, K. (Ed.), PARALLEL PROCESSING AND APPLIED MATHEMATICS (PPAM 2017), PT II. Paper presented at 12th International Conference on Parallel Processing and Applied Mathematics (PPAM), SEP 10-13, 2017, Lublin, POLAND (pp. 264-274). SPRINGER INTERNATIONAL PUBLISHING AG
2018 (English). In: PARALLEL PROCESSING AND APPLIED MATHEMATICS (PPAM 2017), PT II / [ed] Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K., SPRINGER INTERNATIONAL PUBLISHING AG, 2018, p. 264-274. Conference paper, Published paper (Refereed)
Abstract [en]

Task-based programming models for shared memory - such as Cilk Plus and OpenMP 3 - are well established and documented. However, with the increase in heterogeneous, many-core and parallel systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, despite the fact that dozens of different task-based systems exist today and are actively used for parallel and high-performance computing, no comprehensive overview or classification of task-based technologies for HPC exists. In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms. We demonstrate the usefulness of our taxonomy by classifying state-of-the-art task-based environments in use today.

Place, publisher, year, edition, pages
SPRINGER INTERNATIONAL PUBLISHING AG, 2018
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 10778
Keywords
Task-based parallelism, Taxonomy, API, Runtime system, Scheduler, Monitoring framework, Fault tolerance
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-245025 (URN); 10.1007/978-3-319-78054-2_25 (DOI); 000458563900025 (); 2-s2.0-85044764286 (Scopus ID); 978-3-319-78054-2 (ISBN)
Conference
12th International Conference on Parallel Processing and Applied Mathematics (PPAM), SEP 10-13, 2017, Lublin, POLAND
Note

QC 20190305

Available from: 2019-03-05. Created: 2019-03-05. Last updated: 2019-03-05. Bibliographically approved
Chien, S. W. D., Markidis, S., Sishtla, C. P., Santos, L., Herman, P., Narasimhamurthy, S. & Laure, E. (2018). Characterizing Deep-Learning I/O Workloads in TensorFlow. In: Proceedings of PDSW-DISCS 2018: 3rd Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis. Paper presented at 3rd IEEE/ACM Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, PDSW-DISCS 2018; Dallas; United States; 12 November 2018 (pp. 54-63). Institute of Electrical and Electronics Engineers (IEEE)
2018 (English). In: Proceedings of PDSW-DISCS 2018: 3rd Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis, Institute of Electrical and Electronics Engineers (IEEE), 2018, p. 54-63. Conference paper, Published paper (Refereed)
Abstract [en]

The performance of Deep-Learning (DL) computing frameworks rely on the performance of data ingestion and checkpointing. In fact, during the training, a considerable high number of relatively small files are first loaded and pre-processed on CPUs and then moved to accelerator for computation. In addition, checkpointing and restart operations are carried out to allow DL computing frameworks to restart quickly from a checkpoint. Because of this, I/O affects the performance of DL applications. In this work, we characterize the I/O performance and scaling of TensorFlow, an open-source programming framework developed by Google and specifically designed for solving DL problems. To measure TensorFlow I/O performance, we first design a micro-benchmark to measure TensorFlow reads, and then use a TensorFlow mini-application based on AlexNet to measure the performance cost of I/O and checkpointing in TensorFlow. To improve the checkpointing performance, we design and implement a burst buffer. We find that increasing the number of threads increases TensorFlow bandwidth by a maximum of 2.3x and 7.8x on our benchmark environments. The use of the TensorFlow prefetcher results in a complete overlap of computation on accelerator and input pipeline on CPU eliminating the effective cost of I/O on the overall performance. The use of a burst buffer to checkpoint to a fast small capacity storage and copy asynchronously the checkpoints to a slower large capacity storage resulted in a performance improvement of 2.6x with respect to checkpointing directly to slower storage on our benchmark environment.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2018
Keywords
Parallel I/O, Input Pipeline, Deep Learning, TensorFlow
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-248377 (URN); 10.1109/PDSW-DISCS.2018.00011 (DOI); 000462205000006 (); 2-s2.0-85063062239 (Scopus ID)
Conference
3rd IEEE/ACM Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, PDSW-DISCS 2018; Dallas; United States; 12 November 2018
Note

QC 20190405

Available from: 2019-04-05. Created: 2019-04-05. Last updated: 2019-04-05. Bibliographically approved
Peng, I. B., Gioiosa, R., Kestor, G., Vetter, J. S., Cicotti, P., Laure, E. & Markidis, S. (2018). Characterizing the performance benefit of hybrid memory system for HPC applications. Parallel Computing, 76, 57-69
2018 (English). In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 76, p. 57-69. Article in journal (Refereed) Published
Abstract [en]

Heterogenous memory systems that consist of multiple memory technologies are becoming common in high-performance computing environments. Modern processors and accelerators, such as the Intel Knights Landing (KNL) CPU and NVIDIA Volta GPU, feature small-size high-bandwidth memory near the compute cores and large-size normal-bandwidth memory that is connected off-chip. Theoretically, HBM can provide about four times higher bandwidth than conventional DRAM. However, many factors impact the actual performance improvement that an application can achieve on such system. In this paper, we focus on the Intel KNL system and identify the most important factors on the application performance, including the application memory access pattern, the problem size, the threading level and the actual memory configuration. We use a set of representative applications from both scientific and data-analytics domains. Our results show that applications with regular memory access benefit from MCDRAM, achieving up to three times performance when compared to the performance obtained using only DRAM. On the contrary, applications with irregular memory access pattern are latency-bound and may suffer from performance degradation when using only MCDRAM. Also, we provide memory-centric analysis of four applications, identify their major data objects, correlate their characteristics to the performance improvement on the testbed.

Place, publisher, year, edition, pages
Elsevier, 2018
Keywords
Heterogenous memory system, Intel Knights Landing (KNL) processor, MCDRAM, Memory-centric profiling
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-229249 (URN); 10.1016/j.parco.2018.04.007 (DOI); 000446404600005 (); 2-s2.0-85047113223 (Scopus ID)
Funder
EU, European Research Council
Note

QC 20180601

Available from: 2018-06-01. Created: 2018-06-01. Last updated: 2018-10-23. Bibliographically approved
Al Ahad, M. A., Simmendinger, C., Iakymchuk, R., Laure, E. & Markidis, S. (2018). Efficient Algorithms for Collective Operations with Notified Communication in Shared Windows. In: PROCEEDINGS OF PAW-ATM18: 2018 IEEE/ACM PARALLEL APPLICATIONS WORKSHOP, ALTERNATIVES TO MPI (PAW-ATM). Paper presented at 2018 IEEE/ACM PARALLEL APPLICATIONS WORKSHOP, ALTERNATIVES TO MPI (PAW-ATM) (pp. 1-10). IEEE
2018 (English). In: PROCEEDINGS OF PAW-ATM18: 2018 IEEE/ACM PARALLEL APPLICATIONS WORKSHOP, ALTERNATIVES TO MPI (PAW-ATM), IEEE, 2018, p. 1-10. Conference paper, Published paper (Refereed)
Abstract [en]

Collective operations are commonly used in various parts of scientific applications. Especially in strong scaling scenarios collective operations can negatively impact the overall application performance: while the load per rank here decreases with increasing core counts, time spent in e.g. barrier operations will increase logarithmically with the core count. In this article, we develop novel algorithmic solutions for collective operations such as Allreduce and Allgather(V) by leveraging notified communication in shared windows. To this end, we have developed an extension of GASPI which enables all ranks participating in a shared window to observe the entire notified communication targeted at the window. By exploring benefits of this extension, we deliver high performing implementations of Allreduce and Allgather(V) on Intel and Cray clusters. These implementations clearly achieve 2x-4x performance improvements compared to the best performing MPI implementations for various data distributions.

Place, publisher, year, edition, pages
IEEE, 2018
Keywords
Collectives, Allreduce, Allgather, AllgatherV, MPI, PGAS, GASPI, shared windows, shared notifications
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-249835 (URN); 10.1109/PAW-ATM.2018.00006 (DOI); 000462965600001 (); 2-s2.0-85063078028 (Scopus ID)
Conference
2018 IEEE/ACM PARALLEL APPLICATIONS WORKSHOP, ALTERNATIVES TO MPI (PAW-ATM)
Note

QC 20190423

Available from: 2019-04-23. Created: 2019-04-23. Last updated: 2019-04-23. Bibliographically approved
Ahmed, L., Georgiev, V., Capuccini, M., Toor, S., Schaal, W., Laure, E. & Spjuth, O. (2018). Efficient iterative virtual screening with Apache Spark and conformal prediction. Journal of Cheminformatics, 10, Article ID 8.
2018 (English). In: Journal of Cheminformatics, ISSN 1758-2946, E-ISSN 1758-2946, Vol. 10, article id 8. Article in journal (Refereed) Published
Abstract [en]

Background: Docking and scoring large libraries of ligands against target proteins forms the basis of structure-based virtual screening. The problem is trivially parallelizable, and calculations are generally carried out on computer clusters or on large workstations in a brute force manner, by docking and scoring all available ligands. Contribution: In this study we propose a strategy that is based on iteratively docking a set of ligands to form a training set, training a ligand-based model on this set, and predicting the remainder of the ligands to exclude those predicted as 'low-scoring' ligands. Then, another set of ligands are docked, the model is retrained and the process is repeated until a certain model efficiency level is reached. Thereafter, the remaining ligands are docked or excluded based on this model. We use SVM and conformal prediction to deliver valid prediction intervals for ranking the predicted ligands, and Apache Spark to parallelize both the docking and the modeling. Results: We show on 4 different targets that conformal prediction based virtual screening (CPVS) is able to reduce the number of docked molecules by 62.61% while retaining an accuracy for the top 30 hits of 94% on average and a speedup of 3.7. The implementation is available as open source via GitHub (https://github.com/laeeq80/spark-cpvs) and can be run on high-performance computers as well as on cloud resources.

Place, publisher, year, edition, pages
BioMed Central, 2018
Keywords
Virtual screening, Docking, Conformal prediction, Cloud computing, Apache Spark
National Category
Chemical Sciences; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-224683 (URN); 10.1186/s13321-018-0265-z (DOI); 000426699400001 (); 29492726 (PubMedID); 2-s2.0-85042857389 (Scopus ID)
Funder
Swedish e-Science Research Center; Swedish National Infrastructure for Computing (SNIC)
Note

QC 20180323

Available from: 2018-03-23. Created: 2018-03-23. Last updated: 2018-03-23. Bibliographically approved
Organisations
Identifiers
ORCID iD: orcid.org/0000-0002-9901-9857
