Publications (10 of 85)
Rivas Gomez, S., Markidis, S., Laure, E., Brabazon, K., Perks, O. & Narasimhamurthy, S. (2019). Decoupled Strategy for Imbalanced Workloads in MapReduce Frameworks. In: Proceedings - 20th International Conference on High Performance Computing and Communications, 16th International Conference on Smart City and 4th International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018: . Paper presented at 20th International Conference on High Performance Computing and Communications, 16th IEEE International Conference on Smart City and 4th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018, 28 June 2018 through 30 June 2018 (pp. 921-927). Institute of Electrical and Electronics Engineers (IEEE)
Decoupled Strategy for Imbalanced Workloads in MapReduce Frameworks
2019 (English). In: Proceedings - 20th International Conference on High Performance Computing and Communications, 16th International Conference on Smart City and 4th International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018, Institute of Electrical and Electronics Engineers (IEEE), 2019, p. 921-927. Conference paper, Published paper (Refereed)
Abstract [en]

In this work, we consider the integration of MPI one-sided communication and non-blocking I/O in HPC-centric MapReduce frameworks. Using a decoupled strategy, we aim to overlap the Map and Reduce phases of the algorithm by allowing processes to communicate and synchronize using solely one-sided operations. Hence, we effectively increase the performance in situations where the workload per process becomes unexpectedly unbalanced. Using a Word-Count implementation and a large dataset from the Purdue MapReduce Benchmarks Suite (PUMA), we demonstrate that our approach can provide up to 23% performance improvement on average compared to a reference MapReduce implementation that uses state-of-the-art MPI collective communication and I/O.
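
For readers unfamiliar with the MPI one-sided (RMA) model that this decoupled strategy relies on, the following is a minimal mpi4py sketch of a passive-target put: an origin rank deposits a partial result into a window exposed by another rank without any matching receive. The rank roles, buffer layout, and values are illustrative assumptions, not the paper's implementation.

```python
# Minimal illustration of MPI one-sided (RMA) communication with mpi4py.
# A "reducer" rank exposes a window; other ranks push partial counts into it
# without any matching receive or collective call on the target.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

counts = np.zeros(comm.Get_size(), dtype='i')              # one slot per rank
win = MPI.Win.Create(counts, counts.itemsize, comm=comm)   # expose local buffer

target = 0  # rank 0 stands in for the reducer
if rank != target:
    partial = np.array([rank * 10], dtype='i')             # stand-in partial count
    win.Lock(target, MPI.LOCK_SHARED)                      # passive-target epoch
    win.Put(partial, target, target=[rank, 1, MPI.INT])    # write into slot `rank`
    win.Unlock(target)

comm.Barrier()                                             # all puts are complete
if rank == target:
    print("partial counts received:", counts)
win.Free()
```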

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2019
Keywords
High Performance Computing, MapReduce, MPI One Sided Communication
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-246358 (URN); 10.1109/HPCC/SmartCity/DSS.2018.00153 (DOI); 000468511200121 (); 2-s2.0-85062487109 (Scopus ID); 9781538666142 (ISBN)
Conference
20th International Conference on High Performance Computing and Communications, 16th IEEE International Conference on Smart City and 4th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018, 28 June 2018 through 30 June 2018
Note

QC 20190319

Available from: 2019-03-19. Created: 2019-03-19. Last updated: 2019-06-26. Bibliographically approved
Narasimhamurthy, S., Danilov, N., Wu, S., Umanesan, G., Markidis, S., Rivas-Gomez, S., . . . de Witt, S. (2019). SAGE: Percipient Storage for Exascale Data Centric Computing. Parallel Computing, 83, 22-33
SAGE: Percipient Storage for Exascale Data Centric Computing
2019 (English). In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 83, p. 22-33. Article in journal (Refereed), Published
Abstract [en]

We aim to implement a Big Data/Extreme Computing (BDEC) capable system infrastructure as we head towards the era of Exascale computing - termed SAGE (Percipient StorAGe for Exascale Data Centric Computing). The SAGE system will be capable of storing and processing immense volumes of data at the Exascale regime, and provide the capability for Exascale class applications to use such a storage infrastructure. SAGE addresses the increasing overlaps between Big Data Analysis and HPC in an era of next-generation data centric computing that has developed due to the proliferation of massive data sources, such as large, dispersed scientific instruments and sensors, whose data needs to be processed, analysed and integrated into simulations to derive scientific and innovative insights. Indeed, Exascale I/O, as a problem that has not been sufficiently dealt with for simulation codes, is appropriately addressed by the SAGE platform. The objective of this paper is to discuss the software architecture of the SAGE system and look at early results we have obtained employing some of its key methodologies, as the system continues to evolve.

Place, publisher, year, edition, pages
Elsevier, 2019
Keywords
SAGE architecture, Object storage, Mero, Clovis, PGAS I/O, MPI I/O, MPI streams
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-254119 (URN); 10.1016/j.parco.2018.03.002 (DOI); 000469898400003 (); 2-s2.0-85044917976 (Scopus ID)
Note

QC 20190624

Available from: 2019-06-24. Created: 2019-06-24. Last updated: 2019-06-24. Bibliographically approved
Thoman, P., Dichev, K., Heller, T., Iakymchuk, R., Aguilar, X., Hasanov, K., . . . Nikolopoulos, D. S. (2018). A taxonomy of task-based parallel programming technologies for high-performance computing. Journal of Supercomputing, 74(4), 1422-1434
A taxonomy of task-based parallel programming technologies for high-performance computing
2018 (English). In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 74, no 4, p. 1422-1434. Article in journal (Refereed), Published
Abstract [en]

Task-based programming models for shared memory-such as Cilk Plus and OpenMP 3-are well established and documented. However, with the increase in parallel, many-core, and heterogeneous systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, despite the fact that dozens of different task-based systems exist today and are actively used for parallel and high-performance computing (HPC), no comprehensive overview or classification of task-based technologies for HPC exists. In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms. We demonstrate the usefulness of our taxonomy by classifying state-of-the-art task-based environments in use today.

Place, publisher, year, edition, pages
SPRINGER, 2018
Keywords
High-performance computing, Task-based parallelism, Taxonomy, API, Runtime system, Scheduler, Monitoring framework, Fault tolerance
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-226199 (URN); 10.1007/s11227-018-2238-4 (DOI); 000428284000002 (); 2-s2.0-85041817729 (Scopus ID)
Note

QC 20180518

Available from: 2018-05-18. Created: 2018-05-18. Last updated: 2018-05-18. Bibliographically approved
Thoman, P., Hasanov, K., Dichev, K., Iakymchuk, R., Aguilar, X., Gschwandtner, P., . . . Fahringer, T. (2018). A Taxonomy of Task-Based Technologies for High-Performance Computing. In: Wyrzykowski, R Dongarra, J Deelman, E Karczewski, K (Ed.), PARALLEL PROCESSING AND APPLIED MATHEMATICS (PPAM 2017), PT II: . Paper presented at 12th International Conference on Parallel Processing and Applied Mathematics (PPAM), SEP 10-13, 2017, Lublin, POLAND (pp. 264-274). SPRINGER INTERNATIONAL PUBLISHING AG
A Taxonomy of Task-Based Technologies for High-Performance Computing
2018 (English). In: PARALLEL PROCESSING AND APPLIED MATHEMATICS (PPAM 2017), PT II / [ed] Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K., SPRINGER INTERNATIONAL PUBLISHING AG, 2018, p. 264-274. Conference paper, Published paper (Refereed)
Abstract [en]

Task-based programming models for shared memory - such as Cilk Plus and OpenMP 3 - are well established and documented. However, with the increase in heterogeneous, many-core and parallel systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, despite the fact that dozens of different task-based systems exist today and are actively used for parallel and high-performance computing, no comprehensive overview or classification of task-based technologies for HPC exists. In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms. We demonstrate the usefulness of our taxonomy by classifying state-of-the-art task-based environments in use today.

Place, publisher, year, edition, pages
SPRINGER INTERNATIONAL PUBLISHING AG, 2018
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 10778
Keywords
Task-based parallelism, Taxonomy, API, Runtime system, Scheduler, Monitoring framework, Fault tolerance
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-245025 (URN); 10.1007/978-3-319-78054-2_25 (DOI); 000458563900025 (); 2-s2.0-85044764286 (Scopus ID); 978-3-319-78054-2 (ISBN)
Conference
12th International Conference on Parallel Processing and Applied Mathematics (PPAM), SEP 10-13, 2017, Lublin, POLAND
Note

QC 20190305

Available from: 2019-03-05. Created: 2019-03-05. Last updated: 2019-03-05. Bibliographically approved
Chien, S. W. D., Markidis, S., Sishtla, C. P., Santos, L., Herman, P., Narasimhamurthy, S. & Laure, E. (2018). Characterizing Deep-Learning I/O Workloads in TensorFlow. In: Proceedings of PDSW-DISCS 2018: 3rd Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis: . Paper presented at 3rd IEEE/ACM Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, PDSW-DISCS 2018; Dallas; United States; 12 November 2018 (pp. 54-63). Institute of Electrical and Electronics Engineers (IEEE)
Characterizing Deep-Learning I/O Workloads in TensorFlow
2018 (English). In: Proceedings of PDSW-DISCS 2018: 3rd Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis, Institute of Electrical and Electronics Engineers (IEEE), 2018, p. 54-63. Conference paper, Published paper (Refereed)
Abstract [en]

The performance of Deep-Learning (DL) computing frameworks relies on the performance of data ingestion and checkpointing. In fact, during training, a considerably high number of relatively small files are first loaded and pre-processed on CPUs and then moved to the accelerator for computation. In addition, checkpointing and restart operations are carried out to allow DL computing frameworks to restart quickly from a checkpoint. Because of this, I/O affects the performance of DL applications. In this work, we characterize the I/O performance and scaling of TensorFlow, an open-source programming framework developed by Google and specifically designed for solving DL problems. To measure TensorFlow I/O performance, we first design a micro-benchmark to measure TensorFlow reads, and then use a TensorFlow mini-application based on AlexNet to measure the performance cost of I/O and checkpointing in TensorFlow. To improve the checkpointing performance, we design and implement a burst buffer. We find that increasing the number of threads increases TensorFlow bandwidth by a maximum of 2.3x and 7.8x on our benchmark environments. The use of the TensorFlow prefetcher results in a complete overlap of computation on the accelerator and the input pipeline on the CPU, eliminating the effective cost of I/O on the overall performance. The use of a burst buffer to checkpoint to a fast, small-capacity storage and asynchronously copy the checkpoints to a slower, large-capacity storage resulted in a performance improvement of 2.6x with respect to checkpointing directly to the slower storage on our benchmark environment.
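
As a companion to the prefetching result above, the following tf.data sketch shows how an input pipeline can read files with multiple threads and prefetch batches so that CPU-side ingestion overlaps computation on the accelerator. It uses the current TensorFlow 2.x API rather than the 1.x API evaluated in the paper, and the file pattern and parsing function are placeholders, not the paper's benchmark code.

```python
# Sketch of a threaded, prefetching input pipeline (TensorFlow 2.x tf.data API).
# File names, record layout, and image size are illustrative assumptions.
import tensorflow as tf

def parse_example(record):
    # Decode one serialized record into an (image, label) pair.
    features = tf.io.parse_single_example(record, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.io.decode_jpeg(features["image"], channels=3)
    return tf.image.resize(image, [227, 227]), features["label"]

filenames = tf.data.Dataset.list_files("train-*.tfrecord")        # assumed layout
dataset = (
    tf.data.TFRecordDataset(filenames, num_parallel_reads=8)      # threaded reads
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)      # parallel decode
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)  # overlap the input pipeline with the training step
)

for images, labels in dataset.take(1):
    print(images.shape, labels.shape)
```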

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2018
Keywords
Parallel I/O, Input Pipeline, Deep Learning, TensorFlow
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-248377 (URN); 10.1109/PDSW-DISCS.2018.00011 (DOI); 000462205000006 (); 2-s2.0-85063062239 (Scopus ID)
Conference
3rd IEEE/ACM Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, PDSW-DISCS 2018; Dallas; United States; 12 November 2018
Note

QC 20190405

Available from: 2019-04-05. Created: 2019-04-05. Last updated: 2019-04-05. Bibliographically approved
Peng, I. B., Gioiosa, R., Kestor, G., Vetter, J. S., Cicotti, P., Laure, E. & Markidis, S. (2018). Characterizing the performance benefit of hybrid memory system for HPC applications. Parallel Computing, 76, 57-69
Characterizing the performance benefit of hybrid memory system for HPC applications
2018 (English). In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 76, p. 57-69. Article in journal (Refereed), Published
Abstract [en]

Heterogeneous memory systems that consist of multiple memory technologies are becoming common in high-performance computing environments. Modern processors and accelerators, such as the Intel Knights Landing (KNL) CPU and NVIDIA Volta GPU, feature small-size high-bandwidth memory near the compute cores and large-size normal-bandwidth memory that is connected off-chip. Theoretically, HBM can provide about four times higher bandwidth than conventional DRAM. However, many factors impact the actual performance improvement that an application can achieve on such a system. In this paper, we focus on the Intel KNL system and identify the most important factors for application performance, including the application memory access pattern, the problem size, the threading level and the actual memory configuration. We use a set of representative applications from both the scientific and data-analytics domains. Our results show that applications with regular memory access benefit from MCDRAM, achieving up to three times the performance obtained using only DRAM. On the contrary, applications with an irregular memory access pattern are latency-bound and may suffer performance degradation when using only MCDRAM. Also, we provide a memory-centric analysis of four applications, identify their major data objects, and correlate their characteristics to the performance improvement on the testbed.

Place, publisher, year, edition, pages
Elsevier, 2018
Keywords
Heterogeneous memory system, Intel Knights Landing (KNL) processor, MCDRAM, Memory-centric profiling
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-229249 (URN); 10.1016/j.parco.2018.04.007 (DOI); 000446404600005 (); 2-s2.0-85047113223 (Scopus ID)
Funder
EU, European Research Council
Note

QC 20180601

Available from: 2018-06-01. Created: 2018-06-01. Last updated: 2018-10-23. Bibliographically approved
Al Ahad, M. A., Simmendinger, C., Iakymchuk, R., Laure, E. & Markidis, S. (2018). Efficient Algorithms for Collective Operations with Notified Communication in Shared Windows. In: PROCEEDINGS OF PAW-ATM18: 2018 IEEE/ACM PARALLEL APPLICATIONS WORKSHOP, ALTERNATIVES TO MPI (PAW-ATM). Paper presented at 2018 IEEE/ACM PARALLEL APPLICATIONS WORKSHOP, ALTERNATIVES TO MPI (PAW-ATM) (pp. 1-10). IEEE
Efficient Algorithms for Collective Operations with Notified Communication in Shared Windows
2018 (English). In: PROCEEDINGS OF PAW-ATM18: 2018 IEEE/ACM PARALLEL APPLICATIONS WORKSHOP, ALTERNATIVES TO MPI (PAW-ATM), IEEE, 2018, p. 1-10. Conference paper, Published paper (Refereed)
Abstract [en]

Collective operations are commonly used in various parts of scientific applications. Especially in strong scaling scenarios, collective operations can negatively impact the overall application performance: while the load per rank decreases with increasing core counts, the time spent in, e.g., barrier operations will increase logarithmically with the core count. In this article, we develop novel algorithmic solutions for collective operations such as Allreduce and Allgather(V) by leveraging notified communication in shared windows. To this end, we have developed an extension of GASPI which enables all ranks participating in a shared window to observe the entire notified communication targeted at the window. By exploring the benefits of this extension, we deliver high-performing implementations of Allreduce and Allgather(V) on Intel and Cray clusters. These implementations clearly achieve 2x-4x performance improvements compared to the best-performing MPI implementations for various data distributions.

Place, publisher, year, edition, pages
IEEE, 2018
Keywords
Collectives, Allreduce, Allgather, AllgatherV, MPI, PGAS, GASPI, shared windows, shared notifications
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-249835 (URN); 10.1109/PAW-ATM.2018.00006 (DOI); 000462965600001 (); 2-s2.0-85063078028 (Scopus ID)
Conference
2018 IEEE/ACM PARALLEL APPLICATIONS WORKSHOP, ALTERNATIVES TO MPI (PAW-ATM)
Note

QC 20190423

Available from: 2019-04-23. Created: 2019-04-23. Last updated: 2019-04-23. Bibliographically approved
Rivas-Gomez, S., Pena, A. J., Moloney, D., Laure, E. & Markidis, S. (2018). Exploring the vision processing unit as co-processor for inference. In: Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018: . Paper presented at 32nd IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018, Vancouver, Canada, 21 May 2018 through 25 May 2018 (pp. 589-598). Institute of Electrical and Electronics Engineers (IEEE), Article ID 8425465.
Exploring the vision processing unit as co-processor for inference
2018 (English). In: Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018, Institute of Electrical and Electronics Engineers (IEEE), 2018, p. 589-598, article id 8425465. Conference paper, Published paper (Refereed)
Abstract [en]

The success of the exascale supercomputer is largely debated to remain dependent on novel breakthroughs in technology that effectively reduce the power consumption and thermal dissipation requirements. In this work, we consider the integration of co-processors in high-performance computing (HPC) to enable low-power, seamless computation offloading of certain operations. In particular, we explore the so-called Vision Processing Unit (VPU), a highly-parallel vector processor with a power envelope of less than 1W. We evaluate this chip during inference using a pre-trained GoogLeNet convolutional network model and a large image dataset from the ImageNet ILSVRC challenge. Preliminary results indicate that a multi-VPU configuration provides similar performance compared to reference CPU and GPU implementations, while reducing the thermal-design power (TDP) up to 8x in comparison.
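
As a rough illustration of the offloading model described above, the sketch below pushes one inference to a Myriad VPU through the Python API of the (since discontinued) Movidius NCSDK v1. The call names follow the NCSDK v1 examples from memory, and the graph and input files are assumptions, so treat this as an outline of the workflow rather than the paper's benchmark code.

```python
# Hedged sketch: single-image inference offloaded to a Movidius VPU via the
# NCSDK v1 Python API. The graph file and input are placeholders.
import numpy as np
from mvnc import mvncapi as mvnc

devices = mvnc.EnumerateDevices()
if not devices:
    raise RuntimeError("no Myriad VPU found")

device = mvnc.Device(devices[0])
device.OpenDevice()

# GoogLeNet pre-compiled for the VPU with the NCSDK toolchain (assumed file).
with open("googlenet.graph", "rb") as f:
    graph = device.AllocateGraph(f.read())

image = np.random.rand(224, 224, 3).astype(np.float16)  # stand-in input image
graph.LoadTensor(image, "user object")                   # asynchronous offload
output, _ = graph.GetResult()                            # blocks until the VPU is done
print("top-1 class:", int(np.argmax(output)))

graph.DeallocateGraph()
device.CloseDevice()
```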

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2018
Keywords
High-Performance Computing, Machine Learning, Vision Processing Unit
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-234098 (URN); 10.1109/IPDPSW.2018.00098 (DOI); 2-s2.0-85052218072 (Scopus ID); 9781538655559 (ISBN)
Conference
32nd IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018, Vancouver, Canada, 21 May 2018 through 25 May 2018
Note

QC 20180905

Available from: 2018-09-05. Created: 2018-09-05. Last updated: 2018-09-05. Bibliographically approved
Akhmetova, D., Cebamanos, L., Iakymchuk, R., Rotaru, T., Rahn, M., Markidis, S., . . . Simmendinger, C. (2018). Interoperability of GASPI and MPI in large scale scientific applications. In: 12th International Conference on Parallel Processing and Applied Mathematics, PPAM 2017: . Paper presented at 10 September 2017 through 13 September 2017 (pp. 277-287). Springer Verlag
Interoperability of GASPI and MPI in large scale scientific applications
2018 (English). In: 12th International Conference on Parallel Processing and Applied Mathematics, PPAM 2017, Springer Verlag, 2018, p. 277-287. Conference paper, Published paper (Refereed)
Abstract [en]

One of the main hurdles to a broad adoption of PGAS approaches is the prevalence of MPI, which as a de-facto standard appears in the code base of many applications. To take advantage of PGAS APIs like GASPI without a major change to the code base, interoperability between MPI and PGAS approaches needs to be ensured. In this article, we address this challenge by providing our study and preliminary performance results regarding interoperating GASPI and MPI on the performance-crucial parts of the Ludwig and iPIC3D applications. In addition, we draw a strategy for better coupling of both APIs.

Place, publisher, year, edition, pages
Springer Verlag, 2018
Keywords
GASPI, Halo exchange, Interoperability, iPIC3D, Ludwig, MPI, Artificial intelligence, Computer science, Computers, De facto standard, Preliminary performance results, Scientific applications
National Category
Mathematics
Identifiers
urn:nbn:se:kth:diva-227469 (URN); 10.1007/978-3-319-78054-2_26 (DOI); 000458563900026 (); 2-s2.0-85044787063 (Scopus ID); 9783319780535 (ISBN)
Conference
12th International Conference on Parallel Processing and Applied Mathematics (PPAM 2017), 10 September 2017 through 13 September 2017, Lublin, Poland
Note

QC 20180521

Available from: 2018-05-21. Created: 2018-05-21. Last updated: 2019-03-05. Bibliographically approved
Rivas-Gomez, S., Gioiosa, R., Peng, I. B., Kestor, G., Narasimhamurthy, S., Laure, E. & Markidis, S. (2018). MPI windows on storage for HPC applications. Parallel Computing, 77, 38-56
MPI windows on storage for HPC applications
2018 (English). In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 77, p. 38-56. Article in journal (Refereed), Published
Abstract [en]

Upcoming HPC clusters will feature hybrid memories and storage devices per compute node. In this work, we propose to use the MPI one-sided communication model and MPI windows as unique interface for programming memory and storage. We describe the design and implementation of MPI storage windows, and present its benefits for out-of-core execution, parallel I/O and fault-tolerance. In addition, we explore the integration of heterogeneous window allocations, where memory and storage share a unified virtual address space. When performing large, irregular memory operations, we verify that MPI windows on local storage incurs a 55% performance penalty on average. When using a Lustre parallel file system, "asymmetric" performance is observed with over 90% degradation in writing operations. Nonetheless, experimental results of a Distributed Hash Table, the HACC I/O kernel mini-application, and a novel MapReduce implementation based on the use of MPI one-sided communication, indicate that the overall penalty of MPI windows on storage can be negligible in most cases in real-world applications.
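
To make the storage-window idea more concrete, the following mpi4py sketch allocates an MPI window with allocation hints and then treats its memory as an ordinary buffer. The info keys and file path are hypothetical placeholders: the actual hints are defined by the authors' implementation, not by the MPI standard, and a stock MPI library will simply ignore them and allocate in DRAM.

```python
# Sketch of allocating an MPI window that an implementation could back with
# storage instead of DRAM. The info keys below are hypothetical placeholders,
# not the hints defined by the paper's implementation.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

info = MPI.Info.Create()
info.Set("alloc_type", "storage")                    # hypothetical hint
info.Set("storage_alloc_filename", "/tmp/win.dat")   # hypothetical hint

win = MPI.Win.Allocate(1 << 20, disp_unit=1, info=info, comm=comm)

# Use the (possibly storage-backed) window like ordinary local memory.
buf = np.frombuffer(win.tomemory(), dtype=np.uint8)
buf[:8] = comm.Get_rank()

win.Fence()  # open/close an active-target epoch so remote ranks could access it
win.Fence()

win.Free()
info.Free()
```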

Place, publisher, year, edition, pages
Elsevier, 2018
Keywords
MPI windows on storage, Out-of-core computation, Parallel I/O
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-235114 (URN); 10.1016/j.parco.2018.05.007 (DOI); 000441688300003 (); 2-s2.0-85048347715 (Scopus ID)
Note

QC 20180919

Available from: 2018-09-19. Created: 2018-09-19. Last updated: 2018-09-19. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0003-0639-0639
