Publications (10 of 89)
Peng, I. B., Vetter, J. S., Moore, S., Joydeep, R. & Markidis, S. (2019). Analyzing the Suitability of Contemporary 3D-Stacked PIM Architectures for HPC Scientific Applications. In: CF '19 - PROCEEDINGS OF THE 16TH ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS. Paper presented at 16th ACM International Conference on Computing Frontiers, CF 2019; Alghero, Sardinia; Italy; 30 April 2019 through 2 May 2019 (pp. 256-262). ASSOC COMPUTING MACHINERY
2019 (English). In: CF '19 - PROCEEDINGS OF THE 16TH ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS, ASSOC COMPUTING MACHINERY, 2019, p. 256-262. Conference paper, Published paper (Refereed)
Abstract [en]

Scaling off-chip bandwidth is challenging due to fundamental limitations, such as a fixed pin count and plateauing signaling rates. Recently, vendors have turned to 2.5D and 3D stacking to closely integrate system components. Interestingly, these technologies can integrate a logic layer under multiple memory dies, enabling computing capability inside a memory stack. This trend in stacking is making PIM architectures commercially viable. In this work, we investigate the suitability of offloading kernels in scientific applications onto 3D stacked PIM architectures. We evaluate several hardware constraints resulting from the stacked structure. We perform extensive simulation experiments and in-depth analysis to quantify the impact of application locality in TLBs, data caches, and memory stacks. Our results also identify design optimization areas in software and hardware for HPC scientific applications.

Place, publisher, year, edition, pages
ASSOC COMPUTING MACHINERY, 2019
Keywords
processing-in-memory, 3D stacked memory, PIM
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-255514 (URN); 10.1145/3310273.3322831 (DOI); 000474686400036 (ISI); 2-s2.0-85066055698 (Scopus ID)
Conference
16th ACM International Conference on Computing Frontiers, CF 2019; Alghero, Sardinia; Italy; 30 April 2019 through 2 May 2019
Note

QC 20191016

Available from: 2019-10-16. Created: 2019-10-16. Last updated: 2019-10-16. Bibliographically approved
Rivas Gomez, S., Markidis, S., Laure, E., Brabazon, K., Perks, O. & Narasimhamurthy, S. (2019). Decoupled Strategy for Imbalanced Workloads in MapReduce Frameworks. In: Proceedings - 20th International Conference on High Performance Computing and Communications, 16th International Conference on Smart City and 4th International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018. Paper presented at 20th International Conference on High Performance Computing and Communications, 16th IEEE International Conference on Smart City and 4th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018, 28 June 2018 through 30 June 2018 (pp. 921-927). Institute of Electrical and Electronics Engineers (IEEE)
2019 (English). In: Proceedings - 20th International Conference on High Performance Computing and Communications, 16th International Conference on Smart City and 4th International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018, Institute of Electrical and Electronics Engineers (IEEE), 2019, p. 921-927. Conference paper, Published paper (Refereed)
Abstract [en]

In this work, we consider the integration of MPI one-sided communication and non-blocking I/O in HPC-centric MapReduce frameworks. Using a decoupled strategy, we aim to overlap the Map and Reduce phases of the algorithm by allowing processes to communicate and synchronize using solely one-sided operations. Hence, we effectively increase the performance in situations where the workload per process becomes unexpectedly unbalanced. Using a Word-Count implementation and a large dataset from the Purdue MapReduce Benchmarks Suite (PUMA), we demonstrate that our approach can provide up to 23% performance improvement on average compared to a reference MapReduce implementation that uses state-of-the-art MPI collective communication and I/O.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2019
Keywords
High Performance Computing, MapReduce, MPI One Sided Communication
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-246358 (URN); 10.1109/HPCC/SmartCity/DSS.2018.00153 (DOI); 000468511200121 (ISI); 2-s2.0-85062487109 (Scopus ID); 9781538666142 (ISBN)
Conference
20th International Conference on High Performance Computing and Communications, 16th IEEE International Conference on Smart City and 4th IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018, 28 June 2018 through 30 June 2018
Note

QC 20190319

Available from: 2019-03-19. Created: 2019-03-19. Last updated: 2019-11-01. Bibliographically approved
Zhou, H., Toth, G., Jia, X., Chen, Y. & Markidis, S. (2019). Embedded Kinetic Simulation of Ganymede's Magnetosphere: Improvements and Inferences. Journal of Geophysical Research - Space Physics, 124(7), 5441-5460
2019 (English). In: Journal of Geophysical Research - Space Physics, ISSN 2169-9380, E-ISSN 2169-9402, Vol. 124, no 7, p. 5441-5460. Article in journal (Refereed), Published
Abstract [en]

The largest moon in the solar system, Ganymede, is also the only moon known to possess a strong intrinsic magnetic field and a corresponding magnetosphere. Using the new version of the Hall magnetohydrodynamics with embedded particle-in-cell model, which includes a self-consistently coupled resistive body representing the electrical properties of the moon's interior, improved inner boundary conditions, and the flexibility of coupling different grid geometries, we achieve a better match of the magnetic field with measurements for all six Galileo flybys. The G2 flyby comparisons of plasma bulk flow velocities with the Galileo Plasma Subsystem data support the oxygen ion assumption inside Ganymede's magnetosphere. Crescent-shaped, nongyrotropic, and nonisotropic ion distributions are identified from the coupled model. Furthermore, we have derived energy fluxes of approximately 10^-7 W/cm^2 associated with the upstream magnetopause reconnection based on our model results and found a maximum contribution of 40% to the total peak auroral emissions.

Place, publisher, year, edition, pages
AMER GEOPHYSICAL UNION, 2019
Keywords
Ganymede, simulation, magnetosphere, reconnection
National Category
Geophysics
Identifiers
urn:nbn:se:kth:diva-259461 (URN); 10.1029/2019JA026643 (DOI); 000482985600033 (ISI); 2-s2.0-85069678381 (Scopus ID)
Note

QC 20190920

Available from: 2019-09-20. Created: 2019-09-20. Last updated: 2019-09-20. Bibliographically approved
Simmendinger, C., Iakymchuk, R., Cebamanos, L., Akhmetova, D., Bartsch, V., Rotaru, T., . . . Markidis, S. (2019). Interoperability strategies for GASPI and MPI in large-scale scientific applications. The international journal of high performance computing applications, 33(3), 554-568
2019 (English). In: The international journal of high performance computing applications, ISSN 1094-3420, E-ISSN 1741-2846, Vol. 33, no 3, p. 554-568. Article in journal (Refereed), Published
Abstract [en]

One of the main hurdles of partitioned global address space (PGAS) approaches is the dominance of message passing interface (MPI), which as a de facto standard appears in the code base of many applications. To take advantage of PGAS APIs like the global address space programming interface (GASPI) without a major change in the code base, interoperability between MPI and PGAS approaches needs to be ensured. In this article, we consider an interoperable GASPI/MPI implementation for the communication- and performance-critical parts of the Ludwig and iPIC3D applications. To address the discovered performance limitations, we develop a novel strategy for significantly improved performance and interoperability between both APIs by leveraging GASPI shared windows and shared notifications. First results with a corresponding implementation in the MiniGhost proxy application and the Allreduce collective operation demonstrate the viability of this approach.

Place, publisher, year, edition, pages
SAGE PUBLICATIONS LTD, 2019
Keywords
Interoperability, GASPI, MPI, iPIC3D, Ludwig, MiniGhost, halo exchange, Allreduce
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-254034 (URN); 10.1177/1094342018808359 (DOI); 000468919900011 (ISI); 2-s2.0-85059353725 (Scopus ID)
Note

QC 20190814

Available from: 2019-08-14. Created: 2019-08-14. Last updated: 2019-08-14. Bibliographically approved
Wallden, M., Markidis, S., Okita, M. & Ino, F. (2019). Memory Efficient Load Balancing for Distributed Large-Scale Volume Rendering Using a Two-Layered Group Structure. IEICE transactions on information and systems, E102D(12), 2306-2316
2019 (English). In: IEICE transactions on information and systems, ISSN 0916-8532, E-ISSN 1745-1361, Vol. E102D, no 12, p. 2306-2316. Article in journal (Refereed), Published
Abstract [en]

We propose a novel compositing pipeline and a dynamic load balancing technique for volume rendering which utilizes a two-layered group structure to achieve effective and scalable load balancing. The technique enables each process to render data from non-contiguous regions of the volume with minimal impact on the total render time. We demonstrate the effectiveness of the proposed technique by performing a set of experiments on a modern GPU cluster. The experiments show that using the technique results in up to a 35.7% lower worst-case memory usage as compared to a dynamic k-d tree load balancing technique, whilst simultaneously achieving similar or higher render performance. The proposed technique was also able to lower the amount of transferred data during the load balancing stage by up to 72.2%. The technique has the potential to be used in many scenarios where other dynamic load balancing techniques have proved to be inadequate, such as during large-scale visualization.

Keywords
large-scale visualization, distributed computing, load balancing, GPU
National Category
Computer and Information Sciences; Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-265514 (URN); 10.1587/transinf.2019PAP0003 (DOI); 000499697000004 (ISI)
Note

QC 20191213

Available from: 2019-12-13. Created: 2019-12-13. Last updated: 2019-12-13. Bibliographically approved
Sishtla, C. P., Olshevsky, V., Chien, W. D., Markidis, S. & Laure, E. (2019). Particle-in-Cell Simulations of Plasma Dynamics in Cometary Environment. In: Journal of Physics: Conference Series. Paper presented at 13th International Conference on Numerical Modeling of Space Plasma Flows, ASTRONUM 2018; Panama City Beach; United States; 25 June 2018 through 29 June 2018. Institute of Physics Publishing (IOPP), 1225(1), Article ID 012009.
2019 (English). In: Journal of Physics: Conference Series, Institute of Physics Publishing (IOPP), 2019, Vol. 1225, no 1, article id 012009. Conference paper, Published paper (Refereed)
Abstract [en]

We perform and analyze global Particle-in-Cell (PIC) simulations of the interaction between solar wind and an outgassing comet with the goal of studying the plasma kinetic dynamics of a cometary environment. To achieve this, we design and implement a new numerical method in the iPIC3D code to model outgassing from the comet: new plasma particles are ejected from the comet "surface" at each computational cycle. Our simulations show that a bow shock is formed as a result of the interaction between solar wind and outgassed particles. The analysis of distribution functions for the PIC simulations shows that, at the bow shock, part of the incoming solar wind ions are reflected while electrons are heated. This work attempts to reveal kinetic effects in the atmosphere of an outgassing comet using a fully kinetic Particle-in-Cell model.

Place, publisher, year, edition, pages
Institute of Physics Publishing (IOPP), 2019
Series
Journal of Physics: Conference Series, ISSN 1742-6588; 1225
National Category
Physical Sciences
Identifiers
urn:nbn:se:kth:diva-262635 (URN); 10.1088/1742-6596/1225/1/012009 (DOI); 000478669600009 (ISI); 2-s2.0-85068062214 (Scopus ID)
Conference
13th International Conference on Numerical Modeling of Space Plasma Flows, ASTRONUM 2018; Panama City Beach; United States; 25 June 2018 through 29 June 2018
Note

QC 20191018

Available from: 2019-10-18. Created: 2019-10-18. Last updated: 2019-11-07. Bibliographically approved
Narasimhamurthy, S., Danilov, N., Wu, S., Umanesan, G., Markidis, S., Rivas-Gomez, S., . . . de Witt, S. (2019). SAGE: Percipient Storage for Exascale Data Centric Computing. Parallel Computing, 83, 22-33
2019 (English). In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 83, p. 22-33. Article in journal (Refereed), Published
Abstract [en]

We aim to implement a Big Data/Extreme Computing (BDEC) capable system infrastructure as we head towards the era of Exascale computing - termed SAGE (Percipient StorAGe for Exascale Data Centric Computing). The SAGE system will be capable of storing and processing immense volumes of data at the Exascale regime, and provide the capability for Exascale class applications to use such a storage infrastructure. SAGE addresses the increasing overlaps between Big Data Analysis and HPC in an era of next-generation data centric computing that has developed due to the proliferation of massive data sources, such as large, dispersed scientific instruments and sensors, whose data needs to be processed, analysed and integrated into simulations to derive scientific and innovative insights. Indeed, Exascale I/O, as a problem that has not been sufficiently dealt with for simulation codes, is appropriately addressed by the SAGE platform. The objective of this paper is to discuss the software architecture of the SAGE system and look at early results we have obtained employing some of its key methodologies, as the system continues to evolve.

Place, publisher, year, edition, pages
Elsevier, 2019
Keywords
SAGE architecture, Object storage, Mero, Clovis, PGAS I/O, MPI I/O, MPI streams
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-254119 (URN); 10.1016/j.parco.2018.03.002 (DOI); 000469898400003 (ISI); 2-s2.0-85044917976 (Scopus ID)
Note

QC 20190624

Available from: 2019-06-24. Created: 2019-06-24. Last updated: 2019-06-24. Bibliographically approved
Thoman, P., Dichev, K., Heller, T., Iakymchuk, R., Aguilar, X., Hasanov, K., . . . Nikolopoulos, D. S. (2018). A taxonomy of task-based parallel programming technologies for high-performance computing. Journal of Supercomputing, 74(4), 1422-1434
2018 (English). In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 74, no 4, p. 1422-1434. Article in journal (Refereed), Published
Abstract [en]

Task-based programming models for shared memory - such as Cilk Plus and OpenMP 3 - are well established and documented. However, with the increase in parallel, many-core, and heterogeneous systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, despite the fact that dozens of different task-based systems exist today and are actively used for parallel and high-performance computing (HPC), no comprehensive overview or classification of task-based technologies for HPC exists. In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms. We demonstrate the usefulness of our taxonomy by classifying state-of-the-art task-based environments in use today.

Place, publisher, year, edition, pages
SPRINGER, 2018
Keywords
High-performance computing, Task-based parallelism, Taxonomy, API, Runtime system, Scheduler, Monitoring framework, Fault tolerance
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-226199 (URN); 10.1007/s11227-018-2238-4 (DOI); 000428284000002 (ISI); 2-s2.0-85041817729 (Scopus ID)
Note

QC 20180518

Available from: 2018-05-18. Created: 2018-05-18. Last updated: 2019-08-20. Bibliographically approved
Thoman, P., Hasanov, K., Dichev, K., Iakymchuk, R., Aguilar, X., Gschwandtner, P., . . . Fahringer, T. (2018). A Taxonomy of Task-Based Technologies for High-Performance Computing. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (Ed.), PARALLEL PROCESSING AND APPLIED MATHEMATICS (PPAM 2017), PT II. Paper presented at 12th International Conference on Parallel Processing and Applied Mathematics (PPAM), SEP 10-13, 2017, Lublin, POLAND (pp. 264-274). SPRINGER INTERNATIONAL PUBLISHING AG
2018 (English). In: PARALLEL PROCESSING AND APPLIED MATHEMATICS (PPAM 2017), PT II / [ed] Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K., SPRINGER INTERNATIONAL PUBLISHING AG, 2018, p. 264-274. Conference paper, Published paper (Refereed)
Abstract [en]

Task-based programming models for shared memory - such as Cilk Plus and OpenMP 3 - are well established and documented. However, with the increase in heterogeneous, many-core and parallel systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, despite the fact that dozens of different task-based systems exist today and are actively used for parallel and high-performance computing, no comprehensive overview or classification of task-based technologies for HPC exists. In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms. We demonstrate the usefulness of our taxonomy by classifying state-of-the-art task-based environments in use today.

Place, publisher, year, edition, pages
SPRINGER INTERNATIONAL PUBLISHING AG, 2018
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 10778
Keywords
Task-based parallelism, Taxonomy, API, Runtime system, Scheduler, Monitoring framework, Fault tolerance
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-245025 (URN); 10.1007/978-3-319-78054-2_25 (DOI); 000458563900025 (ISI); 2-s2.0-85044764286 (Scopus ID); 978-3-319-78054-2 (ISBN)
Conference
12th International Conference on Parallel Processing and Applied Mathematics (PPAM), SEP 10-13, 2017, Lublin, POLAND
Note

QC 20190305

Available from: 2019-03-05. Created: 2019-03-05. Last updated: 2019-03-05. Bibliographically approved
Chien, S. W. D., Markidis, S., Sishtla, C. P., Santos, L., Herman, P., Narasimhamurthy, S. & Laure, E. (2018). Characterizing Deep-Learning I/O Workloads in TensorFlow. In: Proceedings of PDSW-DISCS 2018: 3rd Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis. Paper presented at 3rd IEEE/ACM Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, PDSW-DISCS 2018; Dallas; United States; 12 November 2018 (pp. 54-63). Institute of Electrical and Electronics Engineers (IEEE)
2018 (English). In: Proceedings of PDSW-DISCS 2018: 3rd Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis, Institute of Electrical and Electronics Engineers (IEEE), 2018, p. 54-63. Conference paper, Published paper (Refereed)
Abstract [en]

The performance of Deep-Learning (DL) computing frameworks relies on the performance of data ingestion and checkpointing. In fact, during training, a considerably high number of relatively small files are first loaded and pre-processed on CPUs and then moved to the accelerator for computation. In addition, checkpointing and restart operations are carried out to allow DL computing frameworks to restart quickly from a checkpoint. Because of this, I/O affects the performance of DL applications. In this work, we characterize the I/O performance and scaling of TensorFlow, an open-source programming framework developed by Google and specifically designed for solving DL problems. To measure TensorFlow I/O performance, we first design a micro-benchmark to measure TensorFlow reads, and then use a TensorFlow mini-application based on AlexNet to measure the performance cost of I/O and checkpointing in TensorFlow. To improve the checkpointing performance, we design and implement a burst buffer. We find that increasing the number of threads increases TensorFlow bandwidth by a maximum of 2.3x and 7.8x on our benchmark environments. The use of the TensorFlow prefetcher results in a complete overlap of computation on the accelerator and the input pipeline on the CPU, eliminating the effective cost of I/O on the overall performance. The use of a burst buffer to checkpoint to a fast small-capacity storage and copy the checkpoints asynchronously to a slower large-capacity storage resulted in a performance improvement of 2.6x with respect to checkpointing directly to slower storage on our benchmark environment.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2018
Keywords
Parallel I/O, Input Pipeline, Deep Learning, TensorFlow
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-248377 (URN); 10.1109/PDSW-DISCS.2018.00011 (DOI); 000462205000006 (ISI); 2-s2.0-85063062239 (Scopus ID)
Conference
3rd IEEE/ACM Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, PDSW-DISCS 2018; Dallas; United States; 12 November 2018
Note

QC 20190405

Available from: 2019-04-05. Created: 2019-04-05. Last updated: 2019-04-05. Bibliographically approved
Organisations
Identifiers
ORCID iD: orcid.org/0000-0003-0639-0639