  • 1.
    Allen, Tyler
    et al.
    University of North Carolina, Charlotte, United States.
    Peng, Ivy Bo
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Brightwell, Ron
    Sandia National Laboratories, United States.
    Gokhale, Maya
    Lawrence Livermore National Laboratory, United States.
    Workshop on Memory Technologies, Systems, and Applications (MTSA'23), 2023. In: ACM International Conference Proceeding Series, Association for Computing Machinery, 2023, p. 961. Conference paper (Other academic)
  • 2.
    Araújo De Medeiros, Daniel
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Peng, Ivy Bo
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    LibCOS: Enabling Converged HPC and Cloud Data Stores with MPI, 2023. In: Proceedings of International Conference on High Performance Computing in Asia-Pacific Region, HPC Asia 2023, Association for Computing Machinery (ACM), 2023, p. 106-116. Conference paper (Refereed)
    Abstract [en]

Federated HPC and cloud resources are becoming increasingly strategic for providing diversified and geographically available computing resources. However, accessing data stores across HPC and cloud storage systems is challenging. Many cloud providers use object storage systems to support their clients in storing and retrieving data over the internet. One popular method is REST APIs atop the HTTP protocol, with Amazon's S3 APIs being supported by most vendors. In contrast, HPC systems are contained within their own networks and tend to use parallel file systems with POSIX-like interfaces. This work addresses the challenge of diverse data stores on HPC and cloud systems by providing native object storage support through the unified MPI I/O interface in HPC applications. In particular, we provide a prototype library called LibCOS that transparently enables MPI applications running on HPC systems to access object storage on remote cloud systems. We evaluated LibCOS on a Ceph object storage system and a traditional HPC system. In addition, we conducted a performance characterization of the core S3 operations that enable individual and collective MPI I/O. Our evaluation with HACC, IOR, and BigSort shows that enabling diverse data stores on HPC and cloud storage is feasible and can be achieved transparently through the widely adopted MPI I/O. Also, we show that a native object storage system like Ceph can improve the scalability of I/O operations in parallel applications.
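
The MPI I/O path that LibCOS builds on can be illustrated with a plain MPI program. The sketch below is not LibCOS code: it only shows the standard MPI_File calls that, per the abstract, a LibCOS-like layer could transparently service against a remote object store instead of a parallel file system; the file name and sizes are arbitrary.

    /* Each rank writes its own block of a shared file through MPI I/O.
     * A LibCOS-like layer could map such a file onto S3/Ceph objects;
     * here the call goes to whatever file system backs the path. */
    #include <mpi.h>

    #define N 1024

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double buf[N];
        for (int i = 0; i < N; i++) buf[i] = rank + i * 1e-3;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_Offset off = (MPI_Offset)rank * N * sizeof(double);
        /* Collective write: one contiguous block per rank. */
        MPI_File_write_at_all(fh, off, buf, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }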

  • 3.
    Araújo De Medeiros, Daniel
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Wahlgren, Jacob
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Schieffer, Gabin
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Peng, Ivy Bo
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Kub: Enabling Elastic HPC Workloads on Containerized Environments, 2023. In: Proceedings of the 35th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper (Refereed)
    Abstract [en]

The conventional model of resource allocation in HPC systems is static. Thus, a job cannot leverage newly available resources in the system or release underutilized resources during the execution. In this paper, we present Kub, a methodology that enables elastic execution of HPC workloads on Kubernetes so that the resources allocated to a job can be dynamically scaled during the execution. One main optimization of our method is to maximize the reuse of the originally allocated resources so that the disruption to the running job can be minimized. The scaling procedure is coordinated among nodes through remote procedure calls on Kubernetes for deploying workloads in the cloud. We evaluate our approach using one synthetic benchmark and two production-level MPI-based HPC applications - GROMACS and CM1. Our results demonstrate that the benefits of adapting the allocated resources depend on the workload characteristics. In the tested cases, a properly chosen scaling point for increasing resources during execution achieved up to 2x speedup. Also, the overhead of checkpointing and data reshuffling significantly influences the selection of optimal scaling points and requires application-specific knowledge.

  • 4.
    Brightwell, Ron
    et al.
    Sandia National Laboratories, United States.
    Peng, Ivy Bo
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Gokhale, Maya B.
    Lawrence Livermore National Laboratory, United States.
    Yan, Yonghong
    University of North Carolina, Charlotte, NC, United States.
    Message from the MCHPC22 Workshop Chairs, 2022. Conference proceedings (editor) (Other academic)
  • 5.
    Chen, Yuxi
    et al.
    Toth, Gabor
    Cassak, Paul
    Jia, Xianzhe
    Gombosi, Tamas I.
    Slavin, James A.
    Markidis, Stefano
    KTH, Centres, SeRC - Swedish e-Science Research Centre.
    Peng, Ivy Bo
    KTH.
    Jordanova, Vania K.
    Henderson, Michael G.
    Global Three-Dimensional Simulation of Earth's Dayside Reconnection Using a Two-Way Coupled Magnetohydrodynamics With Embedded Particle-in-Cell Model: Initial Results, 2017. In: Journal of Geophysical Research - Space Physics, ISSN 2169-9380, E-ISSN 2169-9402, Vol. 122, no. 10, p. 10318-10335. Article in journal (Refereed)
    Abstract [en]

    We perform a three-dimensional (3-D) global simulation of Earth's magnetosphere with kinetic reconnection physics to study the flux transfer events (FTEs) and dayside magnetic reconnection with the recently developed magnetohydrodynamics with embedded particle-in-cell model. During the 1 h long simulation, the FTEs are generated quasi-periodically near the subsolar point and move toward the poles. We find that the magnetic field signature of FTEs at their early formation stage is similar to a "crater FTE," which is characterized by a magnetic field strength dip at the FTE center. After the FTE core field grows to a significant value, it becomes an FTE with typical flux rope structure. When an FTE moves across the cusp, reconnection between the FTE field lines and the cusp field lines can dissipate the FTE. The kinetic features are also captured by our model. A crescent electron phase space distribution is found near the reconnection site. A similar distribution is found for ions at the location where the Larmor electric field appears. The lower hybrid drift instability (LHDI) along the current sheet direction also arises at the interface of magnetosheath and magnetosphere plasma. The LHDI electric field is about 8 mV/m, and its dominant wavelength relative to the electron gyroradius agrees reasonably with Magnetospheric Multiscale (MMS) observations.

  • 6.
    Chien, Steven Wei Der
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Sishtla, Chaitanya Prasad
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Jun, Zhang
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Peng, Ivy Bo
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC. KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Laure, Erwin
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC. KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    An Evaluation of the TensorFlow Programming Model for Solving Traditional HPC Problems, 2018. In: Proceedings of the 5th International Conference on Exascale Applications and Software, The University of Edinburgh, 2018, p. 34. Conference paper (Refereed)
    Abstract [en]

Computationally intensive applications, such as pattern recognition and natural language processing, are increasingly popular on HPC systems. Many of these applications use deep learning, a branch of machine learning, to determine the weights of artificial neural network nodes by minimizing a loss function. Such applications depend heavily on dense matrix multiplications, also called tensorial operations. The use of Graphics Processing Units (GPUs) has considerably sped up deep-learning computations, leading to a renaissance of artificial neural networks. Recently, the NVIDIA Volta GPU and the Google Tensor Processing Unit (TPU) have been specially designed to support deep-learning workloads. New programming models have also emerged for convenient expression of tensorial operations and deep-learning computational paradigms. An example of such new programming frameworks is TensorFlow, an open-source deep-learning library released by Google in 2015. TensorFlow expresses algorithms as a computational graph where nodes represent operations and edges between nodes represent data flow. Multi-dimensional data, such as vectors and matrices, that flow between operations are called tensors. For this reason, computational problems need to be expressed as a computational graph. In particular, TensorFlow supports distributed computation with flexible assignment of operations and data to devices such as GPUs and CPUs on different computing nodes. Computation on devices is based on optimized kernels such as MKL, Eigen and cuBLAS. Inter-node communication can go through TCP or RDMA. This work attempts to evaluate the usability and expressiveness of the TensorFlow programming model for traditional HPC problems. As an illustration, we prototyped a distributed block matrix multiplication for large dense matrices that cannot be co-located on a single device, and a Conjugate Gradient (CG) solver. We evaluate the difficulty of expressing traditional HPC algorithms using computational graphs and study the scalability of distributed TensorFlow on accelerated systems. Our preliminary result with distributed matrix multiplication shows that distributed computation on TensorFlow is extremely scalable. This study provides an initial investigation of new emerging programming models for HPC.

  • 7.
    Cllasun, Hüsrev
    et al.
    University of Minnesota, USA.
    MacAraeg, Chris
    Lawrence Livermore National Lab, USA.
    Peng, Ivy Bo
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Sarkar, Abhik
    Lawrence Livermore National Lab, USA.
    Gokhale, Maya
    KTH.
    FPGA-accelerated simulation of variable latency memory systems, 2022. In: MEMSYS 2022 - Proceedings of the International Symposium on Memory Systems, Association for Computing Machinery (ACM), 2022, article id 8. Conference paper (Refereed)
    Abstract [en]

With the growing complexity of memory types, organizations, and placement, efficient use of memory systems remains a key objective in processing data-rich workloads. Heterogeneous memories including HBM, conventional DRAM, and persistent memory, both local and network-attached, exhibit a wide range of latencies and bandwidths. The performance delivered to an application may vary widely depending on workload and interference from competing clients. Evaluating the impact of these emerging memory systems on applications challenges traditional simulation techniques. In this work, we describe VLD-sim, an FPGA-accelerated simulator designed to evaluate application performance in the presence of varying, non-deterministic latency. VLD-sim implements a statistical approach in which memory system access latency is non-deterministic, as would occur when request traffic is generated by a large collection of possibly unrelated threads and compute nodes. VLD-sim runs on a Multi-Processor System on Chip with a hard CPU plus configurable logic to enable fast evaluation of workloads or of individual applications. We evaluate VLD-sim with CPU-only and near-memory accelerator-enabled applications and compare against an idealized fixed-latency baseline. Our findings reveal and quantify the performance impact on applications due to non-deterministic latency. With high flexibility and fast execution time, VLD-sim enables system-level evaluation of a large memory architecture design space.

  • 8.
    Faj, Jennifer
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Williams, Jeremy J.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Peng, Ivy Bo
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Ganse, Urs
    University of Helsinki, Helsinki, Finland.
    Battarbee, Markus
    University of Helsinki, Helsinki, Finland.
    Pfau-Kempf, Yann
    University of Helsinki, Helsinki, Finland.
    Kotipalo, Leo
    University of Helsinki, Helsinki, Finland.
    Palmroth, Minna
    University of Helsinki, Helsinki, Finland.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    MPI Performance Analysis in Vlasiator: Unraveling Communication Bottlenecks, 2023. In: SC23 Proceedings: The International Conference for High Performance Computing, Networking, Storage, and Analysis, Denver, Colorado, USA, 2023. Conference paper (Refereed)
    Abstract [en]

    Vlasiator is a popular and powerful massively parallel code for accurate magnetospheric and solar wind plasma simulations. This work provides an in-depth analysis of Vlasiator, focusing on MPI performance using the Integrated Performance Monitoring (IPM) tool. We show that MPI non-blocking point-to-point communication accounts for most of the communication time. The communication topology shows a large number of MPI messages exchanging data in a six-dimensional grid. We also show that relatively large messages are used in MPI communication, reaching up to 256MB. As a communication-bound application, we found that using OpenMP in Vlasiator is critical for eliminating intra-node communication. Our results provide important insights for optimizing Vlasiator for the upcoming Exascale machines.
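
The non-blocking point-to-point pattern that dominates Vlasiator's communication time can be sketched generically. This is not Vlasiator code; it is a minimal 1-D periodic halo exchange showing the MPI_Isend/MPI_Irecv/MPI_Waitall structure such grid codes use, with buffer sizes and neighbor topology invented for illustration.

    /* Generic non-blocking halo exchange with left/right neighbors on a
     * 1-D periodic ring; interior work can overlap the pending transfers. */
    #include <mpi.h>

    void halo_exchange(double *left_ghost, double *right_ghost,
                       double *left_edge, double *right_edge,
                       int n, MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;

        MPI_Request reqs[4];
        MPI_Irecv(left_ghost,  n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
        MPI_Irecv(right_ghost, n, MPI_DOUBLE, right, 1, comm, &reqs[1]);
        MPI_Isend(right_edge,  n, MPI_DOUBLE, right, 0, comm, &reqs[2]);
        MPI_Isend(left_edge,   n, MPI_DOUBLE, left,  1, comm, &reqs[3]);
        /* ... update interior cells here to overlap communication ... */
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    }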

  • 9.
    Iakymchuk, Roman
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST). KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
    Jordan, Herbert
    University of Innsbruck, Institute of Computer Science.
    Peng, Ivy Bo
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST). KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
    Markidis, Stefano
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST). KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
    Laure, Erwin
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST). KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
    A Particle-in-Cell Method for Automatic Load-Balancing with the AllScale Environment, 2016. Conference paper (Other academic)
    Abstract [en]

    We present an initial design and implementation of a Particle-in-Cell (PIC) method based on the work carried out in the European Exascale AllScale project. AllScale provides a unified programming system for the effective development of highly scalable, resilient and performance-portable parallel applications for Exascale systems. The AllScale approach is based on task-based nested recursive parallelism and it provides mechanisms for automatic load-balancing in the PIC simulations. We provide the preliminary results of the AllScale-based PIC implementation and draw directions for its future development. 

  • 10.
    Ivanov, Ilya
    et al.
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Gong, Jing
    KTH, Centres, SeRC - Swedish e-Science Research Centre. KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
    Akhmetova, Dana
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Peng, Ivy Bo
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Markidis, Stefano
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz). KTH, Centres, SeRC - Swedish e-Science Research Centre.
    Laure, Erwin
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz). KTH, Centres, SeRC - Swedish e-Science Research Centre.
    Machado, Rui
    Rahn, Mirko
    Bartsch, Valeria
    Hart, Alistair
    Fischer, Paul
    Evaluation of Parallel Communication Models in Nekbone, a Nek5000 mini-application, 2015. In: 2015 IEEE International Conference on Cluster Computing, IEEE, 2015, p. 760-767. Conference paper (Refereed)
    Abstract [en]

Nekbone is a proxy application of Nek5000, a scalable Computational Fluid Dynamics (CFD) code used for modelling incompressible flows. The Nekbone mini-application is used by several international co-design centers to explore new concepts in computer science and to evaluate their performance. We present the design and implementation of a new communication kernel in the Nekbone mini-application with the goal of studying the performance of different parallel communication models. First, a new MPI blocking communication kernel has been developed to solve Nekbone problems in a three-dimensional Cartesian mesh and process topology. The new MPI implementation delivers a 13% performance improvement compared to the original implementation. The new MPI communication kernel consists of approximately 500 lines of code against the original 7,000 lines of code, allowing experimentation with new approaches in Nekbone parallel communication. Second, the MPI blocking communication in the new kernel was changed to MPI non-blocking communication. Third, we developed a new Partitioned Global Address Space (PGAS) communication kernel based on the GPI-2 library. This approach reduces the synchronization among neighboring processes; in our tests on 8,192 processes, the GPI-2 communication kernel is on average 3% faster than the new MPI non-blocking communication kernel. In addition, we have used OpenMP in all versions of the new communication kernel. Finally, we highlight the future steps for using the new communication kernel in the parent application Nek5000.

  • 11.
    Johlander, A.
    et al.
    Schwartz, S. J.
    Vaivads, Andris
    Khotyaintsev, Yu. V.
    Gingell, I.
    Peng, Bo
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Lindqvist, Per-Arne
    KTH, School of Electrical Engineering (EES), Space and Plasma Physics.
    Ergun, R. E.
    Marklund, G. T.
    Plaschke, F.
    Magnes, W.
    Strangeway, R. J.
    Russell, C. T.
    Wei, H.
    Torbert, R. B.
    Paterson, W. R.
    Gershman, D. J.
    Dorelli, J. C.
    Avanov, L. A.
    Lavraud, B.
    Saito, Y.
    Giles, B. L.
    Pollock, C. J.
    Burch, J. L.
    Rippled Quasiperpendicular Shock Observed by the Magnetospheric Multiscale Spacecraft, 2016. In: Physical Review Letters, ISSN 0031-9007, E-ISSN 1079-7114, Vol. 117, no. 16, article id 165101. Article in journal (Refereed)
    Abstract [en]

    Collisionless shock nonstationarity arising from microscale physics influences shock structure and particle acceleration mechanisms. Nonstationarity has been difficult to quantify due to the small spatial and temporal scales. We use the closely spaced (subgyroscale), high-time-resolution measurements from one rapid crossing of Earth's quasiperpendicular bow shock by the Magnetospheric Multiscale (MMS) spacecraft to compare competing nonstationarity processes. Using MMS's high-cadence kinetic plasma measurements, we show that the shock exhibits nonstationarity in the form of ripples.

  • 12.
    Liu, Xueyang
    et al.
    Georgia Institute of Technology, Atlanta, GA, USA.
    Gonzalez-Guerrero, Patricia
    Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
    Peng, Ivy Bo
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Minnich, Ronald
    Samsung Semiconductor Inc., San Jose, CA, USA.
    Gokhale, Maya
    Lawrence Livermore National Laboratory, Livermore, CA, USA.
    Accelerator integration in a tile-based SoC: lessons learned with a hardware floating point compression engine, 2023. In: Proceedings of 2023 SC Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC Workshops 2023, Association for Computing Machinery (ACM), 2023, p. 1662-1669. Conference paper (Refereed)
    Abstract [en]

Heterogeneous Intellectual Property (IP) hardware acceleration engines have emerged as a viable path to improved performance in the waning era of Moore's Law and Dennard scaling. In this study, we design, prototype, and evaluate the HPC-specialized ZHW floating point compression accelerator as a resource on a System on Chip (SoC). Our full hardware/software implementation and evaluation reveal inefficiencies at the system level that significantly throttle the potential speedup of the ZHW accelerator. By optimizing data movement between the CPU, memory, and accelerator, a 6.9X speedup is possible compared to a RISC-V64 core, and 2.9X over a Mac M1 ARM core.

  • 13.
    Ma, Yingjuan
    et al.
    Univ Calif Los Angeles, Dept Earth Planetary & Space Sci, Los Angeles, CA 90095 USA..
    Russell, Christopher T.
    Univ Calif Los Angeles, Dept Earth Planetary & Space Sci, Los Angeles, CA 90095 USA..
    Toth, Gabor
    Univ Michigan, Dept Climate & Space Sci & Engn, Ann Arbor, MI 48109 USA..
    Chen, Yuxi
    Univ Michigan, Dept Climate & Space Sci & Engn, Ann Arbor, MI 48109 USA..
    Nagy, Andrew F.
    Univ Michigan, Dept Climate & Space Sci & Engn, Ann Arbor, MI 48109 USA..
    Harada, Yuki
    Univ Iowa, Dept Phys & Astron, Iowa City, IA 52242 USA..
    McFadden, James
    Univ Calif Berkeley, Space Sci Lab, Berkeley, CA 94720 USA..
    Halekas, Jasper S.
    Univ Iowa, Dept Phys & Astron, Iowa City, IA 52242 USA..
    Lillis, Rob
    Univ Calif Berkeley, Space Sci Lab, Berkeley, CA 94720 USA..
    Connerney, John E. P.
    NASA, Goddard Space Flight Ctr, Greenbelt, MD USA..
    Espley, Jared
    NASA, Goddard Space Flight Ctr, Greenbelt, MD USA..
    DiBraccio, Gina A.
    NASA, Goddard Space Flight Ctr, Greenbelt, MD USA..
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Peng, Ivy Bo
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Fang, Xiaohua
    Univ Colorado, Lab Atmospher & Space Phys, Boulder, CO 80309 USA..
    Jakosky, Bruce M.
    Univ Colorado, Lab Atmospher & Space Phys, Boulder, CO 80309 USA..
    Reconnection in the Martian Magnetotail: Hall-MHD With Embedded Particle-in-Cell Simulations, 2018. In: Journal of Geophysical Research - Space Physics, ISSN 2169-9380, E-ISSN 2169-9402, Vol. 123, no. 5, p. 3742-3763. Article in journal (Refereed)
    Abstract [en]

    Mars Atmosphere and Volatile EvolutioN (MAVEN) mission observations show clear evidence of the occurrence of the magnetic reconnection process in the Martian plasma tail. In this study, we use sophisticated numerical models to help us understand the effects of magnetic reconnection in the plasma tail. The numerical models used in this study are (a) a multispecies global Hall-magnetohydrodynamic (HMHD) model and (b) a global HMHD model two-way coupled to an embedded fully kinetic particle-in-cell code. Comparison with MAVEN observations clearly shows that the general interaction pattern is well reproduced by the global HMHD model. The coupled model takes advantage of both the efficiency of the MHD model and the ability to incorporate kinetic processes of the particle-in-cell model, making it feasible to conduct kinetic simulations for Mars under realistic solar wind conditions for the first time. Results from the coupled model show that the Martian magnetotail is highly dynamic due to magnetic reconnection, and the resulting Mars-ward plasma flow velocities are significantly higher for the lighter ion fluid, which are quantitatively consistent with MAVEN observations. The HMHD with Embedded Particle-in-Cell model predicts that the ion loss rates are more variable but with similar mean values as compared with HMHD model results.

  • 14.
    Markidis, Stefano
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Peng, Ivy Bo
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Iakymchuk, Roman
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Laure, Erwin
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Kestor, G.
    Gioiosa, R.
    A performance characterization of streaming computing on supercomputers, 2016. In: Procedia Computer Science, Elsevier, 2016, p. 98-107. Conference paper (Refereed)
    Abstract [en]

Streaming computing models allow for on-the-fly processing of large data sets. With the increased demand for processing large amounts of data in a reasonable period of time, streaming models are more and more used on supercomputers to solve data-intensive problems. Because supercomputers have been mainly used for compute-intensive workloads, supercomputer performance metrics focus on the number of floating point operations in time and cannot fully characterize a streaming application's performance on supercomputers. We introduce the injection and processing rates as the main metrics to characterize the performance of streaming computing on supercomputers. We analyze the dynamics of these quantities in a modified STREAM benchmark developed atop an MPI streaming library in a series of different configurations. We show that after a brief transient the injection and processing rates converge to sustained rates. We also demonstrate that streaming computing performance strongly depends on the number of connections between data producers and consumers and on the processing task granularity.
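
The two metrics the paper introduces are simple to state in code. The helper below is a hedged sketch, not the modified STREAM benchmark itself: it only shows how a consumer-side processing rate (bytes per second) could be measured with MPI_Wtime; the names and units are ours.

    /* Processing rate = bytes consumed / elapsed wall-clock time.
     * Call with a timestamp taken when consumption started. */
    #include <stddef.h>
    #include <mpi.h>

    double processing_rate(size_t bytes_processed, double t_start) {
        double elapsed = MPI_Wtime() - t_start;
        return (double)bytes_processed / elapsed;  /* bytes per second */
    }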

  • 15.
    Markidis, Stefano
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Peng, Ivy Bo
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Larsson Träff, Jesper
    Rougier, Antoine
    Bartsch, Valeria
    Machado, Rui
    Rahn, Mirko
    Hart, Alistair
    Holmes, Daniel
    Bull, Mark
    Laure, Erwin
    KTH, School of Computer Science and Communication (CSC), Centres, Centre for High Performance Computing, PDC.
    The EPiGRAM Project: Preparing Parallel Programming Models for Exascale, 2016. In: High Performance Computing, ISC High Performance 2016 International Workshops, Springer, 2016, p. 56-68. Conference paper (Refereed)
    Abstract [en]

    EPiGRAM is a European Commission funded project to improve existing parallel programming models to run efficiently large scale applications on exascale supercomputers. The EPiGRAM project focuses on the two current dominant petascale programming models, message-passing and PGAS, and on the improvement of two of their associated programming systems, MPI and GASPI. In EPiGRAM, we work on two major aspects of programming systems. First, we improve the performance of communication operations by decreasing the memory consumption, improving collective operations and introducing emerging computing models. Second, we enhance the interoperability of message-passing and PGAS by integrating them in one PGAS-based MPI implementation, called EMPI4Re, implementing MPI endpoints and improving GASPI interoperability with MPI. The new EPiGRAM concepts are tested in two large-scale applications, iPIC3D, a Particle-in-Cell code for space physics simulations, and Nek5000, a Computational Fluid Dynamics code.

  • 16.
    Markidis, Stefano
    et al.
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Vencels, Juris
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Peng, Ivy Bo
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Akhmetova, Dana
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Laure, Erwin
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Henri, Pierre
    Idle waves in high-performance computing, 2015. In: Physical Review E. Statistical, Nonlinear, and Soft Matter Physics, ISSN 1539-3755, E-ISSN 1550-2376, Vol. 91, no. 1, p. 013306. Article in journal (Refereed)
    Abstract [en]

    The vast majority of parallel scientific applications distributes computation among processes that are in a busy state when computing and in an idle state when waiting for information from other processes. We identify the propagation of idle waves through processes in scientific applications with a local information exchange between the two processes. Idle waves are nondispersive and have a phase velocity inversely proportional to the average busy time. The physical mechanism enabling the propagation of idle waves is the local synchronization between two processes due to remote data dependency. This study provides a description of the large number of processes in parallel scientific applications as a continuous medium. This work also is a step towards an understanding of how localized idle periods can affect remote processes, leading to the degradation of global performance in parallel scientific applications.

  • 17.
    Narasimhamurthy, S.
    et al.
    Danilov, N.
    Wu, S.
    Umanesan, G.
    Chien, Steven Wei Der
    KTH.
    Rivas-Gomez, Sergio
    KTH.
    Peng, Ivy Bo
    KTH.
    Laure, Erwin
    KTH.
    De Witt, S.
    Pleiter, D.
    Markidis, Stefano
    KTH.
    The SAGE project: A storage centric approach for exascale computing, 2018. In: 2018 ACM International Conference on Computing Frontiers, CF 2018 - Proceedings, Association for Computing Machinery (ACM), 2018, p. 287-292. Conference paper (Refereed)
    Abstract [en]

SAGE (Percipient StorAGe for Exascale Data Centric Computing) is a European Commission funded project towards the era of Exascale computing. Its goal is to design and implement a Big Data/Extreme Computing (BDEC) capable infrastructure with an associated software stack. The SAGE system follows a storage centric approach as it is capable of storing and processing large data volumes in the Exascale regime. SAGE addresses the convergence of Big Data Analysis and HPC in an era of next-generation data centric computing. This convergence is driven by the proliferation of massive data sources, such as large, dispersed scientific instruments and sensors, whose data needs to be processed, analyzed and integrated into simulations to derive scientific and innovative insights. A first prototype of the SAGE system has been implemented and installed at the Jülich Supercomputing Center. The SAGE storage system consists of multiple types of storage device technologies in a multi-tier I/O hierarchy, including flash, disk, and non-volatile memory technologies. The main SAGE software component is the Seagate Mero Object Storage, which is accessible via the Clovis API and higher level interfaces. The SAGE project also includes scientific applications for the validation of the SAGE concepts. The objective of this paper is to present the SAGE project concepts and the prototype of the SAGE platform, and to discuss the software architecture of the SAGE system.

  • 18.
    Narasimhamurthy, Sai
    et al.
    Seagate Syst UK, London, England..
    Danilov, Nikita
    Seagate Syst UK, London, England..
    Wu, Sining
    Seagate Syst UK, London, England..
    Umanesan, Ganesan
    Seagate Syst UK, London, England..
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Rivas-Gomez, Sergio
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Peng, Ivy Bo
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Laure, Erwin
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Pleiter, Dirk
    Julich Supercomp Ctr, Julich, Germany..
    de Witt, Shaun
    Culham Ctr Fus Energy, Abingdon, Oxon, England..
    SAGE: Percipient Storage for Exascale Data Centric Computing, 2019. In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 83, p. 22-33. Article in journal (Refereed)
    Abstract [en]

We aim to implement a Big Data/Extreme Computing (BDEC) capable system infrastructure, termed SAGE (Percipient StorAGe for Exascale Data Centric Computing), as we head towards the era of Exascale computing. The SAGE system will be capable of storing and processing immense volumes of data in the Exascale regime, and will provide the capability for Exascale-class applications to use such a storage infrastructure. SAGE addresses the increasing overlap between Big Data Analysis and HPC in an era of next-generation data centric computing that has developed due to the proliferation of massive data sources, such as large, dispersed scientific instruments and sensors, whose data needs to be processed, analysed and integrated into simulations to derive scientific and innovative insights. Indeed, Exascale I/O, a problem that has not been sufficiently addressed for simulation codes, is appropriately addressed by the SAGE platform. The objective of this paper is to discuss the software architecture of the SAGE system and look at early results we have obtained employing some of its key methodologies, as the system continues to evolve.

  • 19.
    Olshevsky, Vyacheslav
    et al.
    Deca, Jan
    Divin, Andrey
    Peng, Ivy Bo
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Innocenti, Maria Elena
    Cazzola, Emanuele
    Lapenta, Giovanni
    Magnetic Null Points In Kinetic Simulations of Space Plasmas, 2016. In: Astrophysical Journal, ISSN 0004-637X, E-ISSN 1538-4357, Vol. 819, no. 1, article id 52. Article in journal (Refereed)
    Abstract [en]

We present a systematic attempt to study magnetic null points and the associated magnetic energy conversion in kinetic particle-in-cell simulations of various plasma configurations. We address three-dimensional simulations performed with the semi-implicit kinetic electromagnetic code iPic3D in different setups: variations of a Harris current sheet, dipolar and quadrupolar magnetospheres interacting with the solar wind, and a relaxing turbulent configuration with multiple null points. Spiral nulls are more likely created in space plasmas: in all our simulations except the lunar magnetic anomaly (LMA) and quadrupolar mini-magnetosphere, the number of spiral nulls prevails over the number of radial nulls by a factor of 3-9. We show that often magnetic nulls do not indicate the regions of intensive energy dissipation. Energy dissipation events caused by topological bifurcations at radial nulls are rather rare and short-lived. The so-called X-lines formed by the radial nulls in the Harris current sheet and LMA simulations are rather stable and do not exhibit any energy dissipation. Energy dissipation is more powerful in the vicinity of spiral nulls enclosed by magnetic flux ropes with strong currents at their axes (their cross sections resemble 2D magnetic islands). These null lines, reminiscent of Z-pinches, efficiently dissipate magnetic energy due to secondary instabilities such as the two-stream or kinking instability, accompanied by changes in magnetic topology. Current enhancements accompanied by spiral nulls may signal magnetic energy conversion sites in the observational data.

  • 20.
    Peng, Bo
    et al.
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Markidis, Stefano
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Vaivads, A.
    Vencels, Juris
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Amaya, J.
    Divin, A.
    Laure, Erwin
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Lapenta, G.
    The formation of a magnetosphere with implicit Particle-in-Cell simulations, 2015. In: Procedia Computer Science, Elsevier, 2015, no. 1, p. 1178-1187. Conference paper (Refereed)
    Abstract [en]

We demonstrate the improvements to an implicit Particle-in-Cell code, iPic3D, using the example of a dipolar magnetic field immersed in a plasma flow, and show the formation of a magnetosphere. We address the problem of modelling multi-scale phenomena during the formation of a magnetosphere by implementing an adaptive sub-cycling technique to resolve the motion of particles located close to the magnetic dipole centre, where the magnetic field intensity is at its maximum. In addition, we implemented new open boundary conditions to model the inflow and outflow of plasma. We present the results of a global three-dimensional Particle-in-Cell simulation and discuss the performance improvements from the adaptive sub-cycling technique.
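
Adaptive sub-cycling can be sketched independently of iPic3D. The fragment below is purely illustrative and not the paper's implicit mover: it picks a number of sub-steps from the local gyrofrequency so particles near the dipole centre, where the field is strongest, advance with a smaller time step; the 0.1 resolution factor and the simplified 1-D push are assumptions.

    /* Sub-cycle a particle so each sub-step resolves a fraction of its
     * gyro-period; strong local fields imply more, shorter sub-steps. */
    #include <math.h>

    typedef struct { double x, v; } Particle;

    void push_subcycled(Particle *p, double B, double qm, double dt) {
        double omega = fabs(qm * B);            /* local gyrofrequency */
        int nsub = (int)ceil(omega * dt / 0.1); /* resolve ~0.1 rad/step */
        if (nsub < 1) nsub = 1;
        double dts = dt / nsub;
        for (int s = 0; s < nsub; s++) {
            p->v += qm * B * dts;   /* placeholder 1-D force term */
            p->x += p->v * dts;
        }
    }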

  • 21.
    Peng, I. B.
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Laure, Erwin
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    The cost of synchronizing imbalanced processes in message passing systems, 2015. In: Proceedings - IEEE International Conference on Cluster Computing, ICCC, Institute of Electrical and Electronics Engineers (IEEE), 2015, p. 408-417. Conference paper (Refereed)
    Abstract [en]

Synchronization in message passing systems is achieved by communication among processes. System and architectural noise and different workloads cause processes to be imbalanced and to reach synchronization points at different times. Thus, both communication and imbalance impact synchronization performance. In this paper, we study the algorithmic properties that allow the communication in synchronization to absorb the initial imbalance among processes. We quantify the imbalance absorption properties of different barrier algorithms using a LogP Monte Carlo simulator. We found that linear and f-way tournament barriers can absorb up to 95% of random exponential imbalance with the standard deviation equal to the communication time for one message. Dissemination, butterfly and pairwise exchange barriers, on the other hand, do not absorb imbalance but can effectively bound the post-barrier imbalance. We identify that synchronization transitions from communication-dominated to imbalance-dominated when the standard deviation of the imbalance distribution is more than twice the communication time for one message. In our study, f-way tournament barriers provided the best imbalance absorption rate and convenient communication time.
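
One of the algorithms compared above, the dissemination barrier, is compact enough to sketch. The version below is a minimal MPI rendering for illustration, not the paper's LogP Monte Carlo simulator: in round k, rank r signals rank (r + 2^k) mod P and waits for a signal from (r - 2^k + P) mod P, completing in ceil(log2 P) rounds.

    /* Dissemination barrier over point-to-point messages. */
    #include <mpi.h>

    void dissemination_barrier(MPI_Comm comm) {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        char snd = 0, rcv;
        for (int dist = 1; dist < size; dist <<= 1) {
            int to   = (rank + dist) % size;
            int from = (rank - dist + size) % size;
            MPI_Sendrecv(&snd, 1, MPI_CHAR, to, 99,
                         &rcv, 1, MPI_CHAR, from, 99,
                         comm, MPI_STATUS_IGNORE);
        }
    }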

  • 22.
    Peng, I. Bo
    et al.
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Markidis, Stefano
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Laure, Erwin
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Johlander, A.
    Vaivads, A.
    Khotyaintsev, Y.
    Henri, P.
    Lapenta, G.
    Kinetic structures of quasi-perpendicular shocks in global particle-in-cell simulations, 2015. In: Physics of Plasmas, ISSN 1070-664X, E-ISSN 1089-7674, Vol. 22, no. 9, article id 092109. Article in journal (Refereed)
    Abstract [en]

We carried out global Particle-in-Cell simulations of the interaction between the solar wind and a magnetosphere to study the kinetic collisionless physics in super-critical quasi-perpendicular shocks. After an initial simulation transient, a collisionless bow shock forms as a result of the interaction of the solar wind and a planetary magnetic dipole. The shock ramp has a thickness of approximately one ion skin depth and is followed by a trailing wave train in the shock downstream. At the downstream edge of the bow shock, whistler waves propagate along the magnetic field lines, and the presence of electron cyclotron waves has been identified. A small part of the solar wind ion population is specularly reflected by the shock, while a larger part is deflected and heated by the shock. Solar wind ions and electrons are heated in the perpendicular directions. Ions are accelerated in the perpendicular direction in the trailing wave train region. This work is an initial effort to study the electron and ion kinetic effects developed near the bow shock in a realistic magnetic field configuration.

  • 23.
    Peng, I. Bo
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Laure, Erwin
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Kestor, G.
    Gioiosa, R.
    Exploring Application Performance on Emerging Hybrid-Memory Supercomputers, 2017. In: Proceedings - 18th IEEE International Conference on High Performance Computing and Communications, 14th IEEE International Conference on Smart City and 2nd IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2016, Institute of Electrical and Electronics Engineers (IEEE), 2017, p. 473-480, article id 7828415. Conference paper (Refereed)
    Abstract [en]

Next-generation supercomputers will feature more hierarchical and heterogeneous memory systems with different memory technologies working side-by-side. A critical question is whether, at large scale, existing HPC applications and emerging data-analytics workloads will see performance improvement or degradation on these systems. We propose a systematic and fair methodology to identify the trend of application performance on emerging hybrid-memory systems. We model the memory system of next-generation supercomputers as a combination of 'fast' and 'slow' memories. We then analyze the performance and dynamic execution characteristics of a variety of workloads, from traditional scientific applications to emerging data analytics, to compare traditional and hybrid-memory systems. Our results show that data analytics applications can clearly benefit from the new system design, especially at large scale. Moreover, hybrid-memory systems do not penalize traditional scientific applications, which may also show performance improvement.

  • 24.
    Peng, I. Bo
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Laure, Erwin
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Kestor, G.
    Gioiosa, R.
    Idle period propagation in message-passing applications, 2017. In: Proceedings - 18th IEEE International Conference on High Performance Computing and Communications, 14th IEEE International Conference on Smart City and 2nd IEEE International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2016, Institute of Electrical and Electronics Engineers (IEEE), 2017, p. 937-944, article id 7828475. Conference paper (Refereed)
    Abstract [en]

Idle periods on different processes of Message Passing applications are unavoidable. While the origin of idle periods on a single process is well understood as the effect of random system and architectural delays, it is unclear how these idle periods propagate from one process to another. It is important to understand idle period propagation in Message Passing applications, as it allows application developers to design communication patterns that avoid idle period propagation and the consequent performance degradation in their applications. To understand idle period propagation, we introduce a methodology to trace idle periods when a process is waiting for data from a remote delayed process in MPI applications. We apply this technique in an MPI application that solves the heat equation to study idle period propagation on three different systems. We confirm that idle periods move between processes in the form of waves and that there are different stages in idle period propagation. Our methodology enables us to identify a self-synchronization phenomenon that occurs on two systems where some processes run slower than others.
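
The measurement idea behind the tracing methodology can be shown in a few lines. This is a hedged sketch, not the paper's instrumentation: it only times how long a process blocks waiting for a remote, possibly delayed sender, which bounds the induced idle period from above.

    /* Time the blocking wait for a message from a neighbor; the result
     * approximates the idle period induced by the remote delay. */
    #include <mpi.h>

    double timed_recv(double *buf, int n, int src, MPI_Comm comm) {
        double t0 = MPI_Wtime();
        MPI_Recv(buf, n, MPI_DOUBLE, src, 0, comm, MPI_STATUS_IGNORE);
        return MPI_Wtime() - t0;   /* seconds spent idle in the wait */
    }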

  • 25.
    Peng, Ivy Bo
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Data Movement on Emerging Large-Scale Parallel Systems, 2017. Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

Large-scale HPC systems are an important driver for solving computational problems in scientific communities. Next-generation HPC systems will not only grow in scale but also in heterogeneity. This increased system complexity entails more challenges to data movement in HPC applications. Data movement on emerging HPC systems requires asynchronous fine-grained communication and efficient data placement in the main memory. This thesis proposes innovative programming models and algorithms to prepare HPC applications for the next computing era: (1) a data streaming model that supports emerging data-intensive applications on supercomputers, (2) a decoupling model that improves parallelism and mitigates the impact of imbalance in applications, (3) a new framework and methodology for predicting the impact of large-scale heterogeneous memory systems on HPC applications, and (4) a data placement algorithm that uses a set of rules and a decision tree to determine the data-to-memory mapping in heterogeneous main memory.

    The proposed approaches in this thesis are evaluated on multiple supercomputers with different processors and interconnect networks. The evaluation uses a diverse set of applications that represent conventional scientific applications and emerging data-analytic workloads on HPC systems. The experimental results on the petascale testbed show that the approaches obtain increasing performance improvements as system scale increases and this trend supports the approaches as a valuable contribution towards future HPC systems.

  • 26.
    Peng, Ivy Bo
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Gioiosa, R.
    Kestor, G.
    Cicotti, P.
    Laure, Erwin
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Exploring the performance benefit of hybrid memory system on HPC environments, 2017. In: Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017, Institute of Electrical and Electronics Engineers (IEEE), 2017, p. 683-692, article id 7965110. Conference paper (Refereed)
    Abstract [en]

    Hardware accelerators have become a de-facto standard to achieve high performance on current supercomputers and there are indications that this trend will increase in the future. Modern accelerators feature high-bandwidth memory next to the computing cores. For example, the Intel Knights Landing (KNL) processor is equipped with 16 GB of high-bandwidth memory (HBM) that works together with conventional DRAM memory. Theoretically, HBM can provide ∼4× higher bandwidth than conventional DRAM. However, many factors impact the effective performance achieved by applications, including the application memory access pattern, the problem size, the threading level and the actual memory configuration. In this paper, we analyze the Intel KNL system and quantify the impact of the most important factors on the application performance by using a set of applications that are representative of scientific and data-analytics workloads. Our results show that applications with regular memory access benefit from MCDRAM, achieving up to 3× performance when compared to the performance obtained using only DRAM. On the contrary, applications with random memory access pattern are latency-bound and may suffer from performance degradation when using only MCDRAM. For those applications, the use of additional hardware threads may help hide latency and achieve higher aggregated bandwidth when using HBM.
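
Explicit MCDRAM placement of the kind this study exercises is commonly done with the memkind library. The sketch below is ours, not the paper's code: it allocates a bandwidth-critical array in high-bandwidth memory when available and falls back to DRAM otherwise.

    /* Allocate in HBM (MCDRAM on KNL) via memkind, falling back to the
     * default heap; callers must free with the matching allocator
     * (memkind_free for the HBM case, free for the fallback). */
    #include <memkind.h>
    #include <stdlib.h>

    double *alloc_hot_array(size_t n) {
        double *a = memkind_malloc(MEMKIND_HBW, n * sizeof(double));
        if (a == NULL)            /* no HBM present, or HBM exhausted */
            a = malloc(n * sizeof(double));
        return a;
    }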

  • 27.
    Peng, Ivy Bo
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Gioiosa, Roberto
    Kestor, Gokcen
    Cicotti, Pietro
    Laure, Erwin
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    RTHMS: A Tool for Data Placement on Hybrid Memory System, 2017. In: Proceedings of the 2017 ACM SIGPLAN International Symposium on Memory Management, ISMM 2017, Association for Computing Machinery (ACM), 2017, Vol. 52, no. 9, p. 82-91. Conference paper (Refereed)
    Abstract [en]

Traditional scientific and emerging data analytics applications require fast, power-efficient, large, and persistent memories. Combining all these characteristics within a single memory technology is expensive and hence future supercomputers will feature different memory technologies side-by-side. However, it is a complex task to program hybrid-memory systems and to identify the best object-to-memory mapping. We envision that programmers will probably resort to using default configurations that require only minimal interventions on the application code or system settings. In this work, we argue that intelligent, fine-grained data placement can achieve higher performance than default setups. We present an algorithm for data placement on hybrid-memory systems. Our algorithm is based on a set of single-object allocation rules and global data placement decisions. We also present RTHMS, a tool that implements our algorithm and provides recommendations about the object-to-memory mapping. Our experiments on a hybrid memory system, an Intel Knights Landing processor with DRAM and HBM, show that RTHMS is able to achieve higher performance than the default configuration. We believe that RTHMS will be a valuable tool for programmers working on complex hybrid-memory systems.
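
The flavor of RTHMS-style placement rules can be conveyed with a toy decision function. The thresholds and the object profile below are invented for illustration; the actual tool derives its rules from memory-access analysis and global capacity constraints.

    /* Toy rule: bandwidth-bound, shared objects go to HBM while capacity
     * remains; latency-bound or oversized objects stay in DRAM. */
    #include <stddef.h>

    typedef enum { PLACE_DRAM, PLACE_HBM } placement_t;

    typedef struct {
        size_t size_bytes;
        double streaming_ratio;   /* fraction of sequential accesses */
        int    nthreads_sharing;
    } object_profile_t;

    placement_t place_object(const object_profile_t *o, size_t *hbm_left) {
        int bandwidth_bound = o->streaming_ratio > 0.5 &&
                              o->nthreads_sharing > 1;
        if (bandwidth_bound && o->size_bytes <= *hbm_left) {
            *hbm_left -= o->size_bytes;
            return PLACE_HBM;
        }
        return PLACE_DRAM;
    }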

  • 28.
    Peng, Ivy Bo
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Gioiosa, Roberto
    Kestor, Gokcen
    Laure, Erwin
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Preparing HPC Applications for the Exascale Era: A Decoupling Strategy, 2017. In: 2017 46th International Conference on Parallel Processing (ICPP), IEEE Computer Society, 2017, p. 1-10, article id 8025274. Conference paper (Refereed)
    Abstract [en]

Production-quality parallel applications are often a mixture of diverse operations, such as computation- and communication-intensive, regular and irregular, tightly coupled and loosely linked operations. In the conventional construction of parallel applications, each process performs all the operations, which can be inefficient and seriously limit scalability, especially at large scale. We propose a decoupling strategy to improve the scalability of applications running on large-scale systems. Our strategy separates application operations onto groups of processes and enables a dataflow processing paradigm among the groups. This mechanism is effective in reducing the impact of load imbalance and increases the parallel efficiency by pipelining multiple operations. We provide a proof-of-concept implementation using MPI, the de-facto programming system on current supercomputers. We demonstrate the effectiveness of this strategy by decoupling the reduce, particle communication, halo exchange and I/O operations in a set of scientific and data-analytics applications. A performance evaluation on 8,192 processes of a Cray XC40 supercomputer shows that the proposed approach can achieve up to 4x performance improvement.
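
The decoupling strategy maps naturally onto MPI communicators. The sketch below is a minimal rendering of the idea under assumed parameters (a 1-in-8 I/O ratio), not the paper's implementation: processes split into a compute group and an I/O group that exchange data in a dataflow fashion.

    /* Split the job into compute and I/O groups with MPI_Comm_split. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int is_io = (rank % 8 == 0);    /* illustrative 1-in-8 ratio */
        MPI_Comm role_comm;             /* intra-group communicator  */
        MPI_Comm_split(MPI_COMM_WORLD, is_io, rank, &role_comm);

        if (is_io) {
            /* drain result blocks from compute ranks and perform I/O */
        } else {
            /* compute; forward results to the designated I/O rank */
        }

        MPI_Comm_free(&role_comm);
        MPI_Finalize();
        return 0;
    }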

  • 29.
    Peng, Ivy Bo
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Gioiosa, Roberto
    Pacific Northwest Natl Lab, Computat Sci & Math Div, Richland, WA 99352 USA.
    Kestor, Gokcen
    Pacific Northwest Natl Lab, Computat Sci & Math Div, Richland, WA 99352 USA.
    Laure, Erwin
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    MPI Streams for HPC Applications2017In: New Frontiers in High Performance Computing and Big Data / [ed] Geoffrey Fox, Vladimir Getov, Lucio Grandinetti, Gerhard Joubert, Thomas Sterling, IOS Press, 2017, p. 75-92Conference paper (Refereed)
    Abstract [en]

    A data stream is a sequence of data flowing between source and destination processes. Streaming is widely used in signal, image, and video processing for its efficiency in pipelining and its effectiveness in reducing memory demand. The goal of this work is to extend the use of data streams to support both conventional scientific applications and emerging data analytics applications running on HPC platforms. We introduce MPIStream, an extension to MPI, the de facto programming standard on HPC. MPIStream supports data streams either within a single application or among multiple applications. We present three use cases of MPI streams in HPC applications together with their parallel performance. We show the convenience of using MPI streams to support the needs of both traditional HPC and emerging data analytics applications running on supercomputers.

  • 30.
    Peng, Ivy Bo
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Laure, Erwin
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Holmes, D.
    Bull, M.
    A Data streaming model in MPI2015In: Proceedings of the 3rd ExaMPI Workshop at the International Conference on High Performance Computing, Networking, Storage and Analysis, SC 2015, ACM Digital Library, 2015Conference paper (Refereed)
    Abstract [en]

    The data streaming model is an effective way to tackle the challenge of data-intensive applications. As traditional HPC applications generate large volumes of data and more data-intensive applications move to HPC infrastructures, it is necessary to investigate the feasibility of combining the message-passing and streaming programming models. MPI, the de facto standard for programming on HPC, cannot intuitively express the communication pattern and the functional operations required in streaming models. In this work, we designed and implemented a data streaming library, MPIStream, atop MPI to allocate data producers and consumers, to stream data continuously or irregularly, and to process data at runtime. In the same spirit as the STREAM benchmark, we developed a parallel stream benchmark to measure the data processing rate. The performance of the library largely depends on the size of the stream element, the number of data producers and consumers, and the computational intensity of processing one stream element. With 2,048 data producers and 2,048 data consumers in the parallel benchmark, MPIStream achieved a 200 GB/s processing rate on a Blue Gene/Q supercomputer. We illustrate that a streaming library for HPC applications can effectively enable irregular parallel I/O, application monitoring, and threshold collective operations.
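
    A minimal sketch of the underlying producer-consumer pattern over plain MPI point-to-point calls, with a sentinel value closing the stream (illustrative only; MPIStream's actual API is richer), could look like this (run with at least two ranks):

        #include <mpi.h>
        #include <stdio.h>

        #define STREAM_END -1.0   /* sentinel closing the stream */

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            if (rank == 0) {                    /* producer */
                for (int i = 0; i < 100; i++) {
                    double elem = (double)i;
                    MPI_Send(&elem, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
                }
                double end = STREAM_END;
                MPI_Send(&end, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {             /* consumer */
                double elem, sum = 0.0;
                for (;;) {
                    MPI_Recv(&elem, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    if (elem == STREAM_END) break;
                    sum += elem;   /* process the element at runtime */
                }
                printf("consumer processed stream, sum = %.1f\n", sum);
            }
            MPI_Finalize();
            return 0;
        }

    The processing rate of such a stream depends, as the abstract notes, on the element size and on how much work the consumer does per element.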

  • 31.
    Peng, Ivy Bo
    et al.
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Vencels, Juris
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Lapenta, Giovanni
    Divin, Andrey
    Vaivads, Andris
    Laure, Erwin
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Markidis, Stefano
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Energetic particles in magnetotail reconnection2015In: Journal of Plasma Physics, ISSN 0022-3778, E-ISSN 1469-7807, Vol. 81, article id 325810202Article in journal (Refereed)
    Abstract [en]

    We carried out a 3D fully kinetic simulation of Earth's magnetotail magnetic reconnection to study the dynamics of energetic particles. We developed and implemented a new relativistic particle mover in iPIC3D, an implicit Particle-in-Cell code, to correctly model the dynamics of energetic particles. Before the onset of magnetic reconnection, energetic electrons are found localized close to the current sheet and accelerated by the lower hybrid drift instability. During magnetic reconnection, energetic particles are found in the reconnection region along the x-line and in the separatrix regions. The energetic electrons first appear in localized stripes of the separatrices and finally cover the entire separatrix surfaces. Along the separatrices, regions with strong electron deceleration are found. In the reconnection region, two categories of electron trajectory are identified: some electrons are trapped in the reconnection region, bouncing a few times between the outflow jets, while others pass through the reconnection region without being trapped. Unlike electrons, energetic ions are localized on the reconnection fronts of the outflow jets.
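
    For reference, a textbook relativistic Boris push, the standard building block for relativistic particle movers in PIC codes, is sketched below. iPIC3D's implicit mover differs in detail, so this is illustrative only:

        #include <math.h>

        /* One relativistic Boris push for a single particle.
         * u = gamma * v is the momentum per unit mass; q, m are charge
         * and mass, dt the time step, c the speed of light. */
        static void boris_push(double u[3], double x[3],
                               const double E[3], const double B[3],
                               double q, double m, double dt, double c) {
            const double h = q * dt / (2.0 * m);
            double um[3], t[3], s[3], up[3], tmp[3];

            for (int i = 0; i < 3; i++)          /* first half E kick */
                um[i] = u[i] + h * E[i];

            double gm = sqrt(1.0 +
                (um[0]*um[0] + um[1]*um[1] + um[2]*um[2]) / (c*c));
            for (int i = 0; i < 3; i++)          /* rotation vectors */
                t[i] = h * B[i] / gm;
            double t2 = t[0]*t[0] + t[1]*t[1] + t[2]*t[2];
            for (int i = 0; i < 3; i++)
                s[i] = 2.0 * t[i] / (1.0 + t2);

            /* u' = um + um x t ; u+ = um + u' x s (magnetic rotation) */
            tmp[0] = um[0] + um[1]*t[2] - um[2]*t[1];
            tmp[1] = um[1] + um[2]*t[0] - um[0]*t[2];
            tmp[2] = um[2] + um[0]*t[1] - um[1]*t[0];
            up[0] = um[0] + tmp[1]*s[2] - tmp[2]*s[1];
            up[1] = um[1] + tmp[2]*s[0] - tmp[0]*s[2];
            up[2] = um[2] + tmp[0]*s[1] - tmp[1]*s[0];

            for (int i = 0; i < 3; i++)          /* second half E kick */
                u[i] = up[i] + h * E[i];

            double g = sqrt(1.0 +
                (u[0]*u[0] + u[1]*u[1] + u[2]*u[2]) / (c*c));
            for (int i = 0; i < 3; i++)          /* position update */
                x[i] += dt * u[i] / g;
        }

    The Lorentz factor appears only through u, so the same scheme handles both nonrelativistic bulk particles and the energetic tail.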

  • 32.
    Rivas-Gomez, Sergei
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Gioiosa, R.
    Peng, Ivy Bo
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Kestor, G.
    Narasimhamurthy, S.
    Laure, Erwin
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    MPI windows on storage for HPC applications2017In: ACM International Conference Proceeding Series, Association for Computing Machinery (ACM) , 2017Conference paper (Refereed)
    Abstract [en]

    Upcoming HPC clusters will feature hybrid memories and storage devices per compute node. In this work, we propose to use the MPI one-sided communication model and MPI windows as a unified interface for programming memory and storage. We describe the design and implementation of MPI windows on storage and present their benefits for out-of-core execution, parallel I/O, and fault tolerance. Using a modified STREAM micro-benchmark, we measure the sustained bandwidth of MPI windows on storage against MPI memory windows and observe that only a 10% performance penalty is incurred. When using parallel file systems such as Lustre, asymmetric performance is observed, with a 10% performance penalty on read operations and a 90% penalty on write operations. Nonetheless, experimental results with a Distributed Hash Table and the HACC I/O kernel mini-application show that the overall penalty of MPI windows on storage can be negligible in most real-world applications.
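
    A sketch of the programming model follows: standard MPI-3 one-sided calls against a window that an implementation could back with storage. The info key below is purely hypothetical; the abstract does not describe the paper's actual mechanism for selecting storage-backed windows.

        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv) {   /* run with at least 2 ranks */
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            MPI_Info info;
            MPI_Info_create(&info);
            /* Hypothetical hint asking the implementation to back the
             * window with a file; the key name is illustrative only. */
            MPI_Info_set(info, "alloc_type", "storage");

            double *base;
            MPI_Win win;
            MPI_Win_allocate(1024 * sizeof(double), sizeof(double), info,
                             MPI_COMM_WORLD, &base, &win);

            MPI_Win_fence(0, win);
            if (rank == 0) {
                double value = 42.0;
                /* One-sided write into rank 1's window: the same call
                 * works whether the window is in memory or on storage. */
                MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
            }
            MPI_Win_fence(0, win);

            if (rank == 1) printf("rank 1 window[0] = %.1f\n", base[0]);

            MPI_Win_free(&win);
            MPI_Info_free(&info);
            MPI_Finalize();
            return 0;
        }

    The appeal of the approach is visible here: out-of-core data and in-memory data are accessed through the same MPI_Put/MPI_Get calls.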

  • 33.
    Rivas-Gomez, Sergio
    et al.
    KTH, School of Computer Science and Communication (CSC).
    Markidis, Stefano
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Peng, Ivy Bo
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Laure, E.
    Kestor, G.
    Gioiosa, R.
    Extending message passing interface windows to storage2017In: Proceedings - 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2017, Institute of Electrical and Electronics Engineers Inc. , 2017, p. 728-730Conference paper (Refereed)
    Abstract [en]

    This paper presents an extension to MPI supporting the one-sided communication model and window allocations in storage. Our design transparently integrates with current MPI implementations, enabling applications to target MPI windows in storage, memory, or both simultaneously, without major modifications. Initial performance results demonstrate that the presented MPI window extension could be helpful for a wide range of use cases, with low overhead.

  • 34. Toth, Gabor
    et al.
    Chen, Yuxi
    Gombosi, Tamas I.
    Cassak, Paul
    Markidis, Stefano
    KTH, Centres, SeRC - Swedish e-Science Research Centre.
    Peng, Ivy Bo
    KTH.
    Scaling the Ion Inertial Length and Its Implications for Modeling Reconnection in Global Simulations2017In: Journal of Geophysical Research - Space Physics, ISSN 2169-9380, E-ISSN 2169-9402, Vol. 122, no 10, p. 10336-10355Article in journal (Refereed)
    Abstract [en]

    We investigate the use of artificially increased ion and electron kinetic scales in global plasma simulations. We argue that as long as the global and ion inertial scales remain well separated, (1) the overall global solution is not strongly sensitive to the value of the ion inertial scale, while (2) the ion inertial scale dynamics will also be similar to the original system, but it occurs at a larger spatial scale, and (3) structures at intermediate scales, such as magnetic islands, grow in a self-similar manner. To investigate the validity and limitations of our scaling hypotheses, we carry out many simulations of a two-dimensional magnetosphere with the magnetohydrodynamics with embedded particle-in-cell (MHD-EPIC) model. The PIC model covers the dayside reconnection site. The simulation results confirm that the hypotheses hold as long as the increased ion inertial length remains less than about 5% of the magnetopause standoff distance. Since the theoretical arguments are general, we expect these results to carry over to three dimensions. The computational cost is reduced by the third and fourth powers of the scaling factor in two- and three-dimensional simulations, respectively, which can be many orders of magnitude. The present results suggest that global simulations that resolve kinetic scales for reconnection are feasible. This is a crucial step for applications to the magnetospheres of Earth, Saturn, and Jupiter and to the solar corona.
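
    The quoted cost reduction follows from a back-of-envelope argument: if the ion inertial length is scaled up by a factor f, the grid spacing can grow by f, shrinking the cell count per dimension, and a CFL-limited time step grows by f as well. Under these assumptions (a sketch, not the paper's full derivation):

        % d_i scaled by f; grid spacing tracks d_i; CFL-limited time step;
        % D is the spatial dimensionality.
        \Delta x \propto f \;\Rightarrow\; N_{\mathrm{cells}} \propto f^{-D},
        \qquad
        \Delta t \propto \Delta x \;\Rightarrow\; N_{\mathrm{steps}} \propto f^{-1},
        \qquad
        \mathrm{cost} \propto N_{\mathrm{cells}}\, N_{\mathrm{steps}}
        \propto f^{-(D+1)} =
        \begin{cases} f^{-3} & (D = 2)\\ f^{-4} & (D = 3) \end{cases}

    For example, scaling d_i up by a factor of 10 in 3D would cut the cost by roughly four orders of magnitude, which is what makes kinetic-scale reconnection affordable in global models.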

  • 35. Toth, Gabor
    et al.
    Jia, Xianzhe
    Markidis, Stefano
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Peng, Ivy Bo
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Chen, Yuxi
    Daldorff, Lars K. S.
    Tenishev, Valeriy M.
    Borovikov, Dmitry
    Haiducek, John D.
    Gombosi, Tamas I.
    Glocer, Alex
    Dorelli, John C.
    Extended magnetohydrodynamics with embedded particle-in-cell simulation of Ganymede's magnetosphere2016In: Journal of Geophysical Research - Space Physics, ISSN 2169-9380, E-ISSN 2169-9402, Vol. 121, no 2, p. 1273-1293Article in journal (Refereed)
    Abstract [en]

    We have recently developed a new modeling capability to embed the implicit particle-in-cell (PIC) model iPIC3D into the Block-Adaptive-Tree-Solarwind-Roe-Upwind-Scheme magnetohydrodynamic (MHD) model. The MHD with embedded PIC domains (MHD-EPIC) algorithm is a two-way coupled kinetic-fluid model. As one of the very first applications of the MHD-EPIC algorithm, we simulate the interaction between Jupiter's magnetospheric plasma and Ganymede's magnetosphere. We compare the MHD-EPIC simulations with pure Hall MHD simulations and compare both model results with Galileo observations to assess the importance of kinetic effects in controlling the configuration and dynamics of Ganymede's magnetosphere. We find that the Hall MHD and MHD-EPIC solutions are qualitatively similar, but there are significant quantitative differences. In particular, the density and pressure inside the magnetosphere show different distributions. For our baseline grid resolution the PIC solution is more dynamic than the Hall MHD simulation and it compares significantly better with the Galileo magnetic measurements than the Hall MHD solution. The power spectra of the observed and simulated magnetic field fluctuations agree extremely well for the MHD-EPIC model. The MHD-EPIC simulation also produced a few flux transfer events (FTEs) that have magnetic signatures very similar to an observed event. The simulation shows that the FTEs often exhibit complex 3-D structures with their orientations changing substantially between the equatorial plane and the Galileo trajectory, which explains the magnetic signatures observed during the magnetopause crossings. The computational cost of the MHD-EPIC simulation was only about 4 times more than that of the Hall MHD simulation.

  • 36. Vencels, J.
    et al.
    Delzanno, G. L.
    Manzini, G.
    Markidis, S.
    Peng, I. Bo
    KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST).
    Roytershteyn, V.
    SpectralPlasmaSolver: A Spectral Code for Multiscale Simulations of Collisionless, Magnetized Plasmas2016In: Journal of Physics, Conference Series, ISSN 1742-6588, E-ISSN 1742-6596, Vol. 719, no 1, article id 12022Article in journal (Refereed)
    Abstract [en]

    We present the design and implementation of a spectral code, called SpectralPlasmaSolver (SPS), for the solution of the multi-dimensional Vlasov-Maxwell equations. The method is based on a Hermite-Fourier decomposition of the particle distribution function. The code is written in Fortran and uses the PETSc library for solving the non-linear equations and preconditioning and the FFTW library for the convolutions. SPS is parallelized for shared-memory machines using OpenMP. As a verification example, we discuss simulations of the two-dimensional Orszag-Tang vortex problem and successfully compare them against a fully kinetic Particle-In-Cell simulation. An assessment of the performance of the code is presented, showing a significant improvement in the code running time achieved by preconditioning, while strong scaling tests show a factor of 10 speed-up using 16 threads.
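
    Schematically, the Hermite-Fourier ansatz expands the distribution function as follows (1D-1V shown for brevity; the notation here is ours, not necessarily the paper's):

        % \Psi_n: weighted Hermite basis functions; u_s, \alpha_s: velocity
        % shift and scaling for species s; C^s_{n,k}: the unknown
        % coefficients evolved by the nonlinear solver.
        f_s(x, v, t) = \sum_{n=0}^{N_H - 1} \sum_{k} C^{s}_{n,k}(t)\,
        \Psi_n\!\left(\frac{v - u_s}{\alpha_s}\right) e^{\,i k x}

    Substituting this ansatz into the Vlasov-Maxwell system turns the kinetic equation into coupled nonlinear equations for the coefficients, and the lowest-order Hermite coefficients correspond to the fluid moments, which is why the approach bridges fluid and kinetic descriptions.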

  • 37.
    Vencels, Juris
    et al.
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Delzanno, G. L.
    Johnson, A.
    Peng, I. Bo
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Laure, Erwin
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Markidis, Stefano
    KTH, School of Computer Science and Communication (CSC), High Performance Computing and Visualization (HPCViz).
    Spectral solver for multi-scale plasma physics simulations with dynamically adaptive number of moments2015In: Procedia Computer Science, Elsevier, 2015, no 1, p. 1148-1157Conference paper (Refereed)
    Abstract [en]

    A spectral method for kinetic plasma simulations based on the expansion of the velocity distribution function in a variable number of Hermite polynomials is presented. The method is based on a set of non-linear equations that is solved to determine the coefficients of the Hermite expansion satisfying the Vlasov and Poisson equations. In this paper, we first show that this technique combines the fluid and kinetic approaches into one framework. Second, we present an adaptive strategy to increase and decrease the number of Hermite functions dynamically during the simulation. The technique is applied to the Landau damping and two-stream instability test problems. Performance results show 21% and 47% savings in total simulation time for the Landau damping and two-stream instability test cases, respectively.

  • 38.
    Wahlgren, Jacob
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Schieffer, Gabin
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Gokhale, Maya
    Lawrence Livermore National Laboratory, Livermore, United States of America.
    Peng, Ivy Bo
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    A Quantitative Approach for Adopting Disaggregated Memory in HPC Systems2023In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2023, Association for Computing Machinery (ACM) , 2023, article id 60Conference paper (Refereed)
    Abstract [en]

    Memory disaggregation has recently been adopted in data centers to improve resource utilization, motivated by cost and sustainability. Recent studies on large-scale HPC facilities have also highlighted memory underutilization. A promising and non-disruptive option for memory disaggregation is rack-scale memory pooling, where node-local memory is supplemented by shared memory pools. This work outlines the prospects and requirements for adoption and clarifies several misconceptions. We propose a quantitative method for dissecting application requirements on the memory system from the top down in three levels, moving from general, to multi-tier memory systems, and then to memory pooling. We provide a multi-level profiling tool and LBench to facilitate the quantitative approach. We evaluate a set of representative HPC workloads on an emulated platform. Our results show that prefetching activities can significantly influence memory traffic profiles. Interference in memory pooling has varied impacts on applications, depending on their access ratios to memory tiers and their arithmetic intensities. Finally, in two case studies, we show the benefits of our findings at the application and system levels, achieving a 50% reduction in remote accesses and a 13% speedup in BFS, and reducing the performance variation of co-located workloads in interference-aware job scheduling.
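
    Emulated platforms for memory pooling are commonly built by binding allocations to a remote NUMA node that stands in for the pool. A minimal sketch with libnuma (an assumption for illustration; the paper's emulation setup may differ) follows:

        #include <numa.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define N (1UL << 26)   /* 512 MiB of doubles per tier */

        int main(void) {
            if (numa_available() < 0) {
                fprintf(stderr, "NUMA not available\n");
                return EXIT_FAILURE;
            }
            /* Emulate pooled memory by binding an allocation to a remote
             * NUMA node (node 1 here, a placeholder for a CPU-less node
             * configured to stand in for the pool). */
            double *pooled = numa_alloc_onnode(N * sizeof(double), 1);
            double *local  = numa_alloc_local(N * sizeof(double));
            if (!pooled || !local) {
                fprintf(stderr, "allocation failed\n");
                return EXIT_FAILURE;
            }
            /* Touch both tiers; a profiler can now compare the two
             * access streams (latency, bandwidth, traffic). */
            for (size_t i = 0; i < N; i++) {
                local[i] = (double)i;
                pooled[i] = 2.0 * local[i];
            }
            printf("local[1]=%.1f pooled[1]=%.1f\n", local[1], pooled[1]);
            numa_free(pooled, N * sizeof(double));
            numa_free(local, N * sizeof(double));
            return 0;
        }

    Compile with -lnuma. An application's sensitivity to the pool then shows up as the fraction of its accesses that land in the pooled tier, which is exactly the access-ratio metric the abstract refers to.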

  • 39.
    Williams, Jeremy J.
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Araújo De Medeiros, Daniel
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Peng, Ivy Bo
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Characterizing the Performance of the Implicit Massively Parallel Particle-in-Cell iPIC3D Code2023In: SC23 Proceedings: The International Conference for High Performance Computing, Networking, Storage, and Analysis, Denver, Colorado, USA, 2023Conference paper (Refereed)
    Abstract [en]

    Optimizing iPIC3D, an implicit Particle-in-Cell (PIC) code, for large-scale 3D plasma simulations is crucial for space and astrophysical applications. This work focuses on characterizing iPIC3D's communication efficiency through strategic measures like optimal node placement, communication and computation overlap, and load balancing. Profiling and tracing tools are employed to analyze iPIC3D's communication efficiency and provide practical recommendations. Implementing optimized communication protocols addresses the Geospace Environmental Modeling (GEM) magnetic reconnection challenges in plasma physics with more precise simulations. This approach captures the complexities of 3D plasma simulations, particularly in magnetic reconnection, advancing space and astrophysical research.

  • 40.
    Williams, Jeremy J.
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Tskhakaya, David
    Institute of Plasma Physics of the CAS, Prague, Czech Republic.
    Costea, Stefan
    LeCAD, University of Ljubljana, Ljubljana, Slovenia.
    Peng, Ivy Bo
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC. KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Garcia-Gasulla, Marta
    Barcelona Supercomputing Center, Barcelona, Spain.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). KTH, Centres, SeRC - Swedish e-Science Research Centre.
    Leveraging HPC Profiling & Tracing Tools to Understand the Performance of Particle-in-Cell Monte Carlo Simulations2023In: arXiv:2306.16512 / [ed] Demetris Zeinalipour, Limassol, Cyprus: Springer Nature, 2023, article id arXiv:2306.16512Conference paper (Refereed)
    Abstract [en]

    Large-scale plasma simulations are critical for designing and developing next-generation fusion energy devices and for modeling industrial plasmas. BIT1 is a massively parallel Particle-in-Cell code designed specifically for studying plasma-material interaction in fusion devices. Its most salient characteristic is the inclusion of Monte Carlo collision models for different plasma species. In this work, we characterize the single-node, multi-node, and I/O performance of the BIT1 code in two realistic cases by using several HPC profilers, such as perf, IPM, Extrae/Paraver, and Darshan. We find that the on-node performance of the BIT1 sorting function is the main bottleneck. Strong scaling tests show a parallel efficiency of 77% and 96% on 2,560 MPI ranks for the two test cases. We demonstrate that communication, load imbalance, and self-synchronization are important factors impacting the performance of BIT1 on large-scale runs.
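
    Particle sorting in PIC codes is typically a counting sort by cell index, which makes it memory-bandwidth-bound and a natural hot spot. A generic sketch of the pattern (illustrative; the abstract does not describe BIT1's actual routine) is shown below:

        #include <stdlib.h>
        #include <string.h>

        /* Reorder particles by cell index with a counting sort, the usual
         * PIC sorting pattern. pos/vel: particle arrays; n: particle
         * count; ncells: grid cells; dx: cell width. */
        void sort_by_cell(double *pos, double *vel, int n, int ncells,
                          double dx) {
            int *count = calloc(ncells + 1, sizeof(int));
            double *pos2 = malloc(n * sizeof(double));
            double *vel2 = malloc(n * sizeof(double));

            for (int p = 0; p < n; p++)        /* histogram of occupancy */
                count[(int)(pos[p] / dx) + 1]++;
            for (int c = 0; c < ncells; c++)   /* prefix sum -> offsets */
                count[c + 1] += count[c];
            for (int p = 0; p < n; p++) {      /* scatter in sorted order */
                int c = (int)(pos[p] / dx);
                int dst = count[c]++;
                pos2[dst] = pos[p];
                vel2[dst] = vel[p];
            }
            memcpy(pos, pos2, n * sizeof(double));
            memcpy(vel, vel2, n * sizeof(double));
            free(pos2); free(vel2); free(count);
        }

    The scatter phase performs one irregular write per particle, which is why such a routine can dominate on-node time once particle counts grow.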

  • 41.
    Yu, Yiqun
    et al.
    Beihang Univ, Sch Space & Environm, Beijing, Peoples R China.
    Delzanno, Gian Luca
    Los Alamos Natl Lab, Los Alamos, NM USA.
    Jordanova, Vania
    Los Alamos Natl Lab, Los Alamos, NM USA.
    Peng, Ivy Bo
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    PIC simulations of wave-particle interactions with an initial electron velocity distribution from a kinetic ring current model2018In: Journal of Atmospheric and Solar-Terrestrial Physics, ISSN 1364-6826, E-ISSN 1879-1824, Vol. 177, p. 169-178Article in journal (Refereed)
    Abstract [en]

    Whistler wave-particle interactions play an important role in the Earth's inner magnetospheric dynamics and have been the subject of numerous investigations. By running a global kinetic ring current model (RAM-SCB) for a storm event that occurred on 23-24 October 2002, we obtain the ring current electron distribution at a selected location at MLT of 9 and L of 6, where the electron distribution is composed of a warm population in the form of a partial ring in velocity space (with energy around 15 keV) in addition to a cool population with a Maxwellian-like distribution. The warm population likely originates from plasma sheet electrons injected during substorms, which supply a fresh source to the inner magnetosphere. These electron distributions are then used as input to an implicit particle-in-cell code (iPIC3D) to study whistler-wave generation and the subsequent wave-particle interactions. We find that whistler waves are excited and propagate in the quasi-parallel direction along the background magnetic field. Several different wave modes are instantaneously generated with different growth rates and frequencies. The wave mode with the maximum growth rate has a frequency around 0.62 ω_ce, which corresponds to a parallel resonant energy of 2.5 keV. Linear theory analysis of the wave growth is in excellent agreement with the simulation results. These waves grow initially due to the injected warm electrons and are later damped by cyclotron absorption by electrons whose energy is close to the resonant energy, which can effectively attenuate the waves. The warm electron population overall experiences a net energy loss and an anisotropy drop while moving along the diffusion surfaces towards regions of lower phase space density, while the cool electron population undergoes heating when the waves grow, suggesting cross-population interactions.
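
    The quoted parallel resonant energy follows from the first-order electron cyclotron resonance condition for parallel-propagating whistlers. A nonrelativistic sketch (the parallel wavenumber comes from the simulated dispersion, which the abstract does not give, so the 2.5 keV figure cannot be reproduced here):

        % Doppler-shifted wave frequency matches the electron gyrofrequency;
        % v_parallel < 0 means the resonant electrons counter-stream.
        \omega - k_{\parallel} v_{\parallel} = \omega_{ce}
        \;\Rightarrow\;
        v_{\parallel} = \frac{\omega - \omega_{ce}}{k_{\parallel}} < 0,
        \qquad
        E^{\mathrm{res}}_{\parallel} = \tfrac{1}{2}\, m_e v_{\parallel}^{2}

    With ω = 0.62 ω_ce and the k_∥ realized in the simulation, the paper evaluates this to about 2.5 keV.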
