Federated HPC and cloud resources are becoming increasingly strategic for providing diversified and geographically distributed computing resources. However, accessing data stores across HPC and cloud storage systems is challenging. Many cloud providers use object storage systems to support their clients in storing and retrieving data over the internet, typically through REST APIs atop the HTTP protocol, with Amazon's S3 API supported by most vendors. In contrast, HPC systems are contained within their own networks and tend to use parallel file systems with POSIX-like interfaces. This work addresses the challenge of diverse data stores on HPC and cloud systems by providing native object storage support through the unified MPI I/O interface in HPC applications. In particular, we provide a prototype library called LibCOS that transparently enables MPI applications running on HPC systems to access object storage on remote cloud systems. We evaluated LibCOS on a Ceph object storage system and a traditional HPC system. In addition, we conducted a performance characterization of the core S3 operations that enable individual and collective MPI I/O. Our evaluation on HACC, IOR, and BigSort shows that enabling diverse data stores on HPC and cloud storage is feasible and can be achieved transparently through the widely adopted MPI I/O interface. We also show that a native object storage system like Ceph can improve the scalability of I/O operations in parallel applications.
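The abstract does not show LibCOS internals. As a hedged illustration of the kind of translation such a library must perform, the sketch below maps an MPI-style (offset, count) read request onto S3 ranged GETs, assuming a hypothetical layout in which a file is striped across fixed-size objects. The function names, the striping scheme, and the object layout are illustrative assumptions, not the LibCOS API.

```python
def s3_range_header(offset: int, count: int) -> str:
    """Translate an MPI_File_read_at-style (offset, count) request into
    the HTTP Range header used by an S3 ranged GET.
    Note: HTTP byte ranges are inclusive on both ends."""
    if offset < 0 or count <= 0:
        raise ValueError("offset must be >= 0 and count > 0")
    return f"bytes={offset}-{offset + count - 1}"

def split_collective_read(offset: int, count: int, stripe: int):
    """Split one logical read into per-object (object-index, Range) pairs,
    assuming the file is striped across objects of `stripe` bytes each.
    A real library would issue one GET per pair, possibly in parallel."""
    parts = []
    while count > 0:
        obj = offset // stripe          # which object holds this byte
        local = offset % stripe         # offset inside that object
        n = min(count, stripe - local)  # bytes available in this object
        parts.append((obj, s3_range_header(local, n)))
        offset += n
        count -= n
    return parts
```

For example, a 60-byte read at offset 100 with 64-byte objects spans two objects and therefore becomes two ranged GETs.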
The conventional model of resource allocation in HPC systems is static: a job cannot leverage newly available resources in the system or release underutilized resources during execution. In this paper, we present Kub, a methodology that enables elastic execution of HPC workloads on Kubernetes, so that the resources allocated to a job can be dynamically scaled during execution. One main optimization of our method is to maximize the reuse of the originally allocated resources, minimizing disruption to the running job. The scaling procedure is coordinated among nodes through remote procedure calls on Kubernetes, a platform widely used for deploying workloads in the cloud. We evaluate our approach using one synthetic benchmark and two production-level MPI-based HPC applications, GROMACS and CM1. Our results demonstrate that the benefits of adapting the allocated resources depend on the workload characteristics. In the tested cases, a properly chosen scaling point for increasing resources during execution achieved up to 2x speedup. We also show that the overhead of checkpointing and data reshuffling significantly influences the selection of optimal scaling points and requires application-specific knowledge.
We perform a three-dimensional (3-D) global simulation of Earth's magnetosphere with kinetic reconnection physics to study the flux transfer events (FTEs) and dayside magnetic reconnection with the recently developed magnetohydrodynamics with embedded particle-in-cell model. During the 1 h long simulation, the FTEs are generated quasi-periodically near the subsolar point and move toward the poles. We find that the magnetic field signature of FTEs at their early formation stage is similar to a "crater FTE," which is characterized by a magnetic field strength dip at the FTE center. After the FTE core field grows to a significant value, it becomes an FTE with typical flux rope structure. When an FTE moves across the cusp, reconnection between the FTE field lines and the cusp field lines can dissipate the FTE. The kinetic features are also captured by our model. A crescent electron phase space distribution is found near the reconnection site. A similar distribution is found for ions at the location where the Larmor electric field appears. The lower hybrid drift instability (LHDI) along the current sheet direction also arises at the interface of magnetosheath and magnetosphere plasma. The LHDI electric field is about 8 mV/m, and its dominant wavelength relative to the electron gyroradius agrees reasonably with Magnetospheric Multiscale (MMS) observations.
Computationally intensive applications such as pattern recognition and natural language processing are increasingly popular on HPC systems. Many of these applications use deep learning, a branch of machine learning, to determine the weights of artificial neural network nodes by minimizing a loss function. Such applications depend heavily on dense matrix multiplications, also called tensorial operations. The use of Graphics Processing Units (GPUs) has considerably sped up deep-learning computations, leading to a renaissance of artificial neural networks. Recently, the NVIDIA Volta GPU and the Google Tensor Processing Unit (TPU) have been specially designed to support deep-learning workloads. New programming models have also emerged for the convenient expression of tensorial operations and deep-learning computational paradigms. An example of such new programming frameworks is TensorFlow, an open-source deep-learning library released by Google in 2015. TensorFlow expresses an algorithm as a computational graph in which nodes represent operations and edges represent the data flow between them. Multi-dimensional data such as vectors and matrices that flow between operations are called tensors; for this reason, computational problems need to be expressed as computational graphs. In particular, TensorFlow supports distributed computation with flexible assignment of operations and data to devices such as GPUs and CPUs on different computing nodes. Computation on devices is based on optimized kernels such as MKL, Eigen, and cuBLAS. Inter-node communication can go through TCP or RDMA. This work evaluates the usability and expressiveness of the TensorFlow programming model for traditional HPC problems. As an illustration, we prototyped a distributed block matrix multiplication for large dense matrices that cannot be co-located on a single device, and a Conjugate Gradient (CG) solver.
We evaluate the difficulty of expressing traditional HPC algorithms using computational graphs and study the scalability of distributed TensorFlow on accelerated systems. Our preliminary result with distributed matrix multiplication shows that distributed computation on TensorFlow is extremely scalable. This study provides an initial investigation of new emerging programming models for HPC.
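To make the block decomposition behind the distributed matrix multiplication concrete, here is a minimal pure-Python sketch, not the paper's TensorFlow prototype: each bs x bs block product below corresponds to one node of the computational graph that a distributed version could place on its own device.

```python
def block_matmul(A, B, bs):
    """Multiply square matrices A and B by decomposing them into
    bs x bs blocks. In a distributed computational graph, each inner
    block product would be one graph node assigned to its own device;
    here all blocks are computed locally for illustration."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, bs):
        for j0 in range(0, n, bs):
            for k0 in range(0, n, bs):
                # One block product: C[i0:i0+bs, j0:j0+bs] +=
                #   A[i0:i0+bs, k0:k0+bs] @ B[k0:k0+bs, j0:j0+bs]
                for i in range(i0, min(i0 + bs, n)):
                    for k in range(k0, min(k0 + bs, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + bs, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The result is independent of the block size, which is what allows the block size to be tuned to device memory capacity in the distributed setting.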
With the growing complexity of memory types, organizations, and placement, efficient use of memory systems remains a key objective in processing data-rich workloads. Heterogeneous memories, including HBM, conventional DRAM, and persistent memory, both local and network-attached, exhibit a wide range of latencies and bandwidths. The performance delivered to an application may vary widely depending on the workload and on interference from competing clients. Evaluating the impact of these emerging memory systems on applications challenges traditional simulation techniques. In this work, we describe VLD-sim, an FPGA-accelerated simulator designed to evaluate application performance in the presence of varying, non-deterministic latency. VLD-sim implements a statistical approach in which memory access latency is non-deterministic, as would occur when request traffic is generated by a large collection of possibly unrelated threads and compute nodes. VLD-sim runs on a Multi-Processor System on Chip with a hard CPU plus configurable logic, enabling fast evaluation of workloads or of individual applications. We evaluate VLD-sim with CPU-only and near-memory accelerator-enabled applications and compare against an idealized fixed-latency baseline. Our findings reveal and quantify the performance impact of non-deterministic latency on applications. With high flexibility and fast execution time, VLD-sim enables system-level evaluation of a large memory architecture design space.
Vlasiator is a popular and powerful massively parallel code for accurate magnetospheric and solar wind plasma simulations. This work provides an in-depth analysis of Vlasiator, focusing on MPI performance using the Integrated Performance Monitoring (IPM) tool. We show that MPI non-blocking point-to-point communication accounts for most of the communication time. The communication topology shows a large number of MPI messages exchanging data in a six-dimensional grid. We also show that relatively large messages, reaching up to 256 MB, are used in MPI communication. As Vlasiator is communication-bound, we found that its use of OpenMP is critical for eliminating intra-node communication. Our results provide important insights for optimizing Vlasiator for upcoming Exascale machines.
We present an initial design and implementation of a Particle-in-Cell (PIC) method based on the work carried out in the European Exascale project AllScale. AllScale provides a unified programming system for the effective development of highly scalable, resilient, and performance-portable parallel applications for Exascale systems. The AllScale approach is based on task-based nested recursive parallelism and provides mechanisms for automatic load balancing in PIC simulations. We present preliminary results of the AllScale-based PIC implementation and outline directions for its future development.
Nekbone is a proxy application of Nek5000, a scalable Computational Fluid Dynamics (CFD) code used for modelling incompressible flows. The Nekbone mini-application is used by several international co-design centers to explore new concepts in computer science and to evaluate their performance. We present the design and implementation of a new communication kernel in the Nekbone mini-application with the goal of studying the performance of different parallel communication models. First, a new MPI blocking communication kernel has been developed to solve Nekbone problems in a three-dimensional Cartesian mesh and process topology. The new MPI implementation delivers a 13% performance improvement compared to the original implementation, and it consists of approximately 500 lines of code against the original 7,000, allowing experimentation with new approaches to Nekbone parallel communication. Second, the MPI blocking communication in the new kernel was changed to MPI non-blocking communication. Third, we developed a new Partitioned Global Address Space (PGAS) communication kernel based on the GPI-2 library. This approach reduces the synchronization among neighbor processes: in our tests on 8,192 processes, the GPI-2 communication kernel is on average 3% faster than the new MPI non-blocking communication kernel. In addition, we used OpenMP in all versions of the new communication kernel. Finally, we highlight the future steps for using the new communication kernel in the parent application Nek5000.
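As a hedged sketch of the three-dimensional Cartesian process topology the new kernel relies on, the function below computes the six face neighbors of a rank in the spirit of MPI_Cart_shift with MPI's default row-major rank ordering. This is illustrative and is not Nekbone's actual code.

```python
def cart_neighbors(rank, dims):
    """Return the six face neighbors of `rank` in a 3-D non-periodic
    Cartesian process grid with extents `dims` = (px, py, pz),
    using row-major rank ordering (the MPI default). A None entry
    plays the role of MPI_PROC_NULL at a domain boundary."""
    px, py, pz = dims
    x, y, z = rank // (py * pz), (rank // pz) % py, rank % pz

    def to_rank(c):
        cx, cy, cz = c
        if not (0 <= cx < px and 0 <= cy < py and 0 <= cz < pz):
            return None  # outside the grid: no neighbor
        return (cx * py + cy) * pz + cz

    # -x, +x, -y, +y, -z, +z neighbors, in that order
    return [to_rank(c) for c in
            [(x - 1, y, z), (x + 1, y, z), (x, y - 1, z),
             (x, y + 1, z), (x, y, z - 1), (x, y, z + 1)]]
```

Each rank exchanges halo data with exactly these six neighbors, which is what keeps the communication pattern of such a kernel compact and regular.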
Collisionless shock nonstationarity arising from microscale physics influences shock structure and particle acceleration mechanisms. Nonstationarity has been difficult to quantify due to the small spatial and temporal scales. We use the closely spaced (subgyroscale), high-time-resolution measurements from one rapid crossing of Earth's quasiperpendicular bow shock by the Magnetospheric Multiscale (MMS) spacecraft to compare competing nonstationarity processes. Using MMS's high-cadence kinetic plasma measurements, we show that the shock exhibits nonstationarity in the form of ripples.
Heterogeneous Intellectual Property (IP) hardware acceleration engines have emerged as a viable path to improving performance in the waning of Moore's Law and Dennard scaling. In this study, we design, prototype, and evaluate the HPC-specialized ZHW floating-point compression accelerator as a resource on a System on Chip (SoC). Our full hardware/software implementation and evaluation reveal inefficiencies at the system level that significantly throttle the potential speedup of the ZHW accelerator. By optimizing data movement between the CPU, memory, and accelerator, a speedup of 6.9X over a RISC-V64 core and 2.9X over a Mac M1 ARM core is possible.
Mars Atmosphere and Volatile EvolutioN (MAVEN) mission observations show clear evidence of the occurrence of the magnetic reconnection process in the Martian plasma tail. In this study, we use sophisticated numerical models to help us understand the effects of magnetic reconnection in the plasma tail. The numerical models used in this study are (a) a multispecies global Hall-magnetohydrodynamic (HMHD) model and (b) a global HMHD model two-way coupled to an embedded fully kinetic particle-in-cell code. Comparison with MAVEN observations clearly shows that the general interaction pattern is well reproduced by the global HMHD model. The coupled model takes advantage of both the efficiency of the MHD model and the ability to incorporate kinetic processes of the particle-in-cell model, making it feasible to conduct kinetic simulations for Mars under realistic solar wind conditions for the first time. Results from the coupled model show that the Martian magnetotail is highly dynamic due to magnetic reconnection, and the resulting Mars-ward plasma flow velocities are significantly higher for the lighter ion fluid, which are quantitatively consistent with MAVEN observations. The HMHD with Embedded Particle-in-Cell model predicts that the ion loss rates are more variable but with similar mean values as compared with HMHD model results.
Streaming computing models allow for on-the-fly processing of large data sets. With the increased demand for processing large amounts of data in a reasonable period of time, streaming models are more and more used on supercomputers to solve data-intensive problems. Because supercomputers have been mainly used for compute-intensive workloads, supercomputer performance metrics focus on the number of floating-point operations per unit time and cannot fully characterize the performance of a streaming application on supercomputers. We introduce the injection and processing rates as the main metrics to characterize the performance of streaming computing on supercomputers. We analyze the dynamics of these quantities in a modified STREAM benchmark, developed atop an MPI streaming library, in a series of different configurations. We show that after a brief transient the injection and processing rates converge to sustained rates. We also demonstrate that streaming computing performance strongly depends on the number of connections between data producers and consumers and on the processing task granularity.
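The paper's formal definitions of the two rates are not reproduced in the abstract; the toy single-producer, single-consumer model below illustrates the idea under simplified assumptions. The injection rate measures how fast elements enter the stream, while the processing rate also accounts for the consumer's queueing delay.

```python
def sustained_rates(n_elements, elem_bytes, inject_dt, process_dt):
    """Toy streaming model: a producer injects one stream element every
    inject_dt seconds; a consumer needs process_dt seconds per element.
    Returns (injection_rate, processing_rate) in bytes/second over the
    whole run. Queued elements wait for the consumer (backpressure)."""
    t_inject = t_done = 0.0
    for _ in range(n_elements):
        t_inject += inject_dt
        # The consumer starts an element once it is both free and the
        # element has actually arrived.
        t_done = max(t_done, t_inject) + process_dt
    return (n_elements * elem_bytes / t_inject,
            n_elements * elem_bytes / t_done)
```

When processing is slower than injection, the processing rate converges to a sustained value set by the consumer, while the injection rate stays at the producer's pace, mirroring the transient-then-sustained behavior described above.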
EPiGRAM is a European Commission funded project to improve existing parallel programming models so that large-scale applications can run efficiently on exascale supercomputers. The EPiGRAM project focuses on the two currently dominant petascale programming models, message passing and PGAS, and on the improvement of two of their associated programming systems, MPI and GASPI. In EPiGRAM, we work on two major aspects of programming systems. First, we improve the performance of communication operations by decreasing memory consumption, improving collective operations, and introducing emerging computing models. Second, we enhance the interoperability of message passing and PGAS by integrating them in one PGAS-based MPI implementation, called EMPI4Re, implementing MPI endpoints, and improving GASPI interoperability with MPI. The new EPiGRAM concepts are tested in two large-scale applications: iPIC3D, a Particle-in-Cell code for space physics simulations, and Nek5000, a Computational Fluid Dynamics code.
The vast majority of parallel scientific applications distribute computation among processes that are in a busy state when computing and in an idle state when waiting for information from other processes. We identify the propagation of idle waves through the processes of scientific applications with local information exchange between neighboring processes. Idle waves are nondispersive and have a phase velocity inversely proportional to the average busy time. The physical mechanism enabling the propagation of idle waves is the local synchronization between two processes due to remote data dependency. This study provides a description of the large number of processes in parallel scientific applications as a continuous medium. This work is also a step towards understanding how localized idle periods can affect remote processes, leading to the degradation of global performance in parallel scientific applications.
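The propagation mechanism can be sketched with a minimal discrete model, an assumption-laden simplification rather than the study's actual methodology: a 1-D chain of synchronized processes alternates compute and neighbor exchange, and a single delay injected at one end travels down the chain one process per iteration, so its wall-clock speed scales inversely with the busy time.

```python
def idle_wave_arrival(n_procs, busy, delay, steps):
    """1-D chain of processes performing `steps` iterations of compute
    (`busy` seconds each) followed by an exchange with both neighbors.
    Process 0 suffers one extra `delay` in the first iteration. Returns,
    per process, the first iteration at which it idles (waits on a
    neighbor); None means it never idled."""
    t = [0.0] * n_procs                    # local clock of each process
    first_idle = [None] * n_procs
    for s in range(steps):
        t = [ti + busy + (delay if (s == 0 and i == 0) else 0.0)
             for i, ti in enumerate(t)]
        new_t = []
        for i in range(n_procs):
            nbrs = [t[j] for j in (i - 1, i + 1) if 0 <= j < n_procs]
            ready = max([t[i]] + nbrs)     # exchange completes when the
            if ready > t[i] and first_idle[i] is None:
                first_idle[i] = s          # slowest partner arrives
            new_t.append(ready)
        t = new_t
    return first_idle
```

In this model the idle wave crosses one process per iteration (one busy period of wall-clock time per hop), which is consistent with a phase velocity inversely proportional to the busy time.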
SAGE (Percipient StorAGe for Exascale Data Centric Computing) is a European Commission funded project towards the era of Exascale computing. Its goal is to design and implement a Big Data/Extreme Computing (BDEC) capable infrastructure with an associated software stack. The SAGE system follows a storage-centric approach, as it is capable of storing and processing large data volumes at the Exascale regime. SAGE addresses the convergence of Big Data analysis and HPC in an era of next-generation data-centric computing. This convergence is driven by the proliferation of massive data sources, such as large, dispersed scientific instruments and sensors, whose data needs to be processed, analyzed, and integrated into simulations to derive scientific and innovative insights. A first prototype of the SAGE system has been implemented and installed at the Jülich Supercomputing Centre. The SAGE storage system consists of multiple types of storage device technologies in a multi-tier I/O hierarchy, including flash, disk, and non-volatile memory technologies. The main SAGE software component is the Seagate Mero object storage, which is accessible via the Clovis API and higher-level interfaces. The SAGE project also includes scientific applications for the validation of the SAGE concepts. The objective of this paper is to present the SAGE project concepts and the prototype of the SAGE platform, and to discuss the software architecture of the SAGE system.
We aim to implement a Big Data/Extreme Computing (BDEC) capable system infrastructure, termed SAGE (Percipient StorAGe for Exascale Data Centric Computing), as we head towards the era of Exascale computing. The SAGE system will be capable of storing and processing immense volumes of data at the Exascale regime and will provide the capability for Exascale-class applications to use such a storage infrastructure. SAGE addresses the increasing overlap between Big Data analysis and HPC in an era of next-generation data-centric computing that has developed due to the proliferation of massive data sources, such as large, dispersed scientific instruments and sensors, whose data needs to be processed, analysed, and integrated into simulations to derive scientific and innovative insights. Indeed, Exascale I/O, a problem that has not been sufficiently addressed for simulation codes, is directly targeted by the SAGE platform. The objective of this paper is to discuss the software architecture of the SAGE system and present the early results we have obtained employing some of its key methodologies, as the system continues to evolve.
We present a systematic attempt to study magnetic null points and the associated magnetic energy conversion in kinetic particle-in-cell simulations of various plasma configurations. We address three-dimensional simulations performed with the semi-implicit kinetic electromagnetic code iPic3D in different setups: variations of a Harris current sheet, dipolar and quadrupolar magnetospheres interacting with the solar wind, and a relaxing turbulent configuration with multiple null points. Spiral nulls are more likely created in space plasmas: in all our simulations except the lunar magnetic anomaly (LMA) and quadrupolar mini-magnetosphere cases, the number of spiral nulls prevails over the number of radial nulls by a factor of 3-9. We show that magnetic nulls often do not indicate regions of intensive energy dissipation. Energy dissipation events caused by topological bifurcations at radial nulls are rather rare and short-lived. The so-called X-lines formed by the radial nulls in the Harris current sheet and LMA simulations are rather stable and do not exhibit any energy dissipation. Energy dissipation is more powerful in the vicinity of spiral nulls enclosed by magnetic flux ropes with strong currents at their axes (their cross sections resemble 2D magnetic islands). These null lines, reminiscent of Z-pinches, efficiently dissipate magnetic energy due to secondary instabilities, such as the two-stream or kink instability, accompanied by changes in magnetic topology. Current enhancements accompanied by spiral nulls may signal magnetic energy conversion sites in observational data.
We demonstrate improvements to the implicit Particle-in-Cell code iPic3D using the example of a dipolar magnetic field immersed in a plasma flow, showing the formation of a magnetosphere. We address the problem of modelling multi-scale phenomena during the formation of a magnetosphere by implementing an adaptive sub-cycling technique to resolve the motion of particles located close to the magnetic dipole centre, where the magnetic field intensity is at its maximum. In addition, we implemented new open boundary conditions to model the inflow and outflow of plasma. We present the results of a global three-dimensional Particle-in-Cell simulation and discuss the performance improvements from the adaptive sub-cycling technique.
Synchronization in message-passing systems is achieved by communication among processes. System and architectural noise and differing workloads cause processes to be imbalanced and to reach synchronization points at different times. Thus, both communication and imbalance impact synchronization performance. In this paper, we study the algorithmic properties that allow the communication in synchronization to absorb the initial imbalance among processes. We quantify the imbalance-absorption properties of different barrier algorithms using a LogP Monte Carlo simulator. We found that linear and f-way tournament barriers can absorb up to 95% of random exponential imbalance with a standard deviation equal to the communication time for one message. Dissemination, butterfly, and pairwise-exchange barriers, on the other hand, do not absorb imbalance but can effectively bound the post-barrier imbalance. We identify that synchronization transitions from communication-dominated to imbalance-dominated when the standard deviation of the imbalance distribution is more than twice the communication time for one message. In our study, f-way tournament barriers provided the best imbalance-absorption rate together with a favorable communication time.
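The absorption effect can be illustrated with a much simpler model than the paper's LogP simulator, so the sketch below is a hedged toy, not the study's methodology: in a linear barrier, rank 0 consumes the other ranks' notifications serially, so moderate random arrival imbalance hides behind the serial notification chain and barely changes the completion time.

```python
import random

def linear_barrier_time(arrivals, msg_time):
    """Simplified model of a linear barrier: each rank sends one
    notification to rank 0, which processes them serially (msg_time
    each, in arrival order) and then broadcasts a release (modeled as
    one more msg_time). Returns the barrier completion time."""
    t = 0.0
    for a in sorted(arrivals):
        t = max(t, a) + msg_time   # wait for the arrival, then consume it
    return t + msg_time            # idealized release

random.seed(1)
m, n = 1.0, 64
balanced = linear_barrier_time([0.0] * n, m)
# Random exponential imbalance with mean ~ one message time:
skewed = linear_barrier_time(
    [random.expovariate(1.0 / m) for _ in range(n)], m)
# The serial notification chain dominates, so the skew is absorbed and
# the completion time barely grows.
```

Dissemination-style barriers lack this serial chain (all ranks participate in every round), which is consistent with the observation above that they bound rather than absorb imbalance.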
We carried out global Particle-in-Cell simulations of the interaction between the solar wind and a magnetosphere to study the kinetic collisionless physics in super-critical quasi-perpendicular shocks. After an initial simulation transient, a collisionless bow shock forms as a result of the interaction of the solar wind with a planetary magnetic dipole. The shock ramp has a thickness of approximately one ion skin depth and is followed by a trailing wave train in the shock downstream. At the downstream edge of the bow shock, whistler waves propagate along the magnetic field lines, and the presence of electron cyclotron waves has been identified. A small part of the solar wind ion population is specularly reflected by the shock, while a larger part is deflected and heated by it. Solar wind ions and electrons are heated in the perpendicular directions, and ions are accelerated in the perpendicular direction in the trailing wave train region. This work is an initial effort to study the electron and ion kinetic effects that develop near the bow shock in a realistic magnetic field configuration.
Next-generation supercomputers will feature more hierarchical and heterogeneous memory systems, with different memory technologies working side by side. A critical question is whether, at large scale, existing HPC applications and emerging data-analytics workloads will see performance improvement or degradation on these systems. We propose a systematic and fair methodology to identify the trend of application performance on emerging hybrid-memory systems. We model the memory system of next-generation supercomputers as a combination of 'fast' and 'slow' memories, and then analyze the performance and dynamic execution characteristics of a variety of workloads, from traditional scientific applications to emerging data analytics, comparing traditional and hybrid-memory systems. Our results show that data-analytics applications can clearly benefit from the new system design, especially at large scale. Moreover, hybrid-memory systems do not penalize traditional scientific applications, which may also show performance improvements.
Idle periods on different processes of message-passing applications are unavoidable. While the origin of idle periods on a single process is well understood as the effect of random system and architectural delays, it is unclear how these idle periods propagate from one process to another. Understanding idle-period propagation in message-passing applications is important because it allows application developers to design communication patterns that avoid idle-period propagation and the consequent performance degradation in their applications. To understand idle-period propagation, we introduce a methodology to trace idle periods when a process is waiting for data from a remote delayed process in MPI applications. We apply this technique to an MPI application that solves the heat equation to study idle-period propagation on three different systems. We confirm that idle periods move between processes in the form of waves and that there are different stages in idle-period propagation. Our methodology enables us to identify a self-synchronization phenomenon that occurs on two systems where some processes run slower than the others.
Large-scale HPC systems are an important driver for solving computational problems in scientific communities. Next-generation HPC systems will grow not only in scale but also in heterogeneity. This increased system complexity entails more challenges to data movement in HPC applications. Data movement on emerging HPC systems requires asynchronous fine-grained communication and efficient data placement in the main memory. This thesis proposes innovative programming models and algorithms to prepare HPC applications for the next computing era: (1) a data streaming model that supports emerging data-intensive applications on supercomputers, (2) a decoupling model that improves parallelism and mitigates the impact of imbalance in applications, (3) a new framework and methodology for predicting the impact of large-scale heterogeneous memory systems on HPC applications, and (4) a data placement algorithm that uses a set of rules and a decision tree to determine the data-to-memory mapping in heterogeneous main memory.
The proposed approaches in this thesis are evaluated on multiple supercomputers with different processors and interconnect networks. The evaluation uses a diverse set of applications representing both conventional scientific applications and emerging data-analytics workloads on HPC systems. The experimental results on the petascale testbed show that the approaches achieve increasing performance improvements as the system scale increases, a trend that supports them as a valuable contribution towards future HPC systems.
Hardware accelerators have become a de facto standard for achieving high performance on current supercomputers, and there are indications that this trend will continue. Modern accelerators feature high-bandwidth memory next to the computing cores. For example, the Intel Knights Landing (KNL) processor is equipped with 16 GB of high-bandwidth memory (HBM) that works alongside conventional DRAM memory. Theoretically, HBM can provide ∼4× higher bandwidth than conventional DRAM. However, many factors impact the effective performance achieved by applications, including the application memory access pattern, the problem size, the threading level, and the actual memory configuration. In this paper, we analyze the Intel KNL system and quantify the impact of the most important factors on application performance using a set of applications that are representative of scientific and data-analytics workloads. Our results show that applications with regular memory access benefit from MCDRAM, achieving up to 3× higher performance compared to using only DRAM. On the contrary, applications with random memory access patterns are latency-bound and may suffer performance degradation when using only MCDRAM. For those applications, the use of additional hardware threads may help hide latency and achieve higher aggregated bandwidth when using HBM.
Traditional scientific and emerging data-analytics applications require fast, power-efficient, large, and persistent memories. Combining all these characteristics within a single memory technology is expensive, and hence future supercomputers will feature different memory technologies side by side. However, it is a complex task to program hybrid-memory systems and to identify the best object-to-memory mapping. We envision that programmers will probably resort to using default configurations that require only minimal interventions on the application code or system settings. In this work, we argue that intelligent, fine-grained data placement can achieve higher performance than default setups. We present an algorithm for data placement on hybrid-memory systems. Our algorithm is based on a set of single-object allocation rules and global data placement decisions. We also present RTHMS, a tool that implements our algorithm and provides recommendations about the object-to-memory mapping. Our experiments on a hybrid-memory system, an Intel Knights Landing processor with DRAM and HBM, show that RTHMS is able to achieve higher performance than the default configuration. We believe that RTHMS will be a valuable tool for programmers working on complex hybrid-memory systems.
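A rule-based placement algorithm of this kind can be sketched as a small decision chain. The rules below are illustrative assumptions about a DRAM+HBM node, not the actual RTHMS rule set.

```python
def place_object(size, bw_sensitive, latency_sensitive, hbm_free):
    """Toy single-object allocation rule chain for a DRAM+HBM node.
    Returns (target memory, remaining HBM capacity). The rule order
    and thresholds are hypothetical, for illustration only."""
    if size > hbm_free:
        return "DRAM", hbm_free          # rule 1: HBM capacity is scarce
    if latency_sensitive and not bw_sensitive:
        return "DRAM", hbm_free          # rule 2: HBM trades latency for bandwidth
    if bw_sensitive:
        return "HBM", hbm_free - size    # rule 3: streaming objects gain most from HBM
    return "DRAM", hbm_free              # default: save HBM for hot objects
```

A global placement pass would then apply such per-object rules in order of expected benefit, tracking the remaining HBM capacity across objects as the second return value does here.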
Production-quality parallel applications are often a mixture of diverse operations: computation- and communication-intensive, regular and irregular, tightly coupled and loosely linked. In the conventional construction of parallel applications, each process performs all the operations, which might be inefficient and seriously limit scalability, especially at large scale. We propose a decoupling strategy to improve the scalability of applications running on large-scale systems. Our strategy separates application operations onto groups of processes and enables a dataflow processing paradigm among the groups. This mechanism is effective in reducing the impact of load imbalance and increases parallel efficiency by pipelining multiple operations. We provide a proof-of-concept implementation using MPI, the de facto programming system on current supercomputers. We demonstrate the effectiveness of this strategy by decoupling the reduce, particle communication, halo exchange, and I/O operations in a set of scientific and data-analytics applications. A performance evaluation on 8,192 processes of a Cray XC40 supercomputer shows that the proposed approach can achieve up to 4x performance improvement.
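The dataflow idea between process groups can be sketched in miniature with threads standing in for MPI process groups; this is a hedged analogy, not the paper's MPI implementation. A compute group streams results through a bounded queue to a dedicated I/O group, so slow I/O no longer blocks every process.

```python
import queue
import threading

def decoupled_pipeline(chunks):
    """Sketch of the decoupling strategy: a 'compute group' produces
    results and a dedicated 'I/O group' consumes them concurrently,
    connected by a bounded queue (the dataflow channel). Threads stand
    in for MPI process groups here, purely for illustration."""
    q, out = queue.Queue(maxsize=4), []

    def compute_group():
        for c in chunks:
            q.put(c * c)        # stand-in for the computation phase
        q.put(None)             # end-of-stream marker

    def io_group():
        while (item := q.get()) is not None:
            out.append(item)    # stand-in for the I/O phase

    t1 = threading.Thread(target=compute_group)
    t2 = threading.Thread(target=io_group)
    t1.start(); t2.start(); t1.join(); t2.join()
    return out
```

The bounded queue also provides backpressure: if the I/O side falls behind, the compute side pauses instead of buffering unboundedly, which is one way such a pipeline can limit the impact of load imbalance.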
Data streams are a sequence of data flowing between source and destination processes. Streaming is widely used for signal, image, and video processing, for its efficiency in pipelining and its effectiveness in reducing demand for memory. The goal of this work is to extend the use of data streams to support both conventional scientific applications and emerging data-analytics applications running on HPC platforms. We introduce MPIStream, an extension to MPI, the de facto programming standard on HPC. MPIStream supports data streams either within a single application or among multiple applications. We present three use cases of MPI streams in HPC applications, together with their parallel performance, and show the convenience of using MPI streams to support the needs of both traditional HPC and emerging data-analytics applications running on supercomputers.
The data streaming model is an effective way to tackle the challenge of data-intensive applications. As traditional HPC applications generate large volumes of data and more data-intensive applications move to HPC infrastructures, it is necessary to investigate the feasibility of combining the message-passing and streaming programming models. MPI, the de facto standard for programming on HPC, cannot intuitively express the communication patterns and functional operations required in streaming models. In this work, we designed and implemented MPIStream, a data streaming library atop MPI, to allocate data producers and consumers, to stream data continuously or irregularly, and to process data at runtime. In the same spirit as the STREAM benchmark, we developed a parallel stream benchmark to measure the data processing rate. The performance of the library largely depends on the size of the stream element, the number of data producers and consumers, and the computational intensity of processing one stream element. With 2,048 data producers and 2,048 data consumers in the parallel benchmark, MPIStream achieved a 200 GB/s processing rate on a Blue Gene/Q supercomputer. We illustrate that a streaming library for HPC applications can effectively enable irregular parallel I/O, application monitoring, and threshold collective operations.
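The producer/consumer pattern described above can be sketched in miniature. This toy analogue is not MPIStream's API: it emulates a producer process and a consumer process with threads and a bounded queue, showing the two defining ingredients of the model, continuous element-wise delivery and per-element processing at runtime under a fixed memory budget.

```python
import queue
import threading

# Toy analogue (not MPIStream's API) of streaming between a data producer
# and a data consumer, emulated here with threads and a bounded buffer.

def producer(q, n):
    for i in range(n):
        q.put(i)          # stream one element
    q.put(None)           # end-of-stream marker

def consumer(q, out):
    while True:
        item = q.get()
        if item is None:
            break
        out.append(item * item)  # per-element processing at runtime

q = queue.Queue(maxsize=8)       # bounded buffer limits memory demand
results = []
t = threading.Thread(target=producer, args=(q, 100))
t.start()
consumer(q, results)
t.join()
print(len(results), results[:3])
```

The bounded `maxsize` is the analogue of a fixed stream buffer: the producer blocks when the consumer falls behind, which is how streaming keeps memory demand flat regardless of the total data volume.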
We carried out a 3D fully kinetic simulation of Earth's magnetotail magnetic reconnection to study the dynamics of energetic particles. We developed and implemented a new relativistic particle mover in iPIC3D, an implicit Particle-in-Cell code, to correctly model the dynamics of energetic particles. Before the onset of magnetic reconnection, energetic electrons are found localized close to the current sheet and accelerated by the lower hybrid drift instability. During magnetic reconnection, energetic particles are found in the reconnection region along the x-line and in the separatrix region. The energetic electrons first appear in localized stripes of the separatrices and finally cover all the separatrix surfaces. Along the separatrices, regions with strong electron deceleration are found. In the reconnection region, two categories of electron trajectory are identified: some electrons are trapped in the reconnection region, bouncing a few times between the outflow jets, while others pass through the reconnection region without being trapped. In contrast to electrons, energetic ions are localized at the reconnection fronts of the outflow jets.
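A relativistic particle mover of the kind mentioned above is typically built around the relativistic Boris scheme: a half electric kick, a norm-preserving magnetic rotation of the relativistic momentum, and a second half kick. The sketch below is only illustrative, in normalized units, and is not the production mover in iPIC3D.

```python
import math

# Minimal sketch of a relativistic Boris push on u = gamma*v, in normalized
# units (charge-to-mass ratio qm and speed of light c passed explicitly).
# Illustrative only; the paper's mover in iPIC3D is the real implementation.

def cross(a, b):
    return [a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0]]

def boris_push(u, E, B, qm, dt, c=1.0):
    """Advance the relativistic momentum u by one time step in fields E, B."""
    h = 0.5 * qm * dt
    u_minus = [u[i] + h * E[i] for i in range(3)]           # half electric kick
    gamma = math.sqrt(1.0 + sum(x * x for x in u_minus) / c**2)
    t = [h * B[i] / gamma for i in range(3)]                # rotation vector
    w = cross(u_minus, t)
    u_prime = [u_minus[i] + w[i] for i in range(3)]
    s = [2.0 * ti / (1.0 + sum(x * x for x in t)) for ti in t]
    w = cross(u_prime, s)
    u_plus = [u_minus[i] + w[i] for i in range(3)]          # magnetic rotation
    return [u_plus[i] + h * E[i] for i in range(3)]         # half electric kick

# Pure gyration in a uniform B field: |u| (hence the energy) is conserved.
u = [0.5, 0.0, 0.0]
for _ in range(1000):
    u = boris_push(u, [0.0, 0.0, 0.0], [0.0, 0.0, 1.0], qm=1.0, dt=0.05)
print(math.sqrt(sum(x * x for x in u)))
```

The magnetic step is a pure rotation of `u`, so with `E = 0` the momentum magnitude is conserved to round-off over arbitrarily many gyrations, which is the key property for correctly tracking energetic particles.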
Upcoming HPC clusters will feature hybrid memories and storage devices per compute node. In this work, we propose to use the MPI one-sided communication model and MPI windows as a unified interface for programming both memory and storage. We describe the design and implementation of MPI windows on storage and present their benefits for out-of-core execution, parallel I/O, and fault tolerance. Using a modified STREAM micro-benchmark, we measure the sustained bandwidth of MPI windows on storage against MPI memory windows and observe that only a 10% performance penalty is incurred. When using parallel file systems such as Lustre, asymmetric performance is observed, with a 10% performance penalty for read operations and a 90% penalty for write operations. Nonetheless, experimental results from a Distributed Hash Table and the HACC I/O kernel mini-application show that the overall penalty of MPI windows on storage can be negligible in most cases for real-world applications.
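The core idea of a storage-backed window can be seen in a small, self-contained analogue. This is not the MPI API: it uses a memory-mapped file so that plain loads and stores land in a storage-backed region, which is the same mechanism that lets a window allocation target a file instead of DRAM.

```python
import mmap
import os
import struct
import tempfile

# Toy analogue (not the MPI windows API) of a "window on storage": a
# memory-mapped file gives byte-addressable access to a storage-backed
# region, so stores and loads work like accesses to a memory window.

path = os.path.join(tempfile.mkdtemp(), "window.bin")
n = 1024
with open(path, "wb") as f:
    f.truncate(8 * n)                      # reserve space for n doubles

with open(path, "r+b") as f:
    win = mmap.mmap(f.fileno(), 8 * n)     # "window allocation" on storage
    for i in range(n):                     # store through the mapping
        struct.pack_into("d", win, 8 * i, float(i))
    win.flush()                            # akin to synchronizing the window
    total = sum(struct.unpack_from("d", win, 8 * i)[0] for i in range(n))
    win.close()

print(total)  # sum of 0.0 + 1.0 + ... + 1023.0
```

Because the OS pages data between DRAM and the file on demand, the same access pattern works whether the region fits in memory or not, which is what makes this model attractive for out-of-core execution.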
This paper presents an extension to MPI supporting the one-sided communication model and window allocations in storage. Our design transparently integrates with current MPI implementations, enabling applications to target MPI windows in storage, in memory, or in both simultaneously, without major modifications. Initial performance results demonstrate that the presented MPI window extension could be helpful for a wide range of use cases, with low overhead.
We investigate the use of artificially increased ion and electron kinetic scales in global plasma simulations. We argue that as long as the global and ion inertial scales remain well separated, (1) the overall global solution is not strongly sensitive to the value of the ion inertial scale, while (2) the ion inertial scale dynamics will also be similar to the original system, but it occurs at a larger spatial scale, and (3) structures at intermediate scales, such as magnetic islands, grow in a self-similar manner. To investigate the validity and limitations of our scaling hypotheses, we carry out many simulations of a two-dimensional magnetosphere with the magnetohydrodynamics with embedded particle-in-cell (MHD-EPIC) model. The PIC model covers the dayside reconnection site. The simulation results confirm that the hypotheses hold as long as the increased ion inertial length remains less than about 5% of the magnetopause standoff distance. Since the theoretical arguments are general, we expect these results to carry over to three dimensions. The computational cost is reduced by the third and fourth powers of the scaling factor in two- and three-dimensional simulations, respectively, which can be many orders of magnitude. The present results suggest that global simulations that resolve kinetic scales for reconnection are feasible. This is a crucial step for applications to the magnetospheres of Earth, Saturn, and Jupiter and to the solar corona.
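The cost argument in the abstract can be made concrete with a short worked example. Under the stated scaling, stretching the ion inertial length by a factor f coarsens the grid spacing by f, so the cell count drops by f^d in d dimensions, and the allowable time step grows by f, for a total cost reduction of f^(d+1).

```python
# Worked example of the cost scaling stated in the abstract: increasing the
# kinetic scales by a factor f reduces the cost of a d-dimensional kinetic
# simulation by f^(d+1): f^d fewer cells times f fewer time steps.

def cost_reduction(f, d):
    return f ** (d + 1)

for f in (2, 4, 16):
    print(f"f={f:2d}: 2D cost reduced {cost_reduction(f, 2):>6}x, "
          f"3D cost reduced {cost_reduction(f, 3):>6}x")
```

For example, a scaling factor of 16 in 3D yields a 16^4 = 65,536-fold reduction, which is the sense in which the savings "can be many orders of magnitude".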
We have recently developed a new modeling capability to embed the implicit particle-in-cell (PIC) model iPIC3D into the Block-Adaptive-Tree-Solarwind-Roe-Upwind-Scheme magnetohydrodynamic (MHD) model. The MHD with embedded PIC domains (MHD-EPIC) algorithm is a two-way coupled kinetic-fluid model. As one of the very first applications of the MHD-EPIC algorithm, we simulate the interaction between Jupiter's magnetospheric plasma and Ganymede's magnetosphere. We compare the MHD-EPIC simulations with pure Hall MHD simulations and compare both model results with Galileo observations to assess the importance of kinetic effects in controlling the configuration and dynamics of Ganymede's magnetosphere. We find that the Hall MHD and MHD-EPIC solutions are qualitatively similar, but there are significant quantitative differences. In particular, the density and pressure inside the magnetosphere show different distributions. For our baseline grid resolution the PIC solution is more dynamic than the Hall MHD simulation and it compares significantly better with the Galileo magnetic measurements than the Hall MHD solution. The power spectra of the observed and simulated magnetic field fluctuations agree extremely well for the MHD-EPIC model. The MHD-EPIC simulation also produced a few flux transfer events (FTEs) that have magnetic signatures very similar to an observed event. The simulation shows that the FTEs often exhibit complex 3-D structures with their orientations changing substantially between the equatorial plane and the Galileo trajectory, which explains the magnetic signatures observed during the magnetopause crossings. The computational cost of the MHD-EPIC simulation was only about 4 times more than that of the Hall MHD simulation.
We present the design and implementation of a spectral code, called SpectralPlasmaSolver (SPS), for the solution of the multi-dimensional Vlasov-Maxwell equations. The method is based on a Hermite-Fourier decomposition of the particle distribution function. The code is written in Fortran and uses the PETSc library for solving the non-linear equations and for preconditioning, and the FFTW library for the convolutions. SPS is parallelized for shared-memory machines using OpenMP. As a verification example, we discuss simulations of the two-dimensional Orszag-Tang vortex problem and successfully compare them against a fully kinetic Particle-In-Cell simulation. An assessment of the performance of the code is presented, showing a significant improvement in the code running time achieved by preconditioning, while strong scaling tests show a factor of 10 speed-up using 16 threads.
A spectral method for kinetic plasma simulations based on the expansion of the velocity distribution function in a variable number of Hermite polynomials is presented. The method is based on a set of non-linear equations that is solved to determine the coefficients of the Hermite expansion satisfying the Vlasov and Poisson equations. In this paper, we first show that this technique combines the fluid and kinetic approaches into one framework. Second, we present an adaptive strategy to increase and decrease the number of Hermite functions dynamically during the simulation. The technique is applied to the Landau damping and two-stream instability test problems. Performance results show 21% and 47% savings in total simulation time for the Landau damping and two-stream instability test cases, respectively.
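The adaptive idea rests on a simple property: for smooth, near-Maxwellian distributions the Hermite coefficients decay rapidly, so terms below a tolerance can be dropped. The sketch below is a numerical illustration, not the paper's solver: it projects a shifted Maxwellian-like function onto orthonormal Hermite functions and picks the truncation order where coefficients fall below a tolerance.

```python
import math

# Numerical sketch (not the paper's Vlasov solver): expand f(v) on the
# orthonormal Hermite functions and truncate adaptively where the
# coefficients drop below a tolerance.

def hermite_psi(n, v):
    """Orthonormal Hermite function psi_n(v), physicists' convention."""
    h_prev, h = 1.0, 2.0 * v              # H_0, H_1
    if n == 0:
        h = h_prev
    else:
        for k in range(1, n):             # recurrence H_{k+1} = 2v H_k - 2k H_{k-1}
            h_prev, h = h, 2.0 * v * h - 2.0 * k * h_prev
    norm = math.sqrt(2.0**n * math.factorial(n) * math.sqrt(math.pi))
    return h * math.exp(-0.5 * v * v) / norm

def coeff(f, n, vmax=12.0, steps=4000):
    """c_n = integral of f(v) psi_n(v) dv by simple quadrature."""
    dv = 2.0 * vmax / steps
    return sum(f(-vmax + i * dv) * hermite_psi(n, -vmax + i * dv)
               for i in range(steps + 1)) * dv

f = lambda v: math.exp(-0.5 * (v - 0.5) ** 2)   # shifted Maxwellian-like f
c = [coeff(f, n) for n in range(20)]
n_adapt = next(n for n, cn in enumerate(c) if abs(cn) < 1e-6)
print("adaptive truncation at n =", n_adapt)
print("Parseval check (should approach sqrt(pi)):",
      sum(cn * cn for cn in c))
```

The Parseval sum recovering the norm of f confirms the expansion, and the fast coefficient decay is what makes a small, dynamically chosen number of Hermite terms sufficient for near-equilibrium phases of a simulation.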
Memory disaggregation has recently been adopted in data centers to improve resource utilization, motivated by cost and sustainability. Recent studies on large-scale HPC facilities have also highlighted memory underutilization. A promising and non-disruptive option for memory disaggregation is rack-scale memory pooling, where node-local memory is supplemented by shared memory pools. This work outlines the prospects and requirements for adoption and clarifies several misconceptions. We propose a quantitative method for dissecting application requirements on the memory system from the top down at three levels, moving from general characteristics, to multi-tier memory systems, and then to memory pooling. We provide a multi-level profiling tool and LBench to facilitate the quantitative approach. We evaluate a set of representative HPC workloads on an emulated platform. Our results show that prefetching activities can significantly influence memory traffic profiles. Interference in memory pooling has varied impacts on applications, depending on their access ratios to memory tiers and their arithmetic intensities. Finally, in two case studies, we show the benefits of our findings at the application and system levels, achieving a 50% reduction in remote accesses and a 13% speedup in BFS, and reducing the performance variation of co-located workloads in interference-aware job scheduling.
Optimizing iPIC3D, an implicit Particle-in-Cell (PIC) code, for large-scale 3D plasma simulations is crucial for space and astrophysical applications. This work focuses on characterizing iPIC3D's communication efficiency through strategic measures like optimal node placement, communication and computation overlap, and load balancing. Profiling and tracing tools are employed to analyze iPIC3D's communication efficiency and provide practical recommendations. Implementing optimized communication protocols addresses the Geospace Environmental Modeling (GEM) magnetic reconnection challenge in plasma physics with more precise simulations. This approach captures the complexities of 3D plasma simulations, particularly in magnetic reconnection, advancing space and astrophysical research.
Large-scale plasma simulations are critical for designing and developing next-generation fusion energy devices and for modeling industrial plasmas. BIT1 is a massively parallel Particle-in-Cell code designed specifically for studying plasma-material interaction in fusion devices. Its most salient characteristic is the inclusion of collision Monte Carlo models for different plasma species. In this work, we characterize the single-node, multi-node, and I/O performance of the BIT1 code in two realistic cases using several HPC profilers, such as perf, IPM, Extrae/Paraver, and Darshan. We find that the on-node performance of the BIT1 sorting function is the main performance bottleneck. Strong scaling tests show a parallel efficiency of 77% and 96% on 2,560 MPI ranks for the two test cases. We demonstrate that communication, load imbalance, and self-synchronization are important factors impacting the performance of BIT1 in large-scale runs.
Whistler wave-particle interactions play an important role in the dynamics of Earth's inner magnetosphere and have been the subject of numerous investigations. By running a global kinetic ring current model (RAM-SCB) for a storm event that occurred on October 23-24, 2002, we obtain the ring current electron distribution at a selected location at MLT = 9 and L = 6, where the electron distribution is composed of a warm population in the form of a partial ring in velocity space (with energy around 15 keV) in addition to a cool population with a Maxwellian-like distribution. The warm population likely originates from plasma sheet electrons injected during substorms, which supply a fresh source of particles to the inner magnetosphere. These electron distributions are then used as input to an implicit particle-in-cell code (iPIC3D) to study whistler-wave generation and the subsequent wave-particle interactions. We find that whistler waves are excited and propagate in the quasi-parallel direction along the background magnetic field. Several different wave modes are generated simultaneously with different growth rates and frequencies. The wave mode with the maximum growth rate has a frequency around 0.62 ω_ce, which corresponds to a parallel resonant energy of 2.5 keV. Linear theory analysis of the wave growth is in excellent agreement with the simulation results. These waves grow initially due to the injected warm electrons and are later damped by cyclotron absorption by electrons whose energy is close to the resonant energy and which can effectively attenuate the waves. The warm electron population overall experiences a net energy loss and a drop in anisotropy while moving along the diffusion surfaces towards regions of lower phase space density, while the cool electron population undergoes heating as the waves grow, suggesting cross-population interactions.