Studies of complex physical and engineering systems, represented by multi-scale and multi-physics computer simulations, have an increasing demand for computing power, especially when simulations of realistic problems are considered. This demand is driven by the increasing size and complexity of the studied systems and by time constraints. Ultrascale computing systems offer a possible solution to this problem. Future ultrascale systems will be large-scale complex computing systems combining technologies from high performance computing, distributed systems, big data, and cloud computing. The challenge of developing and programming complex algorithms on these systems is therefore twofold. First, the complex algorithms have to be either developed from scratch or redesigned in order to yield high performance while retaining correct functional behaviour. Second, ultrascale computing systems impose a number of non-functional cross-cutting concerns, such as fault tolerance and energy consumption, which can significantly impact the deployment of applications on large complex systems. This article discusses the state of the art of programming for current and future large-scale systems, with an emphasis on complex applications. We derive a number of programming and execution support requirements by studying several computing applications that the authors are currently developing, and discuss their potential and necessary upgrades for ultrascale execution.
This paper focuses on using the Newton-Raphson method to solve power-flow problems. Since the most computationally demanding part of the Newton-Raphson method is solving the linear equations at each iteration, this study investigates different approaches to solving these equations on both the central processing unit (CPU) and the graphics processing unit (GPU). Six approaches have been developed and evaluated in this paper: two run entirely on the CPU, two run entirely on the GPU, and the remaining two are hybrid approaches that run on both. All six direct linear solvers use either LU or QR factorization to solve the linear equations. Two different hardware platforms have been used to conduct the experiments. The performance results show that the CPU version with LU factorization outperforms the GPU version using the standard cuSOLVER library, even for the larger power-flow problems. Moreover, the best performance is achieved with a hybrid method in which the Jacobian matrix is assembled on the GPU, the preprocessing with the sparse high-performance linear solver KLU is performed on the CPU in the first iteration, and the linear equations are factorized on the GPU and solved on the CPU. The maximum speedup in this study is obtained on the largest case, with 25,000 buses: the hybrid version shows a speedup factor of 9.6 with an NVIDIA P100 GPU and 13.1 with an NVIDIA V100 GPU in comparison with the baseline CPU version on an Intel Xeon Gold 6132 CPU.
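To make the structure of such a solver concrete, the following minimal Python sketch shows a Newton-Raphson loop whose per-iteration cost is dominated by a sparse LU factorization and solve, mirroring the pipeline described above. The residual and Jacobian functions are illustrative stand-ins for the power-flow mismatch equations, not the paper's code; the KLU/cuSOLVER variants share this structure but split assembly, factorization, and solve across CPU and GPU as described.

```python
# Sketch of a Newton-Raphson loop dominated by the sparse solve J(x) dx = -F(x).
# F and jacobian are illustrative stand-ins for the power-flow mismatch
# equations and their Jacobian.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def newton_raphson(F, jacobian, x0, tol=1e-8, max_iter=20):
    """Solve F(x) = 0; jacobian(x) must return a scipy.sparse matrix."""
    x = x0.copy()
    for _ in range(max_iter):
        f = F(x)
        if np.linalg.norm(f, np.inf) < tol:
            break
        J = jacobian(x).tocsc()   # assembly (done on the GPU in the hybrid version)
        lu = spla.splu(J)         # LU factorization: the per-iteration bottleneck
        x += lu.solve(-f)         # forward/backward triangular solves
    return x

# Toy usage: a small nonlinear system standing in for power-flow mismatches.
F = lambda x: np.array([x[0]**2 + x[1] - 2.0, x[0] - x[1]])
Jac = lambda x: sp.csc_matrix([[2 * x[0], 1.0], [1.0, -1.0]])
print(newton_raphson(F, Jac, np.array([1.5, 0.5])))   # converges to [1. 1.]
```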
Stable and conservative interface boundary conditions are developed for the unsteady compressible Navier-Stokes equations using finite difference and finite volume methods. The finite difference approach is based on summation-by-parts operators, can be made higher order accurate, and imposes boundary conditions weakly. The finite volume approach is an edge- and dual-grid-based approach for unstructured grids, formally second order accurate in space, also with weak boundary conditions. Stable and conservative weak boundary conditions are derived for interfaces between finite difference methods, between finite volume methods, and for the coupling between the two approaches. The three types of interface boundary conditions are demonstrated for two test cases. First, inviscid vortex propagation with a known analytical solution is considered. The results show the expected error decay as the grid is refined, for various couplings and spatial accuracies of the finite difference scheme. The second test case involves viscous laminar flow over a cylinder with vortex shedding. Calculations with various couplings and spatial accuracies of the finite difference solver show that the couplings work as expected and that the higher order finite difference schemes provide enhanced vortex propagation.
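The summation-by-parts (SBP) plus weak-penalty machinery underlying this and the related coupling papers can be sketched on the scalar model problem u_t + u_x = 0 rather than the full Navier-Stokes equations; the following LaTeX block uses our own notation and is only an illustration of the type of energy estimate involved:

```latex
% Sketch, in our notation, for the scalar model problem u_t + u_x = 0 on [0,1].
% An SBP first-derivative operator and the weak (SAT) inflow penalty:
\[
  D = H^{-1}Q, \qquad H = H^{T} > 0, \qquad Q + Q^{T} = \mathrm{diag}(-1,0,\dots,0,1),
\]
\[
  \frac{du}{dt} = -Du + \tau H^{-1} e_{0}\,(u_{0} - g), \qquad e_{0} = (1,0,\dots,0)^{T}.
\]
% The energy method with \(\lVert u\rVert_{H}^{2} = u^{T}Hu\) and data g = 0 gives
\[
  \frac{d}{dt}\lVert u\rVert_{H}^{2}
  = -u^{T}(Q+Q^{T})u + 2\tau u_{0}^{2}
  = (1+2\tau)\,u_{0}^{2} - u_{N}^{2} \le 0
  \quad \text{for } \tau \le -\tfrac{1}{2},
\]
% i.e. stability; the interface couplings above are built from the same kind
% of estimate, with penalty terms on both sides of the interface.
```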
We present a hybrid GPU implementation and performance analysis of Nekbone, which represents one of the core kernels of the incompressible Navier-Stokes solver Nek5000. The implementation is based on OpenACC and CUDA Fortran for local parallelization of the compute-intensive matrix-matrix multiplication part, which keeps modifications to the existing CPU code small while extending the simulation capability of the code to GPU architectures. Our discussion includes GPU results for OpenACC interoperating with CUDA Fortran and for the gather-scatter operations with GPUDirect communication. We demonstrate performance of up to 552 Tflops on 16,384 GPUs of the OLCF Cray XK7 Titan.
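The compute-intensive matrix-matrix multiplication in question applies small dense derivative matrices along each dimension of every element's tensor-product basis. A minimal NumPy sketch of this pattern follows; the real kernel is Fortran/CUDA Fortran, and the names and shapes here are illustrative only.

```python
# NumPy sketch of the tensor-product "mxm" pattern at the core of Nekbone:
# a small dense derivative matrix applied along each dimension of every
# spectral element. Names and shapes are illustrative.
import numpy as np

def local_grad(u, D):
    """u: (nelt, n, n, n) nodal values per element; D: (n, n) derivative matrix.
    Returns derivatives in the three reference directions."""
    ur = np.einsum('ij,eabj->eabi', D, u)   # apply D along the first direction
    us = np.einsum('ij,eajb->eaib', D, u)   # apply D along the second direction
    ut = np.einsum('ij,ejab->eiab', D, u)   # apply D along the third direction
    return ur, us, ut

n, nelt = 8, 64                  # points per direction, number of elements
D = np.random.rand(n, n)         # stand-in for the spectral differentiation matrix
u = np.random.rand(nelt, n, n, n)
ur, us, ut = local_grad(u, D)
print(ur.shape)                  # (64, 8, 8, 8)
```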
Accelerators and, in particular, Graphics Processing Units (GPUs) have emerged as promising computing technologies that may be suitable for future exascale systems. Here, we present performance results for NekBone, a benchmark of the Nek5000 code, implemented with optimized OpenACC directives and GPUDirect communications. Nek5000 is a computational fluid dynamics code based on the spectral element method, used for the simulation of incompressible flow. An optimized NekBone version reaches 78 Gflops on a single node. In addition, a performance of 609 Tflops has been reached on 16,384 GPUs of the Titan supercomputer at Oak Ridge National Laboratory.
Nek5000 is a computational fluid dynamics code based on the spectral element method, used for the simulation of incompressible flows. We follow up on an earlier study that ported a simplified version of Nek5000 to a GPU-accelerated system by presenting a hybrid CPU/GPU implementation of the full Nek5000 code using OpenACC. The matrix-matrix multiplication, the Nek5000 gather-scatter operator, and a preconditioned Conjugate Gradient solver have been implemented using OpenACC for multi-GPU systems. We report a speedup of 1.3 on a single node of a Cray XK6 when using OpenACC directives in Nek5000; on 512 nodes of the Titan supercomputer, the speedup approaches 1.4. A performance analysis of the Nek5000 code using the Score-P and Vampir performance monitoring tools shows that overlapping GPU kernels with host-accelerator memory transfers would considerably increase the performance of the OpenACC version of Nek5000.
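The gather-scatter operator mentioned above sums values at grid points shared between neighboring elements (direct stiffness summation). A compact NumPy sketch of the idea follows; the production operator (gslib) additionally exchanges shared points across MPI ranks, and the names here are illustrative.

```python
# Sketch of the gather-scatter ("direct stiffness summation") operation:
# values at grid points shared by neighboring elements are summed and
# written back to every local copy.
import numpy as np

def gather_scatter(u_local, glo_num):
    """u_local: flat array of element-local values;
    glo_num: global id of each local point (shared points repeat ids)."""
    nglobal = glo_num.max() + 1
    u_global = np.zeros(nglobal)
    np.add.at(u_global, glo_num, u_local)   # gather: sum duplicated points
    return u_global[glo_num]                # scatter: copy sums back

# Two 1-D linear elements sharing their middle node (global ids 0,1,1,2):
u = np.array([1.0, 2.0, 3.0, 4.0])
print(gather_scatter(u, np.array([0, 1, 1, 2])))   # [1. 5. 5. 4.]
```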
In this paper, we present a stable hybrid scheme for viscous problems. The hybrid method combines the unstructured finite volume method with high-order finite difference methods on complex geometries. The coupling procedure between the two numerical methods is based on energy estimates, and stable interface conditions are constructed. Numerical calculations show that the hybrid method is efficient and accurate.
We investigate several existing interface procedures for finite difference methods applied to advection-diffusion problems. The accuracy, stiffness and reflecting properties of various interface procedures are investigated. The analysis and numerical experiments show that there are only minor differences between various methods once a proper parameter choice has been made.
Nekbone is a proxy application of Nek5000, a scalable computational fluid dynamics (CFD) code used for modelling incompressible flows. The Nekbone mini-application is used by several international co-design centers to explore new concepts in computer science and to evaluate their performance. We present the design and implementation of a new communication kernel in the Nekbone mini-application with the goal of studying the performance of different parallel communication models. First, a new MPI blocking communication kernel has been developed to solve Nekbone problems in a three-dimensional Cartesian mesh and process topology. The new MPI implementation delivers a 13% performance improvement compared to the original implementation, and it consists of approximately 500 lines of code against the original 7,000, allowing experimentation with new approaches to Nekbone parallel communication. Second, the MPI blocking communication in the new kernel was changed to MPI non-blocking communication. Third, we developed a new Partitioned Global Address Space (PGAS) communication kernel based on the GPI-2 library. This approach reduces the synchronization among neighbor processes: in our tests on 8,192 processes, the GPI-2 communication kernel is on average 3% faster than the new MPI non-blocking communication kernel. In addition, OpenMP is used in all versions of the new communication kernel. Finally, we highlight the future steps for using the new communication kernel in the parent application Nek5000.
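The blocking versus non-blocking distinction compared above can be illustrated with a minimal mpi4py sketch; a 1-D ring stands in for Nekbone's 3-D Cartesian neighbor topology, the file name is illustrative, and the GPI-2 (PGAS) kernel instead uses one-sided writes with weaker synchronization.

```python
# Minimal mpi4py sketch contrasting blocking and non-blocking neighbor
# exchanges on a 1-D ring (a stand-in for the 3-D Cartesian topology).
# Run with e.g.: mpirun -n 4 python halo.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

send = np.full(4, rank, dtype='d')
recv = np.empty(4, dtype='d')

# Blocking variant: Sendrecv avoids the deadlock of naive paired Send/Recv.
comm.Sendrecv(sendbuf=send, dest=right, recvbuf=recv, source=left)

# Non-blocking variant: post receive and send, overlap local work, then wait.
req = [comm.Irecv(recv, source=left), comm.Isend(send, dest=right)]
local_work = send.sum()          # computation overlapped with communication
MPI.Request.Waitall(req)
print(f"rank {rank} received from {left}: {recv[0]}")
```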
GPU-accelerated computing is becoming a popular technology due to the emergence of techniques such as OpenACC, which makes it easy to port codes in their original form to GPU systems using compiler directives, thereby speeding up computations with relatively little effort. In this study we have developed an OpenACC implementation of the high order finite difference CFD solver ESSENSE for simulating compressible flows. The solver is based on difference operators in summation-by-parts form, with boundary and interface conditions weakly implemented using simultaneous approximation terms. This case study focuses on porting the most time-consuming parts of the code to GPUs, namely the sparse matrix-vector multiplications and the evaluation of fluxes. The resulting OpenACC implementation is used to simulate the Taylor-Green vortex and yields a maximum speedup of 61.3 on a single V100 GPU compared to the serial CPU version.
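As a point of reference, the sparse matrix-vector product that dominates such a solver has the following structure; the explicit row loop mirrors the form one decorates with OpenACC directives in the Fortran source, but this Python version is only an illustration, not the ESSENSE code.

```python
# Sketch of a CSR sparse matrix-vector product. The outer row loop is the
# kind of loop an OpenACC directive (e.g. "!$acc parallel loop" in Fortran)
# would distribute across the GPU.
import numpy as np
import scipy.sparse as sp

def spmv_csr(indptr, indices, data, x):
    y = np.zeros(len(indptr) - 1)
    for row in range(len(y)):                      # parallelized on the GPU
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y

A = sp.random(100, 100, density=0.05, format='csr', random_state=0)
x = np.ones(100)
assert np.allclose(spmv_csr(A.indptr, A.indices, A.data, x), A @ x)
```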
We present a case study of porting NekBone, a skeleton version of the Nek5000 code, to a parallel GPU-accelerated system. Nek5000 is a computational fluid dynamics code based on the spectral element method, used for the simulation of incompressible flow. The original NekBone Fortran source code has been used as the base and enhanced with OpenACC directives. Profiling of NekBone provided an assessment of the suitability of the code for GPU systems and indicated possible kernel optimizations. Porting NekBone to GPU systems required little effort and a small number of additional lines of code (approximately one OpenACC directive per 1000 lines of code). The naïve OpenACC implementation yields little performance improvement: on a single node, we go from 16 Gflops without OpenACC to 20 Gflops with the naïve OpenACC version. An optimized NekBone version reaches 43 Gflops on a single node. In addition, we ported and optimized NekBone for parallel GPU systems, reaching a parallel efficiency of 79.9% on 1024 GPUs of the Cray XK7 Titan supercomputer at the Oak Ridge National Laboratory.
Since the 1950s, with the introduction of Nelson's first semi-planing hull, which made it possible to navigate comfortably at speeds higher than those of traditional hulls, and with the subsequent availability of more powerful engines, speeds corresponding to Froude numbers (Fn) greater than 0.6, which define planing hulls, have been reached. A clear distinction was thus created between displacement and planing hulls with respect to performance. The need for faster displacement vessels has pushed ship design towards increasingly high-performance hulls, also through the use of lightweight materials such as aluminum and more powerful engines, but without substantially changing the traditional hull forms. The patented Monotricat hull, with high hydrodynamic efficiency and energy saving, overcomes this distinction between displacement and planing hulls because, unlike previous solutions, it is the first hull that combines the characteristics of displacement and planing hulls: it presents an innovative architecture that could be described as a hybrid between a monohull and a catamaran, navigating on self-produced spray. This presentation shows how the Monotricat hull is the first displacement hull that can navigate at both displacement and planing speeds, with an almost straight resistance curve, while maintaining the characteristics of a displacement hull. For these reasons the Monotricat hull is able to ensure safety, comfortable navigation, good seakeeping and maneuverability in restricted waters, stability, reduced resistance to motion, manageable costs, and regularity on routes even in adverse weather and sea conditions. These characteristics of the hull have been studied, tested, and validated by leading research institutes and universities, with improved results in each subsequent experiment, as reported in the present work, demonstrating a hydrodynamic efficiency gain approaching 20% compared to conventional hulls.
We discuss how to combine the node-based unstructured finite volume method, widely used to handle complex geometries and nonlinear phenomena, with very efficient high order finite difference methods suitable for wave-propagation-dominated problems. This fully coupled numerical procedure reflects the coupled character of the sound generation and propagation problem. The coupling procedure is based on energy estimates, and stability can be guaranteed. Numerical experiments that shed light on the theoretical results are performed.
A stable hybrid method for hyperbolic problems that combines the unstructured finite volume method with high-order finite difference methods has been developed. The coupling procedure is based on energy estimates and stability can be guaranteed. Numerical calculations verify that the hybrid method is efficient and accurate.
A stable and conservative high order multi-block method for the time-dependent compressible Navier-Stokes equations has been developed. Stability and conservation are proved using summation-by-parts operators, weak interface conditions and the energy method. This development makes it possible to exploit the efficiency of the high order finite difference method for non-trivial geometries. The computational results corroborate the theoretical analysis.
We show how a stable and accurate hybrid procedure for fluid flow can be constructed. Two separate solvers, one using high order finite difference methods and another using the node-centered unstructured finite volume method are coupled in a truly stable way. The two flow solvers run independently and receive and send information from each other by using a third coupling code. Exact solutions to the Euler equations are used to verify the accuracy and stability of the new computational procedure. We also demonstrate the capability of the new procedure in a calculation of the flow in and around a model of a coral.
The present work is targeted at performing a strong scaling study of the high-order spectral element fluid dynamics solver Nek5000. Prior studies such as [5] indicated a recommendable metric for strong scalability from a theoretical viewpoint, which we test here extensively on three parallel machines with different performance characteristics and interconnect networks, namely Mira (IBM Blue Gene/Q), Beskow (Cray XC40) and Titan (Cray XK7). The test cases considered for the simulations correspond to a turbulent flow in a straight pipe at four different friction Reynolds numbers Reτ = 180, 360, 550 and 1000. Considering the linear model for parallel communication we quantify the machine characteristics in order to better assess the scaling behaviors of the code. Subsequently sampling and profiling tools are used to measure the computation and communication times over a large range of compute cores. We also study the effect of the two coarse grid solvers XXT and AMG on the computational time. Super-linear scaling due to a reduction in cache misses is observed on each computer. The strong scaling limit is attained for roughly 5,000-10,000 degrees of freedom per core on Mira and 30,000-50,000 on Beskow, with only a small impact of the problem size for both machines, and ranges between 10,000 and 220,000 depending on the problem size on Titan. This work aims at being a reference for Nek5000 users and also serves as a basis for potential issues to address as the community heads towards exascale supercomputers.
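The linear communication model referred to above can be sketched as follows (our notation, as an illustration of the standard latency-bandwidth form rather than the paper's exact expressions):

```latex
% Sketch, in our notation, of the linear communication model: the time to
% transfer a message of m words is
\[
  t_{c}(m) = t_{\ell} + m\,t_{b},
\]
% with latency $t_{\ell}$ and inverse bandwidth $t_{b}$, so the time per
% step on $P$ ranks behaves roughly like
\[
  T(P) \approx \frac{c\,N}{P}\,t_{a} + k\left(t_{\ell} + m(P)\,t_{b}\right),
\]
% where $N$ is the total number of grid points, $t_{a}$ the time per
% floating-point operation, $k$ the number of messages, and $m(P)$ the
% message size. Strong scaling saturates once the latency term dominates
% the shrinking $N/P$ work term, which is what sets the per-core
% degrees-of-freedom limits quoted above.
```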
Due to its high performance and throughput capabilities, GPU-accelerated computing is becoming a popular technology in scientific computing, in particular through programming models such as CUDA and OpenACC. The main advantage of OpenACC is that it enables codes to be ported in their "original" form to GPU systems through compiler directives, thus allowing an incremental approach. An OpenACC implementation is applied to the CFD code Nek5000, which simulates incompressible flows based on the spectral-element method. The work follows up on previous implementations and now focuses on the PN-PN-2 method for the spatial discretization of the Navier-Stokes equations. Performance results of the ported code show a speedup of up to 3.1 on multiple GPUs for polynomial orders N > 11.
We present performance results and an analysis of a message passing interface (MPI)/OpenACC implementation of an electromagnetic solver based on a spectral-element discontinuous Galerkin discretization of the time-dependent Maxwell equations. The OpenACC implementation covers all solution routines, including a highly tuned element-by-element operator evaluation and a GPUDirect gather-scatter kernel to effect nearest neighbor flux exchanges. Modifications are designed to make effective use of vectorization, streaming, and data management. Performance results using up to 16,384 graphics processing units of the Cray XK7 supercomputer Titan show more than 2.5x speedup over central processing unit-only performance on the same number of nodes (262,144 MPI ranks) for problem sizes of up to 6.9 billion grid points. We discuss performance-enhancement strategies and the overall potential of GPU-based computing for this class of problems.
Node-centred edge-based finite volume approximations are very common in computational fluid dynamics since they are assumed to work on structured, unstructured and even mixed grids. We analyse the accuracy properties of both first and second derivative approximations and conclude that these schemes cannot be used on arbitrary grids, as is often assumed. For the Euler equations, first-order accuracy can be obtained if care is taken when constructing the grid. For the Navier-Stokes equations, the grid restrictions are so severe that these finite volume schemes have little advantage over structured finite difference schemes. Our theoretical results are verified through extensive computations.
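The node-centred edge-based first-derivative approximation analysed above has the following generic form (our notation, as an illustration of the scheme class rather than the paper's exact statement):

```latex
% Generic node-centred edge-based first derivative: at node $i$ with
% dual-cell volume $V_{i}$,
\[
  \left(\frac{\partial u}{\partial x}\right)_{i} \approx
  \frac{1}{V_{i}} \sum_{k \in \mathcal{N}(i)} \frac{u_{i} + u_{k}}{2}\,
  n^{x}_{ik}\, s_{ik},
\]
% where the sum runs over the edge neighbors $k$ of node $i$, and
% $n^{x}_{ik} s_{ik}$ is the $x$-component of the dual-face normal times
% the face area. On sufficiently regular grids this is second-order
% accurate; on arbitrary grids the accuracy degrades, which is the source
% of the grid restrictions discussed above.
```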
Our objective is to derive stable first-, second- and fourth-order artificial dissipation operators for node-based finite volume schemes. Of particular interest are general unstructured grids, where the strength of the finite volume method is fully utilised. A commonly used finite volume approximation of the Laplacian is the basis for the construction of the artificial dissipation. Both a homogeneous dissipation acting in all directions with equal strength and a modification that allows different amounts of dissipation in different directions are derived. Stability and accuracy of the new operators are proved, and the theoretical results are supported by numerical computations.
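A minimal sketch of the simplest (homogeneous, equal strength in all directions) dissipation built from an edge-based Laplacian follows; the coefficient and scaling are illustrative and the actual operators in the paper carry additional geometric weights and accuracy corrections.

```python
# Sketch of Laplacian-based artificial dissipation on an edge graph:
# each edge (i, k) contributes the undivided difference u_k - u_i with
# equal strength in all directions. Coefficient and scaling illustrative.
import numpy as np

def laplacian_dissipation(u, edges, eps):
    """u: nodal values; edges: array of (i, k) node pairs; eps: coefficient."""
    d = np.zeros_like(u)
    for i, k in edges:
        diff = u[k] - u[i]
        d[i] += eps * diff     # equal and opposite contributions keep the
        d[k] -= eps * diff     # operator conservative (edge terms telescope)
    return d

# 1-D chain of 5 nodes; the dissipation vanishes on linear data away from
# the endpoints, as a second-difference operator should.
edges = np.array([(i, i + 1) for i in range(4)])
print(laplacian_dissipation(np.arange(5.0), edges, eps=0.1))
```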
In this paper, an automated process for performing Reynolds-Averaged Navier-Stokes (RANS) computational fluid dynamics (CFD) analysis is developed to carry out aerodynamic design and optimization. The aircraft model/geometry is defined by a Common Parametric Aircraft Configuration Schema (CPACS) file, and the analyses are run on high performance computing (HPC) systems. As the computational capability of the available HPC systems limits the complexity of the analyses that can be performed, a detailed performance analysis of the open-source CFD code SU2 is undertaken, with profiling of large simulations.
Automatic multidisciplinary design optimization is one of the challenges faced in designing efficient aircraft wings. In this paper we present mixed-fidelity aerodynamic and aero-structural optimization methods for designing wings. A novel shape design methodology has been developed: it is based on a mix of automatic aerodynamic optimization for a reference aircraft model and aero-structural optimization for an uninhabited air vehicle (UAV) with a high-aspect-ratio wing. This paper is a significant step towards performing, in a fully automatic manner, all the core processes for aerodynamic and aero-structural optimization that traditionally require special skills, from creating the mesh for the wing simulation to executing the high-fidelity computational fluid dynamics (CFD) analysis code. Our results confirm that the simulation tools can enable a far broader range of engineering researchers and developers to design aircraft in much simpler and more efficient ways. This is a vital step in the evolution of wing design processes, as it means that the extremely expensive laboratory experiments traditionally used when designing wings can now be replaced with more cost-effective high performance computing (HPC) simulations that utilize accurate numerical methods.