kth.se Publications
  • 1.
    Abraham, Mark James
    et al.
    KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Apostolov, Rossen
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Barnoud, Jonathan
    Univ Groningen, NL-9712 CP Groningen, Netherlands.;Univ Bristol, Intangible Real Lab, Bristol, Avon, England..
    Bauer, Paul
    KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Blau, Christian
    KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Bonvin, Alexandre M. J. J.
    Univ Utrecht, Bijvoet Ctr, Fac Sci, Utrecht, Netherlands..
    Chavent, Matthieu
    Univ Paul Sabatier, IPBS, F-31062 Toulouse, France..
    Chodera, John
    Mem Sloan Kettering Canc Ctr, Sloan Kettering Inst, Computat & Syst Biol Program, New York, NY 10065 USA..
    Condic-Jurkic, Karmen
    Mem Sloan Kettering Canc Ctr, Sloan Kettering Inst, Computat & Syst Biol Program, New York, NY 10065 USA.;Open Force Field Consortium, La Jolla, CA USA..
    Delemotte, Lucie
    KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Grubmueller, Helmut
    Max Planck Inst Biophys Chem, D-37077 Gottingen, Germany..
    Howard, Rebecca
    KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Jordan, E. Joseph
    Stockholm Univ, Dept Biochem & Biophys, Sci Life Lab, Box 1031, SE-17121 Solna, Sweden..
    Lindahl, Erik
    KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Ollila, O. H. Samuli
    Univ Helsinki, Inst Biotechnol, SF-00100 Helsinki, Finland..
    Selent, Jana
    Pompeu Fabra Univ, Hosp del Mar Med Res Inst IMIM, Res Programme Biomed Informat, Barcelona 08002, Spain.;Pompeu Fabra Univ, Dept Expt & Hlth Sci, Barcelona 08002, Spain..
    Smith, Daniel G. A.
    Mol Sci Software Inst, Blacksburg, VA 24060 USA..
    Stansfeld, Phillip J.
    Univ Oxford, Dept Biochem, Oxford OX1 2JD, England.;Univ Warwick, Sch Life Sci, Coventry CV4 7AL, W Midlands, England.;Univ Warwick, Dept Chem, Coventry CV4 7AL, W Midlands, England..
    Tiemann, Johanna K. S.
    Univ Leipzig, Fac Med, Inst Med Phys & Biophys, D-04107 Leipzig, Germany..
    Trellet, Mikael
    Univ Utrecht, Bijvoet Ctr, Fac Sci, Utrecht, Netherlands..
    Woods, Christopher
    Univ Bristol, Bristol BS8 1TH, Avon, England..
    Zhmurov, Artem
    KTH, Centres, Science for Life Laboratory, SciLifeLab.
    Sharing Data from Molecular Simulations. 2019. In: Journal of Chemical Information and Modeling, ISSN 1549-9596, E-ISSN 1549-960X, Vol. 59, no 10, p. 4093-4099. Article in journal (Refereed)
    Abstract [en]

    Given the need for modern researchers to produce open, reproducible scientific output, the lack of standards and best practices for sharing data and workflows used to produce and analyze molecular dynamics (MD) simulations has become an important issue in the field. There are now multiple well-established packages to perform molecular dynamics simulations, often highly tuned for exploiting specific classes of hardware, each with strong communities surrounding them, but with very limited interoperability/transferability options. Thus, the choice of the software package often dictates the workflow for both simulation production and analysis. The level of detail in documenting the workflows and analysis code varies greatly in published work, hindering reproducibility of the reported results and the ability for other researchers to build on these studies. An increasing number of researchers are motivated to make their data available, but many challenges remain in order to effectively share and reuse simulation data. To discuss these and other issues related to best practices in the field in general, we organized a workshop in November 2018 (https://bioexcel.eu/events/workshop-on-sharing-data-from-molecular-simulations/). Here, we present a brief overview of this workshop and topics discussed. We hope this effort will spark further conversation in the MD community to pave the way toward more open, interoperable, and reproducible outputs coming from research studies using MD simulations.

  • 2.
    Aguilar, Xavier
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Performance Monitoring, Analysis, and Real-Time Introspection on Large-Scale Parallel Systems. 2020. Doctoral thesis, monograph (Other academic)
    Abstract [en]

    High-Performance Computing (HPC) has become an important scientific driver. A wide variety of research ranging for example from drug design to climate modelling is nowadays performed in HPC systems. Furthermore, the tremendous computer power of such HPC systems allows scientists to simulate problems that were unimaginable a few years ago. However, the continuous increase in size and complexity of HPC systems is turning the development of efficient parallel software into a difficult task. Therefore, the use of performance monitoring and analysis is a must in order to unveil inefficiencies in parallel software. Nevertheless, performance tools also face challenges as a result of the size of HPC systems, for example, coping with huge amounts of performance data generated.

    In this thesis, we propose a new model for performance characterisation of MPI applications that tackles the challenge of big performance data sets. Our approach uses Event Flow Graphs to balance the scalability of profiling techniques (generating performance reports with aggregated metrics) with the richness of information of tracing methods (generating files with sequences of time-stamped events). In other words, graphs allow us to encode ordered sequences of events without storing the whole sequence of such events, and therefore, they need much less memory and disk space, and are more scalable. We demonstrate in this thesis how our Event Flow Graph model can be used as a trace compression method. Furthermore, we propose a method to automatically detect the structure of MPI applications using our Event Flow Graphs. This knowledge can afterwards be used to collect performance data in a smarter way, reducing for example the amount of redundant data collected. Finally, we demonstrate that our graphs can be used beyond trace compression and automatic analysis of performance data. We propose a new methodology to use Event Flow Graphs in the task of visual performance data exploration.

    In addition to the Event Flow Graph model, we also explore in this thesis the design and use of performance data introspection frameworks. Future HPC systems will be very dynamic environments providing extreme levels of parallelism, but with energy constraints, considerable resource sharing, and heterogeneous hardware. Thus, the use of real-time performance data to orchestrate program execution in such a complex and dynamic environment will be a necessity. This thesis presents two different performance data introspection frameworks that we have implemented. These introspection frameworks are easy to use, and provide performance data in real time with very low overhead. We demonstrate, among other things, how our approach can be used to reduce in real time the energy consumed by the system.

    The approaches proposed in this thesis have been validated in different HPC systems using multiple scientific kernels as well as real scientific applications. The experiments show that our approaches in performance characterisation and performance data introspection are not intrusive at all, and can be a valuable contribution to help in the performance monitoring of future HPC systems.

  • 3.
    Aguilar, Xavier
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    A Deep Learning-Based Particle-in-Cell Method for Plasma Simulations. 2021. In: 2021 IEEE International Conference On Cluster Computing (CLUSTER 2021), Institute of Electrical and Electronics Engineers (IEEE), 2021, p. 692-697. Conference paper (Refereed)
    Abstract [en]

    We design and develop a new Particle-in-Cell (PIC) method for plasma simulations using Deep-Learning (DL) to calculate the electric field from the electron phase space. We train a Multilayer Perceptron (MLP) and a Convolutional Neural Network (CNN) to solve the two-stream instability test. We verify that the DL-based MLP PIC method produces the correct results using the two-stream instability: the DL-based PIC provides the expected growth rate of the two-stream instability. The DL-based PIC does not conserve the total energy and momentum. However, the DL-based PIC method is stable against the cold-beam instability, which affects traditional PIC methods. This work shows that integrating DL technologies into traditional computational methods is a viable approach for developing next-generation PIC algorithms.

  • 4.
    Ahlin, Daniel
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Bauermeister, Boris
    Stockholm Univ, Dept Phys, Oskar Klein Ctr, Stockholm, Sweden..
    Conrad, Jan
    Stockholm Univ, Dept Phys, Oskar Klein Ctr, Stockholm, Sweden..
    Gardner, Robert
    Univ Chicago, Enrico Fermi Inst, 5640 S Ellis Ave, Chicago, IL 60637 USA..
    Grandi, Luca
    Univ Chicago, Dept Phys, Chicago, IL 60637 USA.;Univ Chicago, Kavli Inst Cosmol Phys, Chicago, IL 60637 USA..
    Riedel, Benedikt
    Univ Chicago, Enrico Fermi Inst, 5640 S Ellis Ave, Chicago, IL 60637 USA..
    Shockley, Evan
    Univ Chicago, Dept Phys, Chicago, IL 60637 USA.;Univ Chicago, Kavli Inst Cosmol Phys, Chicago, IL 60637 USA..
    Stephen, Judith
    Univ Chicago, Enrico Fermi Inst, 5640 S Ellis Ave, Chicago, IL 60637 USA..
    Sundblad, Ragnar
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Thapa, Suchandra
    Univ Chicago, Enrico Fermi Inst, 5640 S Ellis Ave, Chicago, IL 60637 USA..
    Tunnell, Christopher
    Univ Chicago, Dept Phys, Chicago, IL 60637 USA.;Univ Chicago, Kavli Inst Cosmol Phys, Chicago, IL 60637 USA..
    The XENON1T Data Distribution and Processing Scheme. 2019. In: 23rd International Conference on Computing in High Energy and Nuclear Physics (CHEP) / [ed] Forti, A.; Betev, L.; Litmaath, M.; Smirnova, O.; Hristov, P., EDP Sciences, 2019, Vol. 214, article id 03015. Conference paper (Refereed)
    Abstract [en]

    The XENON experiment is looking for non-baryonic particle dark matter in the universe. The setup is a dual phase time projection chamber (TPC) filled with 3200 kg of ultra-pure liquid xenon. The setup is operated at the Laboratori Nazionali del Gran Sasso (LNGS) in Italy. We present a full overview of the computing scheme for data distribution and job management in XENON1T. The software package Rucio, which is developed by the ATLAS collaboration, facilitates data handling on Open Science Grid (OSG) and European Grid Infrastructure (EGI) storage systems. A tape copy at the Centre for High Performance Computing (PDC) is managed by the Tivoli Storage Manager (TSM). Data reduction and Monte Carlo production are handled by CI Connect which is integrated into the OSG network. The job submission system connects resources at the EGI, OSG, SDSC's Comet, and the campus HPC resources for distributed computing. The previous success in the XENON1T computing scheme is also the starting point for its successor experiment XENONnT, which starts to take data in autumn 2019.

  • 5.
    Al Ahad, Muhammed Abdullah
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Simmendinger, Christian
    T Syst Solut Res GmbH, D-70563 Stuttgart, Germany..
    Iakymchuk, Roman
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Laure, Erwin
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Efficient Algorithms for Collective Operations with Notified Communication in Shared Windows. 2018. In: PROCEEDINGS OF PAW-ATM18: 2018 IEEE/ACM PARALLEL APPLICATIONS WORKSHOP, ALTERNATIVES TO MPI (PAW-ATM), IEEE, 2018, p. 1-10. Conference paper (Refereed)
    Abstract [en]

    Collective operations are commonly used in various parts of scientific applications. Especially in strong scaling scenarios, collective operations can negatively impact overall application performance: while the load per rank decreases with increasing core counts, the time spent in, e.g., barrier operations increases logarithmically with the core count. In this article, we develop novel algorithmic solutions for collective operations such as Allreduce and Allgather(V) by leveraging notified communication in shared windows. To this end, we have developed an extension of GASPI which enables all ranks participating in a shared window to observe the entire notified communication targeted at the window. By exploring the benefits of this extension, we deliver high-performing implementations of Allreduce and Allgather(V) on Intel and Cray clusters. These implementations clearly achieve 2x-4x performance improvements compared to the best performing MPI implementations for various data distributions.

  • 6.
    Alam, Sadaf R.
    et al.
    Swiss Fed Inst Technol, Swiss Natl Supercomp Ctr CSCS, Zurich, Switzerland..
    Bartolome, Javier
    Barcelona Supercomp Ctr BSC, Barcelona, Spain..
    Carpene, Michele
    Italian Supercomp Ctr CINECA, Casalecchio Di Reno, Italy..
    Happonen, Kalle
    Finnish Supercomp Ctr CSC, Espoo, Finland..
    Lafoucriere, Jacques-Charles
    Commissariat Energie Atom & Energies Alternat CEA, Paris, France..
    Pleiter, Dirk
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC. KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science. Juelich Supercomp Ctr JSC, Julich, Germany..
    Fenix: A Pan-European Federation of Supercomputing and Cloud e-Infrastructure Services. 2022. In: Communications of the ACM, ISSN 0001-0782, E-ISSN 1557-7317, Vol. 65, no 4, p. 46-47. Article in journal (Other academic)
  • 7.
    Alekseenko, Andrej
    et al.
    KTH, School of Engineering Sciences (SCI), Applied Physics, Biophysics.
    Pall, Szilard
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Lindahl, Erik
    KTH, School of Engineering Sciences (SCI), Applied Physics, Biophysics.
    Experiences with Adding SYCL Support to GROMACS. 2021. In: IWOCL'21: Proceedings International Workshop on OpenCL IWOCL 2021, Association for Computing Machinery (ACM), 2021. Conference paper (Refereed)
    Abstract [en]

    GROMACS is an open-source, high-performance molecular dynamics (MD) package primarily used for biomolecular simulations, accounting for 5% of HPC utilization worldwide. Due to the extreme computing needs of MD, significant efforts are invested in improving the performance and scalability of simulations. Target hardware ranges from supercomputers to laptops of individual researchers and volunteers of distributed computing projects such as Folding@Home. The code has been designed both for portability and performance by explicitly adapting algorithms to SIMD and data-parallel processors. A SIMD intrinsic abstraction layer provides high CPU performance. Explicit GPU acceleration has long used CUDA to target NVIDIA devices and OpenCL for AMD/Intel devices. In this talk, we discuss the experiences and challenges of adding support for the SYCL platform into the established GROMACS codebase and share experiences and considerations in porting and optimization. While OpenCL offers the benefits of using the same code to target different hardware, it suffers from several drawbacks that add significant development friction. Its separate-source model leads to code duplication and makes changes complicated. The need to use C99 for kernels, while the rest of the codebase uses C++17, exacerbates these issues. Another problem is that OpenCL, while supported by most GPU vendors, is never the main framework and thus is not getting the primary support or tuning efforts. SYCL alleviates many of these issues, employing a single-source model based on the modern C++ standard. In addition to being the primary platform for Intel GPUs, the possibility to target AMD and NVIDIA GPUs through other implementations (e.g., hipSYCL) might make it possible to reduce the number of separate GPU ports that have to be maintained. 
    Some design differences from OpenCL, such as flow directed acyclic graphs (DAGs) instead of in-order queues, made it necessary to reconsider GROMACS's task scheduling approach and architectural choices in the GPU backend. Additionally, supporting multiple GPU platforms presents a challenge of balancing performance (low-level and hardware-specific code) and maintainability (more generalization and code reuse). We will discuss the limitations of the existing codebase and interoperability layers with regard to adding the new platform; the compute performance and latency comparisons; code quality considerations; and the issues we encountered with the SYCL implementations tested. Finally, we will discuss our goals for the next release cycle for the SYCL backend and the overall architecture of GPU acceleration code in GROMACS.

  • 8.
    Asquith, Nathan L.
    et al.
    Univ Leeds, Leeds Inst Cardiovasc & Metab Med, Sch Med, Discovery & Translat Sci Dept, Leeds, England.;Boston Childrens Hosp, Harvard Med Sch, Vasc Biol Program, Karp Res Labs, Boston, MA USA..
    Duval, Cedric
    Univ Leeds, Leeds Inst Cardiovasc & Metab Med, Sch Med, Discovery & Translat Sci Dept, Leeds, England..
    Zhmurov, Artem
    KTH, Centres, Science for Life Laboratory, SciLifeLab. KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC. EuroCC Natl Competence Ctr Sweden, Stockholm, Sweden.
    Baker, Stephen R.
    Univ Leeds, Leeds Inst Cardiovasc & Metab Med, Sch Med, Discovery & Translat Sci Dept, Leeds, England..
    McPherson, Helen R.
    Univ Leeds, Leeds Inst Cardiovasc & Metab Med, Sch Med, Discovery & Translat Sci Dept, Leeds, England..
    Domingues, Marco M.
    Univ Leeds, Leeds Inst Cardiovasc & Metab Med, Sch Med, Discovery & Translat Sci Dept, Leeds, England.;Univ Lisbon, Inst Mol Med, Fac Med, Lisbon, Portugal..
    Connell, Simon D. A.
    Univ Leeds, Sch Phys & Astron, Mol & Nanoscale Phys Grp, Leeds, England..
    Barsegov, Valeri
    Univ Massachusetts, Dept Chem, Lowell, MA USA..
    Ariens, Robert A. S.
    Univ Leeds, Leeds Inst Cardiovasc & Metab Med, Sch Med, Discovery & Translat Sci Dept, Leeds, England.;Univ Leeds, Leeds Inst Cardiovasc & Metab Med, Discovery & Translat Sci Dept, Leeds LS2 9JT, England..
    Fibrin protofibril packing and clot stability are enhanced by extended knob-hole interactions and catch-slip bonds. 2022. In: Blood Advances, ISSN 2473-9529, Vol. 6, no 13, p. 4015-4027. Article in journal (Refereed)
    Abstract [en]

    Fibrin polymerization involves thrombin-mediated exposure of knobs on one monomer that bind to holes available on another, leading to the formation of fibers. In silico evidence has suggested that the classical A:a knob-hole interaction is enhanced by surrounding residues not directly involved in the binding pocket of hole a, via noncovalent interactions with knob A. We assessed the importance of extended knob-hole interactions by performing biochemical, biophysical, and in silico modeling studies on recombinant human fibrinogen variants with mutations at residues responsible for the extended interactions. Three single fibrinogen variants, γD297N, γE323Q, and γK356Q, and a triple variant γDEK (γD297N/γE323Q/γK356Q) were produced in a CHO (Chinese Hamster Ovary) cell expression system. Longitudinal protofibril growth probed by atomic force microscopy was disrupted for γD297N and enhanced for the γK356Q mutation. Initial polymerization rates were reduced for all variants in turbidimetric studies. Laser scanning confocal microscopy showed that γDEK and γE323Q produced denser clots, whereas γD297N and γK356Q were similar to wild type. Scanning electron microscopy and light scattering studies showed that fiber thickness and protofibril packing of the fibers were reduced for all variants. Clot viscoelastic analysis showed that only γDEK was more readily deformable. In silico modeling suggested that most variants displayed only slip-bond dissociation kinetics compared with biphasic catch-slip kinetics characteristic of wild type. These data provide new evidence for the role of extended interactions in supporting the classical knob-hole bonds involving catch-slip behavior in fibrin formation, clot structure, and clot mechanics.

  • 9.
    Atzori, Marco
    et al.
    KTH, Centres, SeRC - Swedish e-Science Research Centre. KTH, School of Engineering Sciences (SCI), Centres, Linné Flow Center, FLOW.
    Köpp, Wiebke
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Chien, Wei Der
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Theoretical Computer Science, TCS. KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Massaro, Daniele
    KTH, School of Engineering Sciences (SCI), Engineering Mechanics.
    Mallor, Fermin
    KTH, School of Engineering Sciences (SCI), Centres, Linné Flow Center, FLOW.
    Peplinski, Adam
    KTH, School of Engineering Sciences (SCI), Centres, Linné Flow Center, FLOW. KTH, Centres, SeRC - Swedish e-Science Research Centre.
    Rezaei, Mohammadtaghi
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Vinuesa, Ricardo
    KTH, School of Engineering Sciences (SCI), Centres, Linné Flow Center, FLOW.
    Laure, E.
    Schlatter, Philipp
    KTH, School of Engineering Sciences (SCI), Centres, Linné Flow Center, FLOW.
    Weinkauf, Tino
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    In situ visualization of large-scale turbulence simulations in Nek5000 with ParaView Catalyst. 2022. In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 78, no 3, p. 3605-3620. Article in journal (Refereed)
    Abstract [en]

    In situ visualization on high-performance computing systems enables analyses of simulation results that would otherwise be impossible, given the size of the simulation data sets and offline post-processing execution time. We develop an in situ adaptor for ParaView Catalyst and Nek5000, a massively parallel Fortran and C code for computational fluid dynamics. We perform a strong scalability test up to 2048 cores on KTH’s Beskow Cray XC40 supercomputer and assess in situ visualization’s impact on the Nek5000 performance. In our study case, a high-fidelity simulation of turbulent flow, we observe that in situ operations significantly limit the strong scalability of the code, reducing the relative parallel efficiency to only ≈ 21 % on 2048 cores (the relative efficiency of Nek5000 without in situ operations is ≈ 99 %). Through profiling with Arm MAP, we identified a bottleneck in the image composition step (that uses the Radix-kr algorithm) where a majority of the time is spent on MPI communication. We also identified an imbalance of in situ processing time between rank 0 and all other ranks. In our case, better scaling and load-balancing in the parallel image composition would considerably improve the performance of Nek5000 with in situ capabilities. In general, the result of this study highlights the technical challenges posed by the integration of high-performance simulation codes and data-analysis libraries and their practical use in complex cases, even when efficient algorithms already exist for a certain application scenario.

  • 10.
    Batelaan, M.
    et al.
    Horsley, R.
    Nakamura, Y.
    Perlt, H.
    Pleiter, Dirk
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Rakow, P. E. L.
    Schierholz, G.
    Stüben, H.
    Young, R. D.
    Zanotti, J. M.
    QCDSF-UKQCD-CSSM Collaboration
    Nucleon Form Factors from the Feynman-Hellmann Method in Lattice QCD. 2022. In: Proceedings of Science, Sissa Medialab Srl, 2022. Conference paper (Refereed)
    Abstract [en]

    Lattice QCD calculations of the nucleon electromagnetic form factors are of interest at both the high and low momentum transfer regions. For high momentum transfers especially there are open questions which require more intense study, such as the potential zero crossing in the proton's electric form factor. We will present recent progress from the QCDSF/UKQCD/CSSM collaboration on the calculation of these form factors using the Feynman-Hellmann method in lattice QCD. The Feynman-Hellmann method allows for greater control over excited states which we take advantage of by going to high values of the momentum transfer. In this proceeding we present results of the form factors up to 6 GeV², using N_f = 2 + 1 flavour fermions for three different pion masses in the range 310-470 MeV. The results are extrapolated to the physical pion mass through the use of a flavour breaking expansion.

  • 11.
    Bickerton, J. M.
    et al.
    Cooke, A. N.
    Horsley, R.
    Nakamura, Y.
    Perlt, H.
    Pleiter, Dirk
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Rakow, P. E. L.
    Schierholz, G.
    Stüben, H.
    Young, R. D.
    Zanotti, J. M.
    Patterns of flavour symmetry breaking in hadron matrix elements involving u, d and s quarks. 2022. In: Proceedings of Science, Sissa Medialab Srl, 2022. Conference paper (Refereed)
    Abstract [en]

    Using an SU(3)-flavour symmetry breaking expansion between the strange and light quark masses, we determine how this constrains the extrapolation of baryon octet matrix elements and form factors. In particular we can construct certain combinations, which fan out from the symmetric point (when all the quark masses are degenerate) to the point where the light and strange quarks take their physical values. As a further example we consider the vector amplitude at zero momentum transfer for flavour changing currents.

  • 12.
    Borisov, Vladislav
    et al.
    Uppsala Univ, Dept Phys & Astron, Box 516, SE-75120 Uppsala, Sweden..
    Xu, Qichen
    KTH, School of Biotechnology (BIO), Centres, Albanova VinnExcellence Center for Protein Technology, ProNova. KTH, School of Engineering Sciences (SCI), Applied Physics. KTH, Centres, SeRC - Swedish e-Science Research Centre.
    Ntallis, Nikolaos
    Uppsala Univ, Dept Phys & Astron, Box 516, SE-75120 Uppsala, Sweden..
    Clulow, Rebecca
    Uppsala Univ, Dept Chem, Box 538, SE-75121 Uppsala, Sweden..
    Shtender, Vitalii
    Uppsala Univ, Dept Chem, Box 538, SE-75121 Uppsala, Sweden..
    Cedervall, Johan
    Stockholm Univ, Dept Mat & Environm Chem, SE-10691 Stockholm, Sweden..
    Sahlberg, Martin
    Uppsala Univ, Dept Chem, Box 538, SE-75121 Uppsala, Sweden..
    Wikfeldt, Kjartan Thor
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Thonig, Danny
    Uppsala Univ, Dept Phys & Astron, Box 516, SE-75120 Uppsala, Sweden.;Örebro Univ, Sch Sci & Technol, SE-70182 Örebro, Sweden..
    Pereiro, Manuel
    Uppsala Univ, Dept Phys & Astron, Box 516, SE-75120 Uppsala, Sweden..
    Bergman, Anders
    Uppsala Univ, Dept Phys & Astron, Box 516, SE-75120 Uppsala, Sweden..
    Delin, Anna
    KTH, Centres, SeRC - Swedish e-Science Research Centre. KTH, School of Biotechnology (BIO), Centres, Albanova VinnExcellence Center for Protein Technology, ProNova. KTH, School of Engineering Sciences (SCI), Applied Physics.
    Eriksson, Olle
    Uppsala Univ, Dept Phys & Astron, Box 516, SE-75120 Uppsala, Sweden.;Örebro Univ, Sch Sci & Technol, SE-70182 Örebro, Sweden..
    Tuning skyrmions in B20 compounds by 4d and 5d doping. 2022. In: Physical Review Materials, E-ISSN 2475-9953, Vol. 6, no 8, article id 084401. Article in journal (Refereed)
    Abstract [en]

    Skyrmion stabilization in novel magnetic systems with the B20 crystal structure is reported here, primarily based on theoretical results. The focus is on the effect of alloying on the 3d sublattice of the B20 structure by substitution of heavier 4d and 5d elements, with the ambition to tune the spin-orbit coupling and its influence on magnetic interactions. State-of-the-art methods based on density functional theory are used to calculate both isotropic and anisotropic exchange interactions. Significant enhancement of the Dzyaloshinskii-Moriya interaction is reported for 5d-doped FeSi and CoSi, accompanied by a large modification of the spin stiffness and spiralization. Micromagnetic simulations coupled to atomistic spin-dynamics and ab initio magnetic interactions reveal the spin-spiral nature of the magnetic ground state and field-induced skyrmions for all these systems. Especially small skyrmions of ∼50 nm are predicted for Co0.75Os0.25Si, compared to ∼148 nm for Fe0.75Co0.25Si. Convex-hull analysis suggests that all B20 compounds considered here are structurally stable at elevated temperatures and should be possible to synthesize. This prediction is confirmed experimentally by synthesis and structural analysis of the Ru-doped CoSi systems discussed here, both in powder and in single-crystal forms.

  • 13.
    Brand, Manuel
    et al.
    KTH, School of Engineering Sciences in Chemistry, Biotechnology and Health (CBH), Theoretical Chemistry and Biology. KTH Royal Inst Technol, Dept Theoret Chem & Biol, Sch Engn Sci Chem Biotechnol & Hlth, SE-10691 Stockholm, Sweden..
    Ahmadzadeh, Karan
    KTH, School of Engineering Sciences in Chemistry, Biotechnology and Health (CBH), Theoretical Chemistry and Biology. KTH Royal Inst Technol, Dept Theoret Chem & Biol, Sch Engn Sci Chem Biotechnol & Hlth, SE-10691 Stockholm, Sweden..
    Li, Xin
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC. KTH, School of Engineering Sciences in Chemistry, Biotechnology and Health (CBH), Theoretical Chemistry and Biology. KTH Royal Inst Technol, Dept Theoret Chem & Biol, Sch Engn Sci Chem Biotechnol & Hlth, SE-10691 Stockholm, Sweden..
    Rinkevicius, Zilvinas
    KTH, School of Engineering Sciences in Chemistry, Biotechnology and Health (CBH), Theoretical Chemistry and Biology. Department of Physics, Faculty of Mathematics and Natural Sciences, Kaunas University of Technology, LT-51368 Kaunas, Lithuania.
    Saidi, Wissam A.
    Univ Pittsburgh, Dept Mech Engn & Mat Sci, Pittsburgh, PA 15261 USA..
    Norman, Patrick
    KTH, School of Engineering Sciences in Chemistry, Biotechnology and Health (CBH), Theoretical Chemistry and Biology. KTH Royal Inst Technol, Dept Theoret Chem & Biol, Sch Engn Sci Chem Biotechnol & Hlth, SE-10691 Stockholm, Sweden..
    Size-dependent polarizabilities and van der Waals dispersion coefficients of fullerenes from large-scale complex polarization propagator calculations. 2021. In: Journal of Chemical Physics, ISSN 0021-9606, E-ISSN 1089-7690, Vol. 154, no 7, article id 074304. Article in journal (Refereed)
    Abstract [en]

    While the anomalous non-additive size-dependencies of static dipole polarizabilities and van der Waals C6 dispersion coefficients of carbon fullerenes are well established, the widespread reported scalings for the latter (ranging from N^2.2 to N^2.8) call for a comprehensive first-principles investigation. With a highly efficient implementation of the linear complex polarization propagator, we have performed Hartree-Fock and Kohn-Sham density functional theory calculations of the frequency-dependent polarizabilities for fullerenes consisting of up to 540 carbon atoms. Our results for the static polarizabilities and C6 coefficients show scalings of N^1.2 and N^2.2, respectively, thereby deviating significantly from the previously reported values obtained with the use of semi-classical/empirical methods. Arguably, our reported values are the most accurate to date as they represent the first ab initio or first-principles treatment of fullerenes up to a convincing system size.
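A scaling exponent such as the ones quoted in this abstract is commonly read off from a straight-line fit in log-log space. The sketch below shows that generic procedure only; it is not taken from the paper. The function name `fit_power_law`, the list of sizes, and the synthetic values (generated from an exact N^2.2 law with an arbitrary prefactor) are all assumptions made for illustration.

```python
import math


def fit_power_law(sizes, values):
    """Least-squares fit of log(value) = log(a) + b * log(size); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in values]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx
    prefactor = math.exp(mean_y - exponent * mean_x)
    return prefactor, exponent


# Synthetic data only: produced from an exact N^2.2 law, not values from the paper.
sizes = [60, 70, 84, 180, 240, 540]
synthetic_c6 = [3.0 * n ** 2.2 for n in sizes]

prefactor, exponent = fit_power_law(sizes, synthetic_c6)
print(f"fitted exponent: {exponent:.3f}")
```

With real data the fitted exponent depends on which system sizes are included, which is one reason reported scalings can differ between studies.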

  • 14.
    Brand, Manuel
    et al.
    KTH, School of Engineering Sciences in Chemistry, Biotechnology and Health (CBH), Chemistry, Theoretical Chemistry and Biology.
    Dreuw, Andreas
    Ruprecht Karls Univ Heidelberg, Interdisciplinary Ctr Sci Comp, D-69120 Heidelberg, Germany..
    Norman, Patrick
    KTH, School of Engineering Sciences in Chemistry, Biotechnology and Health (CBH), Chemistry, Theoretical Chemistry and Biology.
    Li, Xin
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Efficient and Parallel Implementation of Real and Complex Response Functions Employing the Second-Order Algebraic-Diagrammatic Construction Scheme for the Polarization Propagator. 2023. In: Journal of Chemical Theory and Computation, ISSN 1549-9618, E-ISSN 1549-9626, Vol. 20, no 1, p. 103-113. Article in journal (Refereed)
    Abstract [en]

    We present the implementation of an efficient matrix-folded formalism for the evaluation of complex response functions and the calculation of transition properties at the level of the second-order algebraic-diagrammatic construction (ADC(2)) scheme. The underlying algorithms, in combination with the adopted hybrid MPI/OpenMP parallelization strategy, enabled calculations of the UV/vis spectra of a guanine oligomer series ranging up to 1032 contracted basis functions, thereby utilizing vast computational resources from up to 32,768 CPU cores. Further analysis of the convergence behavior of the involved iterative subspace algorithms revealed the superiority of a frequency-separated treatment of response equations even for a large spectral window, including 101 frequencies. We demonstrate the applicability to general quantum mechanical operators by the first reported electronic circular dichroism spectrum calculated with a complex polarization propagator approach at the ADC(2) level of theory.

  • 15.
    Brank, Bine
    et al.
    Forschungszentrum Julich, Julich Supercomp Ctr, Julich, Germany..
    Pleiter, Dirk
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Assessing the State of Autovectorization Support based on SVE. 2022. In: 2022 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER 2022), Institute of Electrical and Electronics Engineers (IEEE), 2022, p. 556-562. Conference paper (Refereed)
    Abstract [en]

    So-called SIMD instructions, which trigger operations that process a data tuple in each clock cycle, have become widespread in modern processor architectures. In particular, processors for high-performance computing (HPC) systems rely on this additional level of parallelism to reach a high throughput of arithmetic operations. Leveraging these SIMD instructions can still be challenging for application software developers. This challenge has become simpler due to a compiler technique called auto-vectorization. In this paper, we explore the current state of the auto-vectorization capabilities of state-of-the-art compilers for a recent extension of the Arm instruction set architecture called SVE. We measure the performance gains on a recent processor architecture supporting SVE, namely the Fujitsu A64FX processor.

  • 16.
    Brocke, Ekaterina
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Djurfeldt, Mikael
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Efficient Spike Communication in the MUSIC Framework on a Blue Gene/Q Supercomputer. Manuscript (preprint) (Other (popular science, discussion, etc.))
  • 17. Camisasca, G.
    et al.
    Pathak, H.
    Wikfeldt, Kjartan Thor
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC. Department of Physics, AlbaNova University Center, Stockholm University, Stockholm, SE-10609, Sweden.
    Pettersson, L. Gunnar M.
    Radial distribution functions of water: Models vs experiments. 2019. In: Journal of Chemical Physics, ISSN 0021-9606, E-ISSN 1089-7690, Vol. 151, no 4, article id 044502. Article in journal (Refereed)
    Abstract [en]

    We study the temperature behavior of the first four peaks of the oxygen-oxygen radial distribution function of water, simulated by the TIP4P/2005, MB-pol, TIP5P, and SPC/E models and compare to experimental X-ray diffraction data, including a new measurement which extends down to 235 K [H. Pathak et al., J. Chem. Phys. 150, 224506 (2019)]. We find the overall best agreement using the MB-pol and TIP4P/2005 models. We observe, upon cooling, a minimum in the position of the second shell simulated with TIP4P/2005 and SPC/E potentials, located close to the temperature of maximum density. We also calculated the two-body entropy and the contributions coming from the first, second, and outer shells to this quantity. We show that, even if the main contribution comes from the first shell, the contribution of the second shell can become important at low temperature. While real water appears to be less ordered at short distance than obtained by any of the potentials, the different water potentials show more or less order compared to the experiments depending on the considered length-scale.
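For reference, the two-body entropy mentioned above is conventionally defined from the radial distribution function g(r) as s2/k_B = -2πρ ∫ [g ln g - g + 1] r² dr, where ρ is the number density. The sketch below evaluates this standard expression with the trapezoid rule; it is not the authors' code. The function name, the grid, and the step-function test profile are assumptions chosen so the result can be checked analytically, and they are not water data.

```python
import math


def two_body_entropy(r, g, rho):
    """Return s2 / k_B = -2*pi*rho * integral of [g ln g - g + 1] r^2 dr (trapezoid rule)."""

    def integrand(ri, gi):
        # The limit of g*ln(g) as g -> 0 is 0, so the log is skipped there.
        g_ln_g = gi * math.log(gi) if gi > 0.0 else 0.0
        return (g_ln_g - gi + 1.0) * ri * ri

    f = [integrand(ri, gi) for ri, gi in zip(r, g)]
    integral = sum(
        0.5 * (f[i] + f[i + 1]) * (r[i + 1] - r[i]) for i in range(len(r) - 1)
    )
    return -2.0 * math.pi * rho * integral


# Toy profile (an assumption, not water): g = 0 inside a core of radius 1, g = 1 outside.
dr = 0.001
r = [i * dr for i in range(3001)]
g_step = [0.0 if i < 1000 else 1.0 for i in range(3001)]
s2_step = two_body_entropy(r, g_step, rho=1.0)   # analytic value: -2*pi/3

# An ideal gas (g = 1 everywhere) carries no two-body entropy deficit.
s2_ideal = two_body_entropy(r, [1.0] * len(r), rho=1.0)
```

Restricting the integration limits to the r-range of one coordination shell gives that shell's contribution, which is how a shell-by-shell decomposition of this kind is usually obtained.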

  • 18.
    Camisasca, Gaia
    et al.
    Stockholm Univ, Dept Phys, S-10691 Stockholm, Sweden..
    Galamba, Nuno
    Univ Lisbon, Fac Sci, Ctr Chem & Biochem, C8 Campo Grande, P-1749016 Lisbon, Portugal.;Univ Lisbon, Fac Sci, Biosyst & Integrat Sci Inst, C8 Campo Grande, P-1749016 Lisbon, Portugal..
    Wikfeldt, Kjartan Thor
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC. Stockholm Univ, Dept Phys, S-10691 Stockholm, Sweden..
    Pettersson, Lars G. M.
    Stockholm Univ, Dept Phys, S-10691 Stockholm, Sweden..
    Translational and rotational dynamics of high and low density TIP4P/2005 water. 2019. In: Journal of Chemical Physics, ISSN 0021-9606, E-ISSN 1089-7690, Vol. 150, no 22, article id 224507. Article in journal (Refereed)
    Abstract [en]

    We use molecular dynamics simulations with the TIP4P/2005 model to investigate the self- and distinct-van Hove functions for different local environments of water, classified using the local structure index as an order parameter. The orientational dynamics were studied through the calculation of the time-correlation functions of different-order Legendre polynomials in the OH-bond unit vector. We found that the translational and orientational dynamics are slower for molecules in a low-density local environment and correspondingly the mobility is enhanced upon increasing the local density, consistent with some previous works, but opposite to a recent study on the van Hove function. From the analysis of the distinct dynamics, we find that the second and fourth peaks of the radial distribution function, previously identified as low density-like arrangements, show long persistence in time. The analysis of the time-dependent interparticle distance between the central molecule and the first coordination shell shows that particle identity persists longer than distinct van Hove correlations. The motion of two first-nearest-neighbor molecules thus remains coupled even when this correlation function has completely decayed. With respect to the orientational dynamics, we show that correlation functions of molecules in a low-density environment decay exponentially, while molecules in a local high-density environment exhibit bi-exponential decay, indicating that dynamic heterogeneity of water is associated with the heterogeneity among high-density and between high-density and low-density species. This bi-exponential behavior is associated with the existence of interstitial waters and the collapse of the second coordination sphere in high-density arrangements, but not with H-bond strength.

  • 19. Chien, Steven W. D.
    et al.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Sishtla, Chaitanya Prasad
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Santos, Luis
    Herman, Pawel
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Narasimhamurthy, Sai
    Laure, Erwin
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Characterizing Deep-Learning I/O Workloads in TensorFlow. 2018. In: Proceedings of PDSW-DISCS 2018: 3rd Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems, Held in conjunction with SC 2018: The International Conference for High Performance Computing, Networking, Storage and Analysis, Institute of Electrical and Electronics Engineers (IEEE), 2018, p. 54-63. Conference paper (Refereed)
    Abstract [en]

    The performance of Deep-Learning (DL) computing frameworks relies on the performance of data ingestion and checkpointing. In fact, during the training, a considerably high number of relatively small files are first loaded and pre-processed on CPUs and then moved to the accelerator for computation. In addition, checkpointing and restart operations are carried out to allow DL computing frameworks to restart quickly from a checkpoint. Because of this, I/O affects the performance of DL applications. In this work, we characterize the I/O performance and scaling of TensorFlow, an open-source programming framework developed by Google and specifically designed for solving DL problems. To measure TensorFlow I/O performance, we first design a micro-benchmark to measure TensorFlow reads, and then use a TensorFlow mini-application based on AlexNet to measure the performance cost of I/O and checkpointing in TensorFlow. To improve the checkpointing performance, we design and implement a burst buffer. We find that increasing the number of threads increases TensorFlow bandwidth by a maximum of 2.3x and 7.8x on our benchmark environments. The use of the TensorFlow prefetcher results in a complete overlap of computation on the accelerator and the input pipeline on the CPU, eliminating the effective cost of I/O on the overall performance. The use of a burst buffer to checkpoint to a fast small-capacity storage and copy the checkpoints asynchronously to a slower large-capacity storage resulted in a performance improvement of 2.6x with respect to checkpointing directly to slower storage on our benchmark environment.
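The burst-buffer idea described in this abstract (write the checkpoint to a fast tier, then drain it to a slow tier in the background) can be sketched with the Python standard library alone. This is an illustration of the general pattern, not the authors' implementation and not a TensorFlow API: the class name `BurstBufferCheckpointer`, the file name, and the use of two temporary directories as stand-ins for the fast and slow storage tiers are all assumptions.

```python
import shutil
import tempfile
import threading
from pathlib import Path


class BurstBufferCheckpointer:
    """Write checkpoints to a fast directory, then copy them to slow storage in the background."""

    def __init__(self, fast_dir, slow_dir):
        self.fast_dir = Path(fast_dir)
        self.slow_dir = Path(slow_dir)
        self._pending = []

    def save(self, name, payload):
        fast_path = self.fast_dir / name
        fast_path.write_bytes(payload)  # the caller blocks only on the fast tier
        copier = threading.Thread(
            target=shutil.copy2, args=(fast_path, self.slow_dir / name)
        )
        copier.start()  # drain to the slow tier asynchronously
        self._pending.append(copier)
        return fast_path

    def wait(self):
        """Block until every background copy has finished."""
        for copier in self._pending:
            copier.join()
        self._pending.clear()


# Two temporary directories stand in for the fast and slow tiers (hypothetical setup).
with tempfile.TemporaryDirectory() as fast, tempfile.TemporaryDirectory() as slow:
    checkpointer = BurstBufferCheckpointer(fast, slow)
    checkpointer.save("step_0001.ckpt", b"model-state")
    checkpointer.wait()
    copied = (Path(slow) / "step_0001.ckpt").read_bytes()
```

Training can resume as soon as `save` returns, so the cost of the slow tier is hidden as long as the background copy completes before the next checkpoint is due.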

  • 20.
    Chien, Steven W.D.
    et al.
    University of Edinburgh, United Kingdom.
    Sato, Kento
    RIKEN Center for Computational Science Japan.
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Software and Computer systems, SCS.
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Honda, Michio
    University of Edinburgh, United Kingdom.
    Improving Cloud Storage Network Bandwidth Utilization of Scientific Applications. 2023. In: Proceedings of the 7th Asia-Pacific Workshop on Networking, APNET 2023, Association for Computing Machinery (ACM), 2023, p. 172-173. Conference paper (Refereed)
    Abstract [en]

    Cloud providers began to provide managed services to attract scientific applications, which have been traditionally executed on supercomputers. One example is AWS FSx for Lustre, a fully managed parallel file system (PFS) released in 2018. However, due to the nature of scientific applications, the frontend storage network bandwidth is left completely idle for the majority of its lifetime. Furthermore, the pricing model does not match the scalability requirement. We propose iFast, a novel host-side caching mechanism for scientific applications that improves storage bandwidth utilization and end-to-end application performance by overlapping compute and data writeback through inexpensive local storage. iFast supports the Message Passing Interface (MPI) library that is widely used by scientific applications and is implemented as a preloaded library. It requires no change to applications, the MPI library, or support from cloud operators. We demonstrate how iFast can accelerate the end-to-end time of a representative scientific application, Neko, by 13-40%.

  • 21.
    Chien, Steven Wei Der
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Olshevsky, Vyacheslav
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Bulatov, Yaroslav
    South Pk Commons, San Francisco, CA USA..
    Laure, Erwin
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Vetter, Jeffrey S.
    Oak Ridge Natl Lab, Oak Ridge, TN USA..
    TensorFlow Doing HPC: An Evaluation of TensorFlow Performance in HPC Applications. 2019. In: 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Institute of Electrical and Electronics Engineers (IEEE), 2019, p. 509-518. Conference paper (Refereed)
    Abstract [en]

    TensorFlow is a popular emerging open-source programming framework supporting the execution of distributed applications on heterogeneous hardware. While TensorFlow was initially designed for developing Machine Learning (ML) applications, it in fact aims at supporting the development of a much broader range of application kinds that are outside the ML domain and can possibly include HPC applications. However, very few experiments have been conducted to evaluate TensorFlow performance when running HPC workloads on supercomputers. This work addresses this lack by designing four traditional HPC benchmark applications: STREAM, matrix-matrix multiply, Conjugate Gradient (CG) solver and Fast Fourier Transform (FFT). We analyze their performance on two supercomputers with accelerators and evaluate the potential of TensorFlow for developing HPC applications. Our tests show that TensorFlow can fully take advantage of high performance networks and accelerators on supercomputers. Running our TensorFlow STREAM benchmark, we obtain over 50% of theoretical communication bandwidth on our testing platform. We find an approximately 2x, 1.7x and 1.8x performance improvement when increasing the number of GPUs from two to four in the matrix-matrix multiply, CG and FFT applications respectively. All our performance results demonstrate that TensorFlow has high potential to emerge also as an HPC programming framework for heterogeneous supercomputers.
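For readers unfamiliar with the CG benchmark named above, the textbook Conjugate Gradient iteration for a symmetric positive-definite system can be sketched in a few lines. This plain-Python version is only meant to show the structure of the algorithm (one matrix-vector product and a few dot products per iteration); it is not the TensorFlow implementation evaluated in the paper, and the 2x2 system is an arbitrary example.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite matrix A (list of rows)."""
    n = len(b)

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    x = [0.0] * n
    r = list(b)          # residual b - A x for the initial guess x = 0
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(p)   # the only matrix-vector product per iteration
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Arbitrary small SPD example; the exact solution is x = [1/11, 7/11].
solution = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

In a distributed setting the matrix-vector product and the dot products are the steps that require communication, which is why CG is a common scaling benchmark.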

  • 22.
    Chien, Steven Wei Der
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Sishtla, Chaitanya Prasad
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Jun, Zhang
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Peng, Ivy Bo
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC. KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Laure, Erwin
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC. KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    An Evaluation of the TensorFlow Programming Model for Solving Traditional HPC Problems. 2018. In: Proceedings of the 5th International Conference on Exascale Applications and Software, The University of Edinburgh, 2018, p. 34-. Conference paper (Refereed)
    Abstract [en]

    Computational intensive applications such as pattern recognition, and natural language processing, are increasingly popular on HPC systems. Many of these applications use deep-learning, a branch of machine learning, to determine the weights of artificial neural network nodes by minimizing a loss function. Such applications depend heavily on dense matrix multiplications, also called tensorial operations. The use of Graphics Processing Unit (GPU) has considerably speeded up deep-learning computations, leading to a Renaissance of the artificial neural network. Recently, the NVIDIA Volta GPU and the Google Tensor Processing Unit (TPU) have been specially designed to support deep-learning workloads. New programming models have also emerged for convenient expression of tensorial operations and deep-learning computational paradigms. An example of such new programming frameworks is TensorFlow, an open-source deep-learning library released by Google in 2015. TensorFlow expresses algorithms as a computational graph where nodes represent operations and edges between nodes represent data flow. Multi-dimensional data such as vectors and matrices which flows between operations are called Tensors. For this reason, computation problems need to be expressed as a computational graph. In particular, TensorFlow supports distributed computation with flexible assignment of operation and data to devices such as GPU and CPU on different computing nodes. Computation on devices are based on optimized kernels such as MKL, Eigen and cuBLAS. Inter-node communication can be through TCP and RDMA. This work attempts to evaluate the usability and expressiveness of the TensorFlow programming model for traditional HPC problems. As an illustration, we prototyped a distributed block matrix multiplication for large dense matrices which cannot be co-located on a single device and a Conjugate Gradient (CG) solver. 
    We evaluate the difficulty of expressing traditional HPC algorithms using computational graphs and study the scalability of distributed TensorFlow on accelerated systems. Our preliminary result with distributed matrix multiplication shows that distributed computation on TensorFlow is extremely scalable. This study provides an initial investigation of new emerging programming models for HPC.
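The block (tiled) matrix multiplication mentioned in this abstract decomposes the product into independent tile-by-tile products, each of which could in principle be assigned to a different device. The sketch below shows only that decomposition in plain Python; it is not the authors' TensorFlow prototype, and the function name, tile size, and matrices are illustrative assumptions.

```python
def block_matmul(A, B, tile):
    """Multiply A (n x k) by B (k x m) tile by tile; matrices are lists of rows."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Each (i0, j0, k0) tile product is an independent unit of work;
                # a distributed version would place these on different devices
                # and accumulate the partial results into the C tile.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        C[i][j] += sum(
                            A[i][kk] * B[kk][j]
                            for kk in range(k0, min(k0 + tile, k))
                        )
    return C


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
C = block_matmul(A, B, tile=2)   # tile size deliberately does not divide 3
```

Because the tile size need not divide the matrix dimensions, the same loop structure handles ragged edge tiles.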

  • 23.
    Chien, Wei Der
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    An Evaluation of TensorFlow as a Programming Framework for HPC Applications. 2018. Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    In recent years, deep learning, a branch of machine learning, has gained increasing popularity due to its extensive applications and performance. At the core of these applications is dense matrix-matrix multiplication. Graphics Processing Units (GPUs) are commonly used in the training process due to their massively parallel computation capabilities. In addition, specialized low-precision accelerators have emerged to specifically address Tensor operations. Software frameworks such as TensorFlow have also emerged to increase the expressiveness of neural network model development. In TensorFlow, computation problems are expressed as computation graphs, where nodes of a graph denote operations and edges denote data movement between operations. With an increasing number of heterogeneous accelerators that might co-exist on the same cluster system, it has become increasingly difficult for users to program efficient and scalable applications. TensorFlow provides a high level of abstraction, and it is possible to place operations of a computation graph on a device easily through a high-level API. In this work, the usability of TensorFlow as a programming framework for HPC applications is reviewed. We give an introduction to TensorFlow as a programming framework and paradigm for distributed computation. Two sample applications are implemented in TensorFlow: tiled matrix multiplication and a conjugate gradient solver for large linear systems. We illustrate how such problems can be expressed as computation graphs for distributed computation. We perform scalability tests, comment on the performance scaling results, and quantify how TensorFlow can take advantage of HPC systems by micro-benchmarking communication performance. Through this work, we show that TensorFlow is an emerging and promising platform that is well suited for a particular class of problems requiring very little synchronization.

  • 24. De La Motte, S. A.
    et al.
    Hollitt, S. E.
    Horsley, R.
    Jackson, P. D.
    Nakamura, Y.
    Perlt, H.
    Pleiter, Dirk
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Rakow, P. E. L.
    Schierholz, G.
    Stüben, H.
    Young, R. D.
    Zanotti, J. M.
    Measurements of SU(3)_f symmetry breaking in B meson decay constants. 2022. In: Proceedings of Science, Sissa Medialab Srl, 2022. Conference paper (Refereed)
    Abstract [en]

    We present updates from QCDSF/UKQCD/CSSM on the SU(3)_f breaking in B meson decay constants. The b-quarks are generated with an anisotropic clover-improved action, and are tuned to match properties of the physical B and B* mesons. Configurations are generated with m = (2m_l + m_s)/3 kept constant to control symmetry breaking effects. Various sources of systematic uncertainty will be discussed, including those from continuum extrapolations and extrapolations to the physical point. We also present new efforts to calculate f_B and f_Bs using weighted averages across multiple time fitting regions. The use of an automated weighted averaging technique over multiple fitting ranges allows for timely tuning of the b-quark and reduces the impact of systematic errors from fitting range biases in calculations of f_B and f_Bs.
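The abstract does not state how the fit ranges are weighted, so the following is only a generic sketch of averaging one quantity over several fit windows. It assumes inverse-variance weights, which is one common choice and may differ from the scheme actually used; the function name and the numbers are hypothetical. The weighted spread of the individual results around the mean is returned as a rough indicator of fit-range dependence.

```python
def weighted_fit_average(estimates, errors):
    """Combine results from several fit ranges using inverse-variance weights (an assumption).

    Returns (weighted mean, weighted spread of the estimates around that mean).
    """
    weights = [1.0 / (err * err) for err in errors]
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, estimates)) / total
    spread = (
        sum(w * (x - mean) ** 2 for w, x in zip(weights, estimates)) / total
    ) ** 0.5
    return mean, spread


# Hypothetical results from two fit windows with equal uncertainties.
mean, spread = weighted_fit_average([1.0, 3.0], [1.0, 1.0])
```

Fits to overlapping time ranges of the same correlator are strongly correlated, so the combined statistical error should not be taken from the weights as if the inputs were independent.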

  • 25.
    D’Orto, Manolo
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS).
    Sjöblom, Svante
    Chien, Lung Sheng
    Axner, Lilit
    ENCCS, Uppsala University.
    Gong, Jing
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Comparing Different Approaches for Solving Large Scale Power-flow Problems with the Newton-Raphson Method. 2021. In: IEEE Access, E-ISSN 2169-3536, Vol. 9, p. 56604-56615. Article in journal (Refereed)
    Abstract [en]

    This paper focuses on using the Newton-Raphson method to solve power-flow problems. Since the most computationally demanding part of the Newton-Raphson method is solving the linear equations at each iteration, this study investigates different approaches to solving the linear equations on both the central processing unit (CPU) and the graphical processing unit (GPU). Six different approaches have been developed and evaluated in this paper: two run entirely on the CPU, two run entirely on the GPU, and the remaining two are hybrid approaches that run on both CPU and GPU. All six direct linear solvers use either LU or QR factorization to solve the linear equations. Two different hardware platforms have been used to conduct the experiments. The performance results show that the CPU version with LU factorization gives better performance than the GPU version using the standard library cuSOLVER, even for the larger power-flow problems. Moreover, it has been shown that the best performance is achieved using a hybrid method where the Jacobian matrix is assembled on the GPU, the preprocessing with a sparse high performance linear solver called KLU is performed on the CPU in the first iteration, and the linear equation is factorized on the GPU and solved on the CPU. The maximum speedup in this study is obtained on the largest case with 25000 buses. The hybrid version shows a speedup factor of 9.6 with an NVIDIA P100 GPU and 13.1 with an NVIDIA V100 GPU in comparison with the baseline CPU version on an Intel Xeon Gold 6132 CPU.
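The structure this abstract refers to, a Newton-Raphson loop whose cost is dominated by one linear solve per iteration, can be sketched as follows. This is a generic illustration, not the paper's solver: it uses dense Gaussian elimination with partial pivoting in place of the sparse LU/QR libraries the paper compares, and the two-equation system is a toy problem rather than a power-flow case. All function names are assumptions.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (dense, small systems)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(M[row][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(col + 1, n):
            factor = M[row][col] / M[col][col]
            for c in range(col, n + 1):
                M[row][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                   # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def newton_raphson(F, J, x0, tol=1e-12, max_iter=50):
    """Iterate x <- x + dx, where J(x) dx = -F(x), until the residual is below tol."""
    x = list(x0)
    for _ in range(max_iter):
        residual = F(x)
        if max(abs(v) for v in residual) < tol:
            break
        dx = solve_linear(J(x), [-v for v in residual])   # the dominant cost per iteration
        x = [xi + di for xi, di in zip(x, dx)]
    return x


# Toy system (not a power-flow problem): x^2 + y^2 = 4 and x = y, so x = y = sqrt(2).
F = lambda v: [v[0] ** 2 + v[1] ** 2 - 4.0, v[0] - v[1]]
J = lambda v: [[2.0 * v[0], 2.0 * v[1]], [1.0, -1.0]]
root = newton_raphson(F, J, [1.0, 2.0])
```

In a power-flow setting the Jacobian is large and sparse, so the `solve_linear` step would be replaced by a sparse factorization, which is the part that can be moved between CPU and GPU.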

  • 26.
    Dugani, Vishwanath
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC. KTH, School of Computer Science and Communication (CSC).
    Continuous system-wide profiling of High Performance Computing parallel applications: Profiling high performance applications. 2016. Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    Profiling an application identifies which parts of the code are being executed, using the hardware performance counters, and thereby characterizes the application's performance. Profiling has long been standard in the development process, focused on a single execution of a single program. As computing systems have evolved, understanding the bigger picture across multiple machines has become increasingly important. As supercomputing grows in pervasiveness and scale, understanding the performance and utilization characteristics of parallel applications is critically important, because even minor performance improvements translate into large cost savings. The study surveys various tools for this purpose, after which Perfminer was integrated into SCANIA's Linux clusters to profile CFD and FEA applications, exploiting the batch queue system features for continuous system-wide profiling, which provides performance insights for high performance applications with negligible overhead. Perfminer provides stable, accurate profiles and a cluster-scale tool for performance analysis. Perfminer effectively highlights the micro-architectural bottlenecks.

  • 27.
    Dyczynski, Matheus
    et al.
    Karolinska Inst, Dept Oncol Pathol, Canc Ctr Karolinska, Stockholm, Sweden..
    Yu, Yasmin
    Karolinska Inst, Dept Oncol Pathol, Canc Ctr Karolinska, Stockholm, Sweden.;Sprint Biosci, Huddinge, Sweden..
    Otrocka, Magdalena
    Karolinska Inst, Dept Med Biochem & Biophys, Sci Life Lab Stockholm, Chem Biol Consortium Sweden, Solna, Sweden..
    Parpal, Santiago
    Sprint Biosci, Huddinge, Sweden..
    Braga, Tiago
    Sprint Biosci, Huddinge, Sweden..
    Henley, Aine Brigette
    Sprint Biosci, Huddinge, Sweden..
    Zazzi, Henric
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Lerner, Mikael
    Karolinska Inst, Dept Oncol Pathol, Canc Ctr Karolinska, Stockholm, Sweden..
    Wennerberg, Krister
    Univ Helsinki, Inst Mol Med Finland, FIMM, Helsinki, Finland..
    Viklund, Jenny
    Sprint Biosci, Huddinge, Sweden..
    Martinsson, Jessica
    Sprint Biosci, Huddinge, Sweden..
    Grander, Dan
    Karolinska Inst, Dept Oncol Pathol, Canc Ctr Karolinska, Stockholm, Sweden..
    De Milito, Angelo
    Karolinska Inst, Dept Oncol Pathol, Canc Ctr Karolinska, Stockholm, Sweden.;Sprint Biosci, Huddinge, Sweden..
    Tamm, Katja Pokrovskaja
    Karolinska Inst, Dept Oncol Pathol, Canc Ctr Karolinska, Stockholm, Sweden..
    Targeting autophagy by small molecule inhibitors of vacuolar protein sorting 34 (Vps34) improves the sensitivity of breast cancer cells to Sunitinib. 2018. In: Cancer Letters, ISSN 0304-3835, E-ISSN 1872-7980, Vol. 435, p. 32-43. Article in journal (Refereed)
    Abstract [en]

    Resistance to chemotherapy is a challenging problem for treatment of cancer patients and autophagy has been shown to mediate development of resistance. In this study we systematically screened a library of 306 known anti-cancer drugs for their ability to induce autophagy using a cell-based assay. 114 of the drugs were classified as autophagy inducers; for 16 drugs, the cytotoxicity was potentiated by siRNA-mediated knock-down of Atg7 and Vps34. These drugs were further evaluated in breast cancer cell lines for autophagy induction, and two tyrosine kinase inhibitors, Sunitinib and Erlotinib, were selected for further studies. For the pharmacological inhibition of autophagy, we have characterized here a novel highly potent selective inhibitor of Vps34, SB02024. SB02024 blocked autophagy in vitro and reduced xenograft growth of two breast cancer cell lines, MDA-MB-231 and MCF-7, in vivo. The Vps34 inhibitor significantly potentiated cytotoxicity of Sunitinib and Erlotinib in MCF-7 and MDA-MB-231 in vitro in monolayer cultures and when grown as multicellular spheroids. Our data suggest that inhibition of autophagy significantly improves sensitivity to Sunitinib and Erlotinib and that Vps34 is a promising therapeutic target for combination strategies in breast cancer.

  • 28.
    Dykes, Tim
    et al.
    HPE HPC/AI EMEA Research Lab.
    Foyer, Clément
    HPE HPC/AI EMEA Research Lab, HPC Research Group, Univ. of Bristol.
    Richardson, Harvey
    HPE HPC/AI EMEA Research Lab.
    Svedin, Martin
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Tate, Adrian
    Numerical Algorithms Group Ltd. (NAG).
    McIntosh-Smith, Simon
    HPC Research Group, Univ. of Bristol.
    Mamba: Portable Array-based Abstractions for Heterogeneous High-Performance Systems. 2021. In: 2021 International Workshop on Performance, Portability and Productivity in HPC (P3HPC), Institute of Electrical and Electronics Engineers (IEEE), 2021. Conference paper (Refereed)
    Abstract [en]

    High performance computing architectures have become increasingly heterogeneous in recent times. This growing architectural variety presents a multi-faceted portability problem affecting applications, libraries, programming models, languages, compilers, run-times, and system software. Approaches for performance portability typically focus heavily on efficient usage of parallel compute architectures and less on data locality abstractions and complex memory systems, with minimal support afforded to effective memory management in traditional HPC languages such as C and Fortran. We present Mamba, a library to facilitate usage of heterogeneous memory systems by high performance application/library developers through high level array-based abstractions for memory management supported by a low-level generic memory API. We detail the library design and implementation, demonstrating generic memory allocation, data layout specification, array tiling and heterogeneous transport. We evaluate performance in the context of a typical matrix transposition, DNA sequencing benchmark, and an application use case for high-order spectral element based incompressible flow.

  • 29. Eliasson, P.
    et al.
    Gong, Jing
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Nordström, J.
    A stable and conservative coupling of the unsteady compressible Navier-Stokes equations at interfaces using finite difference and finite volume methods. 2018. In: AIAA Aerospace Sciences Meeting, 2018, American Institute of Aeronautics and Astronautics Inc, AIAA, 2018, no 210059. Conference paper (Refereed)
    Abstract [en]

    Stable and conservative interface boundary conditions are developed for the unsteady compressible Navier-Stokes equations using finite difference and finite volume methods. The finite difference approach is based on summation-by-parts operators and can be made higher order accurate with boundary conditions imposed weakly. The finite volume approach is an edge- and dual grid-based approach for unstructured grids, formally second order accurate in space, with weak boundary conditions as well. Stable and conservative weak boundary conditions are derived for interfaces between finite difference methods, for finite volume methods and for the coupling between the two approaches. The three types of interface boundary conditions are demonstrated for two test cases. Firstly, inviscid vortex propagation with a known analytical solution is considered. The results show expected error decays as the grid is refined for various couplings and spatial accuracy of the finite difference scheme. The second test case involves viscous laminar flow over a cylinder with vortex shedding. Calculations with various coupling and spatial accuracies of the finite difference solver show that the couplings work as expected and that the higher order finite difference schemes provide enhanced vortex propagation.

  • 30.
    Eriksson, Olivia
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). KTH, Centres, SeRC - Swedish e-Science Research Centre.
    Laure, Erwin
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC. KTH, Centres, SeRC - Swedish e-Science Research Centre.
    Lindahl, Erik
    KTH, School of Engineering Sciences (SCI), Applied Physics, Biophysics. KTH, Centres, SeRC - Swedish e-Science Research Centre.
    Henningson, Dan S.
    KTH, School of Engineering Sciences (SCI), Mechanics, Stability, Transition and Control. KTH, Centres, SeRC - Swedish e-Science Research Centre.
    Ynnerman, Anders
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). KTH, Centres, SeRC - Swedish e-Science Research Centre.
    e-Science in Scandinavia. 2018. In: Informatik-Spektrum, ISSN 0170-6012, E-ISSN 1432-122X, Vol. 41, no 6, p. 398-404. Article in journal (Refereed)
  • 31. Fedorov, V. A.
    et al.
    Kholina, E. G.
    Kovalenko, I. B.
    Gudimchuk, N. B.
    Orekhov, P. S.
    Zhmurov, Artem
    KTH, Centres, Science for Life Laboratory, SciLifeLab. KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC. KTH, Centres, SeRC - Swedish e-Science Research Centre.
    Update on Performance Analysis of Different Computational Architectures: Molecular Dynamics in Application to Protein-Protein Interactions. 2020. In: Supercomputing Frontiers and Innovations, ISSN 2409-6008, Vol. 7, no 4, p. 62-67. Article in journal (Refereed)
    Abstract [en]

    Molecular dynamics has proved itself as a powerful computer simulation method to study dynamics, conformational changes, and interactions of biological macromolecules and their complexes. In order to achieve the best performance and efficiency, it is crucial to benchmark various hardware platforms for the simulations of realistic biomolecular systems of different sizes and timescales. Here, we compare performance and scalability of a number of commercially available computing architectures using all-atom and coarse-grained molecular dynamics simulations of water and the Ndc80-microtubule protein complex in the GROMACS-2019.4 package. We report typical single-node performance of various combinations of modern CPUs and GPUs, as well as multiple-node performance of the “Lomonosov-2” supercomputer. These data can be used as practical guidelines for choosing optimal hardware for molecular dynamics simulations.

  • 32.
    Haine, Christopher
    et al.
    HPE HPC AI Res Lab, Basel, Switzerland..
    Haus, Utz-Uwe
    HPE HPC AI Res Lab, Basel, Switzerland..
    Martinasso, Maxime
    Swiss Natl Supercomp Ctr, CSCS, CH-6900 Lugano, Switzerland..
    Pleiter, Dirk
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC. Forschungszentrum Julich, D-52425 Julich, Germany..
    Tessier, Francois
    Inria Rennes Bretagne Atlantique, F-35042 Rennes, France..
    Sarmany, Domokos
    European Ctr Medium Range Weather Forecasts ECMWF, Reading RG2 9AX, Berks, England..
    Smart, Simon
    European Ctr Medium Range Weather Forecasts ECMWF, Reading RG2 9AX, Berks, England..
    Quintino, Tiago
    European Ctr Medium Range Weather Forecasts ECMWF, Reading RG2 9AX, Berks, England..
    Tate, Adrian
    NAG, Oxford, England..
    A Middleware Supporting Data Movement in Complex and Software-Defined Storage and Memory Architectures. 2021. In: High Performance Computing - ISC High Performance Digital 2021 International Workshops / [ed] Jagode, H; Anzt, H; Ltaief, H; Luszczek, P, Springer Nature, 2021, Vol. 12761, p. 346-357. Conference paper (Refereed)
    Abstract [en]

    Among the broad variety of challenges that arise from workloads in a converged HPC and Cloud infrastructure, data movement is of paramount importance, especially on upcoming exascale systems featuring multiple tiers of memory and storage. While the focus has, for years, been primarily on optimizing computations, the importance of improving data handling on such architectures is now well understood. As optimization techniques can be applied at different stages (operating system, run-time system, programming environment, and so on), a middleware providing a uniform and consistent data awareness becomes necessary. In this paper, we introduce a novel memory- and data-aware middleware called Maestro, designed for data orchestration.

  • 33.
    Hasan, Md Nur
    et al.
    Department of Chemical and Biological Sciences, S. N. Bose National Centre for Basic Sciences, Block JD, Sector-III, SaltLake, Kolkata 700 106, India.
    Bharati, Ritadip
    School of Physical Sciences, National Institute of Science Education and Research HBNI, Jatni - 752050, Odisha, India.
    Hellsvik, Johan
    KTH, School of Engineering Sciences (SCI), Physics, Condensed Matter Theory. KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Delin, Anna
    KTH, School of Engineering Sciences (SCI), Applied Physics, Materials and Nanophysics. KTH, Centres, SeRC - Swedish e-Science Research Centre.
    Pal, Samir Kumar
    Department of Chemical and Biological Sciences, S. N. Bose National Centre for Basic Sciences, Block JD, Sector-III, SaltLake, Kolkata 700 106, India.
    Bergman, Anders
    Department of Physics and Astronomy, Uppsala University, Box 516, SE-751 20 Uppsala, Sweden.
    Sharma, Shivalika
    Asia Pacific Center for Theoretical Physics, Pohang 37673, Republic of Korea.
    Di Marco, Igor
    Department of Physics and Astronomy, Uppsala University, Box 516, SE-751 20 Uppsala, Sweden; Asia Pacific Center for Theoretical Physics, Pohang 37673, Republic of Korea; Department of Physics, Pohang University of Science and Technology, Pohang 37673, Republic of Korea.
    Pereiro, Manuel
    Department of Physics and Astronomy, Uppsala University, Box 516, SE-751 20 Uppsala, Sweden.
    Thunström, Patrik
    Department of Physics and Astronomy, Uppsala University, Box 516, SE-751 20 Uppsala, Sweden.
    Oppeneer, Peter M.
    Department of Physics and Astronomy, Uppsala University, Box 516, SE-751 20 Uppsala, Sweden.
    Eriksson, Olle
    Department of Physics and Astronomy, Uppsala University, Box 516, SE-751 20 Uppsala, Sweden.
    Karmakar, Debjani
    Department of Physics and Astronomy, Uppsala University, Box 516, SE-751 20 Uppsala, Sweden; Technical Physics Division, Bhabha Atomic Research Centre, Mumbai 400085, India.
    Magnetism in AV3Sb5 (A=Cs, Rb, and K): Origin and Consequences for the Strongly Correlated Phases. 2023. In: Physical Review Letters, ISSN 0031-9007, E-ISSN 1079-7114, Vol. 131, no 19, article id 196702. Article in journal (Refereed)
    Abstract [en]

    The V-based kagome systems AV3Sb5 (A=Cs, Rb, and K) are unique by virtue of the intricate interplay of nontrivial electronic structure, topology, and intriguing fermiology, rendering them a playground of many mutually dependent exotic phases like charge-order and superconductivity. Despite numerous recent studies, the interconnection of magnetism and other complex collective phenomena in these systems remains unresolved. Using first-principles tools, we demonstrate that their electronic structures, complex fermiologies and phonon dispersions are strongly influenced by the interplay of dynamic electron correlations, nontrivial spin-polarization and spin-orbit coupling. An investigation of the first-principles-derived intersite magnetic exchanges with the complementary analysis of q dependence of the electronic response functions and the electron-phonon coupling indicates that the system conforms as a frustrated spin cluster, where the occurrence of the charge-order phase is intimately related to the mechanism of electron-phonon coupling, rather than the Fermi-surface nesting.

  • 34.
    Jansen, Karin A.
    et al.
    AMOLF, Biol Soft Matter Grp, Utrecht, Netherlands.;UMC Utrecht, Dept Pathol, NL-3508 GA Utrecht, Netherlands..
    Zhmurov, Artem
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC. Sechenov Univ, Moscow 119991, Russia..
    Vos, Bart E.
    AMOLF, Biol Soft Matter Grp, Utrecht, Netherlands.;Univ Munster, Ctr Mol Biol Inflammat, Inst Cell Biol, Munster, Germany..
    Portale, Giuseppe
    Univ Groningen, Zernike Inst Adv Mat, Macromol Chem & New Polymer Mat, Nijenborgh 4, NL-9747 AG Groningen, Netherlands..
    Hermida-Merino, Daniel
    ESRF, DUBBLE CRG, Netherlands Org Sci Res NWO, 71 Ave Martyrs, F-38000 Grenoble, France..
    Litvinov, Rustem I.
    Univ Penn, Perelman Sch Med, Dept Cell & Dev Biol, Philadelphia, PA 19104 USA.;Kazan Fed Univ, Inst Fundamental Med & Biol, 18 Kremlyovskaya St, Kazan 420008, Russia..
    Tutwiler, Valerie
    Univ Penn, Perelman Sch Med, Dept Cell & Dev Biol, Philadelphia, PA 19104 USA..
    Kurniawan, Nicholas A.
    AMOLF, Biol Soft Matter Grp, Utrecht, Netherlands.;Eindhoven Univ Technol, Dept Biomed Engn, Eindhoven, Netherlands.;Eindhoven Univ Technol, Inst Complex Mol Syst, Eindhoven, Netherlands..
    Bras, Wim
    ESRF, DUBBLE CRG, Netherlands Org Sci Res NWO, 71 Ave Martyrs, F-38000 Grenoble, France.;Oak Ridge Natl Lab, Chem Sci Div, One Bethel Valley Rd, Oak Ridge, TN 37831 USA..
    Weisel, John W.
    Univ Penn, Perelman Sch Med, Dept Cell & Dev Biol, Philadelphia, PA 19104 USA..
    Barsegov, Valeri
    Univ Massachusetts, Dept Chem, 1 Univ Ave, Lowell, MA 01854 USA..
    Koenderink, Gijsje H.
    AMOLF, Biol Soft Matter Grp, Utrecht, Netherlands.;Delft Univ Technol, Kavli Inst Nanosci Delft, Dept Bionanosci, Maasweg 9, NL-2629 HZ Delft, Netherlands..
    Molecular packing structure of fibrin fibers resolved by X-ray scattering and molecular modeling. 2020. In: Soft Matter, ISSN 1744-683X, E-ISSN 1744-6848, Vol. 16, no 35, p. 8272-8283. Article in journal (Refereed)
    Abstract [en]

    Fibrin is the major extracellular component of blood clots and a proteinaceous hydrogel used as a versatile biomaterial. Fibrin forms branched networks built of laterally associated double-stranded protofibrils. This multiscale hierarchical structure is crucial for the extraordinary mechanical resilience of blood clots, yet the structural basis of clot mechanical properties remains largely unclear due, in part, to the unresolved molecular packing of fibrin fibers. Here the packing structure of fibrin fibers is quantitatively assessed by combining Small Angle X-ray Scattering (SAXS) measurements of fibrin reconstituted under a wide range of conditions with computational molecular modeling of fibrin protofibrils. The number, positions, and intensities of the Bragg peaks observed in the SAXS experiments were reproduced computationally based on the all-atom molecular structure of reconstructed fibrin protofibrils. Specifically, the model correctly predicts the intensities of the reflections of the 22.5 nm axial repeat, corresponding to the half-staggered longitudinal arrangement of fibrin molecules. In addition, the SAXS measurements showed that protofibrils within fibrin fibers have a partially ordered lateral arrangement with a characteristic transverse repeat distance of 13 nm, irrespective of the fiber thickness. These findings provide fundamental insights into the molecular structure of fibrin clots that underlies their biological and physical properties.

  • 35.
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    A Hybrid MPI+PGAS Approach to Improve Strong Scalability Limits of Finite Element Solvers. 2020. In: Proceedings - IEEE International Conference on Cluster Computing, ICCC, Institute of Electrical and Electronics Engineers (IEEE), 2020, p. 303-313. Conference paper (Refereed)
    Abstract [en]

    Current finite element codes scale reasonably well as long as each core has a sufficient amount of local work that can balance communication costs. However, achieving efficient performance at exascale will require unreasonably large problem sizes, in particular for low-order methods, where the small amount of work per element already is a limiting factor on current post petascale machines. Key bottlenecks for these methods are sparse matrix assembly, where communication latency starts to limit performance as the number of cores increases, and linear solvers, where efficient overlapping is necessary to amortize communication and synchronization cost of sparse matrix vector multiplication and dot products. We present our work on improving strong scalability limits of message passing based general low-order finite element based solvers. Using lightweight one-sided communication offered by partitioned global address space languages (PGAS), we demonstrate that the scalability of performance critical, latency sensitive sparse matrix assembly can be improved by almost an order of magnitude. Linear solvers are also addressed via a signaling put algorithm for low-cost point-to-point synchronization, achieving similar performance as message passing based linear solvers. We introduce a new hybrid MPI+PGAS implementation of the open source general finite element framework FEniCS, replacing the linear algebra backend with a new library written in Unified Parallel C (UPC). A detailed description of the implementation and the hybrid interface to FEniCS is given, and the feasibility of the approach is demonstrated via a performance study of the hybrid implementation on Cray XC40 machines.

  • 36.
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Improving Strong Scalability Limits of Finite Element Based Solvers. 2019. Conference paper (Refereed)
    Abstract [en]

    Current finite element codes scale reasonably well as long as each core has a sufficient amount of local work that can balance communication costs. However, achieving efficient performance at exascale will require unreasonably large problem sizes, in particular for low-order methods, where the small amount of work per element already is a limiting factor on current post petascale machines. One of the key bottlenecks for these methods is sparse matrix assembly, where communication latency starts to limit performance as the number of cores increases. We present our work on improving strong scalability limits of message passing based general low-order finite element based solvers. Using lightweight one-sided communication, we demonstrate that the scalability of performance critical, latency sensitive kernels can be improved by almost an order of magnitude. We introduce a new hybrid MPI/PGAS implementation of the open source general finite element framework FEniCS, replacing the linear algebra backend with a new library written in UPC. A detailed description of the implementation and the hybrid interface to FEniCS is given, and we present a detailed performance study of the hybrid implementation on Cray XC40 machines.

  • 37.
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Spectral Element Simulations on the NEC SX-Aurora TSUBASA. 2021. In: HPC Asia 2021: The International Conference on High Performance Computing in Asia-Pacific Region, Association for Computing Machinery (ACM), 2021. Conference paper (Refereed)
    Abstract [en]

    Following the recent transition in the high performance computing landscape to more heterogeneous architectures, application developers are faced with the challenge of ensuring good performance across a diverse set of platforms. In this paper, we present our work on porting the spectral element code Nek5000 to the recent vector architecture SX-Aurora TSUBASA. Using Nek5000's mini-app Nekbone, we formulate suitable loop transformations in key kernels, allowing for better vectorization, increasing the baseline performance by a factor of six. Using the new transformations, we demonstrate that the main compute intensive matrix-vector and matrix-matrix multiplication kernels achieve close to half the peak performance of an SX-Aurora core. Our work also addresses the gather-scatter operations, a key kernel for efficient matrix-free spectral element formulation. We introduce a new implementation of Nek5000's gather-scatter library with mesh topology awareness for improved vectorization via exploitation of the SX-Aurora's hardware gather-scatter instructions, improving performance by up to 116%. A detailed description of the implementation is given together with a performance study, comparing both single node performance and strong scalability characteristics, running across multiple SX-Aurora cards.

  • 38.
    Jansson, Niclas
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Karp, Martin
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Perez, Adalberto
    KTH, School of Engineering Sciences (SCI), Engineering Mechanics, Fluid Mechanics and Engineering Acoustics.
    Mukha, Timofey
    KTH, School of Engineering Sciences (SCI), Engineering Mechanics, Fluid Mechanics and Engineering Acoustics.
    Ju, Yi
    Max Planck Computing and Data Facility, Garching, Germany.
    Liu, Jiahui
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Pall, Szilard
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Laure, Erwin
    Max Planck Computing and Data Facility, Garching, Germany.
    Weinkauf, Tino
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Schumacher, Jörg
    Technische Universität Ilmenau, Ilmenau, Germany.
    Schlatter, Philipp
    Friedrich-Alexander-Universität (FAU) Erlangen-Nürnberg, Germany.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Exploring the Ultimate Regime of Turbulent Rayleigh–Bénard Convection Through Unprecedented Spectral-Element Simulations. 2023. In: SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Association for Computing Machinery (ACM), 2023, p. 1-9, article id 5. Conference paper (Refereed)
    Abstract [en]

    We detail our developments in the high-fidelity spectral-element code Neko that are essential for unprecedented large-scale direct numerical simulations of fully developed turbulence. Major innovations are modular multi-backend design enabling performance portability across a wide range of GPUs and CPUs, a GPU-optimized preconditioner with task overlapping for the pressure-Poisson equation and in-situ data compression. We carry out initial runs of Rayleigh–Bénard Convection (RBC) at extreme scale on the LUMI and Leonardo supercomputers. We show how Neko is able to strongly scale to 16,384 GPUs and obtain results that are not possible without careful consideration and optimization of the entire simulation workflow. These developments in Neko will help resolving the long-standing question regarding the ultimate regime in RBC.

  • 39.
    Jansson, Niclas
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Karp, Martin
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Software and Computer systems, SCS.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Schlatter, Philipp
    KTH, School of Engineering Sciences (SCI), Engineering Mechanics, Fluid Mechanics and Engineering Acoustics, Turbulent simulations laboratory.
    Neko: A modern, portable, and scalable framework for high-fidelity computational fluid dynamics. 2024. In: Computers & Fluids, ISSN 0045-7930, E-ISSN 1879-0747, Vol. 275, p. 106243-106243, article id 106243. Article in journal (Refereed)
    Abstract [en]

    Computational fluid dynamics (CFD), in particular applied to turbulent flows, is a research area with great engineering and fundamental physical interest. However, already at moderately high Reynolds numbers the computational cost becomes prohibitive as the range of active spatial and temporal scales is quickly widening. Specifically scale-resolving simulations, including large-eddy simulation (LES) and direct numerical simulations (DNS), thus need to rely on modern efficient numerical methods and corresponding software implementations. Recent trends and advancements, including more diverse and heterogeneous hardware in High-Performance Computing (HPC), are challenging software developers in their pursuit of good performance and numerical stability. The well-known maxim “software outlives hardware” may no longer necessarily hold true, and developers are today forced to re-factor their codebases to leverage these powerful new systems. In this paper, we present Neko, a new portable framework for high-order spectral element discretization, targeting turbulent flows in moderately complex geometries. Neko is fully available as open software. Unlike prior works, Neko adopts a modern object-oriented approach in Fortran 2008, allowing multi-tier abstractions of the solver stack and facilitating hardware backends ranging from general-purpose processors (CPUs) down to exotic vector processors and FPGAs. We show that Neko’s performance and accuracy are comparable to NekRS, and thus on par with Nek5000’s successor on modern CPU machines. Furthermore, we develop a performance model, which we use to discuss challenges and opportunities for high-order solvers on emerging hardware.

  • 40.
    Ju, Yi
    et al.
    Max Planck Computing and Data Facility.
    Li, Mingshuai
    Technical University of Munich.
    Perez, Adalberto
    KTH, School of Engineering Sciences (SCI), Engineering Mechanics, Fluid Mechanics and Engineering Acoustics.
    Bellentani, Laura
    CINECA.
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Schlatter, Philipp
    KTH, School of Engineering Sciences (SCI), Engineering Mechanics, Fluid Mechanics and Engineering Acoustics. Friedrich-Alexander-Universität Erlangen-Nürnberg.
    Laure, Erwin
    Max Planck Computing and Data Facility.
    In-Situ Techniques on GPU-Accelerated Data-Intensive Applications. 2023. In: Proceedings 2023 IEEE 19th International Conference on e-Science, e-Science 2023, Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper (Refereed)
    Abstract [en]

    The computational power of High-Performance Computing (HPC) systems is constantly increasing; however, their input/output (IO) performance grows relatively slowly, and their storage capacity is also limited. This imbalance presents significant challenges for applications such as Molecular Dynamics (MD) and Computational Fluid Dynamics (CFD), which generate massive amounts of data for further visualization or analysis. At the same time, checkpointing is crucial for long runs on HPC clusters, due to limited walltimes and/or failures of system components, and typically requires the storage of large amounts of data. Thus, restricted IO performance and storage capacity can lead to bottlenecks for the performance of full application workflows (as compared to computational kernels without IO). In-situ techniques, where data is further processed while still in memory rather than written out over the IO subsystem, can help to tackle these problems. In contrast to traditional post-processing methods, in-situ techniques can reduce or avoid the need to write or read data via the IO subsystem. They offer a promising approach for applications aiming to leverage the full power of large scale HPC systems. In-situ techniques can also be applied to hybrid computational nodes on HPC systems consisting of graphics processing units (GPUs) and central processing units (CPUs). On one node, the GPUs would have significant performance advantages over the CPUs. Therefore, current approaches for GPU-accelerated applications often focus on maximizing GPU usage, leaving CPUs underutilized. In-situ tasks that use CPUs to perform data analysis or preprocess data concurrently with the running simulation offer a possibility to reduce this underutilization.

  • 41.
    Karmakar, Debjani
    et al.
    Department of Physics and Astronomy, Uppsala University, Box 516, SE-751 20 Uppsala, Sweden; Technical Physics Division, Bhabha Atomic Research Centre, Mumbai 400085, India.
    Pereiro, Manuel
    Department of Physics and Astronomy, Uppsala University, Box 516, SE-751 20 Uppsala, Sweden.
    Hasan, Md Nur
    Department of Chemical and Biological Sciences, S. N. Bose National Centre for Basic Sciences, Block JD, Sector-III, SaltLake, Kolkata 700 106, India.
    Bharati, Ritadip
    School of Physical Sciences, National Institute of Science Education and Research, Homi Bhabha National Institute (HBNI), Jatni, 752050 Odisha, India.
    Hellsvik, Johan
    KTH, School of Engineering Sciences (SCI), Physics, Condensed Matter Theory. KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Delin, Anna
    KTH, School of Engineering Sciences (SCI), Applied Physics, Materials and Nanophysics. KTH, Centres, SeRC - Swedish e-Science Research Centre.
    Pal, Samir Kumar
    Department of Chemical and Biological Sciences, S. N. Bose National Centre for Basic Sciences, Block JD, Sector-III, SaltLake, Kolkata 700 106, India.
    Bergman, Anders
    Department of Physics and Astronomy, Uppsala University, Box 516, SE-751 20 Uppsala, Sweden.
    Sharma, Shivalika
    Asia Pacific Center for Theoretical Physics, Pohang 37673, Republic of Korea.
    Di Marco, Igor
    Department of Physics and Astronomy, Uppsala University, Box 516, SE-751 20 Uppsala, Sweden; Asia Pacific Center for Theoretical Physics, Pohang 37673, Republic of Korea; Department of Physics, Pohang University of Science and Technology, Pohang 37673, Republic of Korea.
    Thunström, Patrik
    Department of Physics and Astronomy, Uppsala University, Box 516, SE-751 20 Uppsala, Sweden.
    Oppeneer, Peter M.
    Department of Physics and Astronomy, Uppsala University, Box 516, SE-751 20 Uppsala, Sweden.
    Eriksson, Olle
    Department of Physics and Astronomy, Uppsala University, Box 516, SE-751 20 Uppsala, Sweden.
    Magnetism in AV3Sb5 (A=Cs, Rb, K): Complex landscape of dynamical magnetic textures. 2023. In: Physical Review B, ISSN 2469-9950, E-ISSN 2469-9969, Vol. 108, no 17, article id 174413. Article in journal (Refereed)
    Abstract [en]

    We have investigated the dynamical magnetic properties of the V-based kagome stibnite compounds by combining the ab initio-extracted magnetic parameters of a spin-Hamiltonian, like inter-site exchange parameters, magnetocrystalline anisotropy and site projected magnetic moments, with full-fledged simulations of atomistic spin dynamics. Our calculations reveal that, in addition to a ferromagnetic order along the [001] direction, the system hosts a complex landscape of magnetic configurations comprised of commensurate and incommensurate spin spirals along the [010] direction. The presence of such chiral magnetic textures may be the key toward solving the mystery about the origin of the experimentally observed inherent breaking of the C6 rotational, mirror, and the time-reversal symmetry.

  • 42.
    Karp, Martin
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Schlatter, Philipp
    KTH, School of Engineering Sciences (SCI), Centres, Linné Flow Center, FLOW. KTH, Centres, SeRC - Swedish e-Science Research Centre. KTH, School of Engineering Sciences (SCI), Engineering Mechanics, Fluid Mechanics and Engineering Acoustics.
    Markidis, Stefano
    KTH, Centres, SeRC - Swedish e-Science Research Centre. KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Optimization of Tensor-product Operations in Nekbone on GPUs. 2020. Conference paper (Refereed)
    Abstract [en]

    In the CFD solver Nek5000, the computation is dominated by the evaluation of small tensor operations. Nekbone is a proxy app for Nek5000 and has previously been ported to GPUs with a mixed OpenACC and CUDA approach. In this work, we continue this effort and optimize the main tensor-product operation in Nekbone further. Our optimization is done in CUDA and uses a different, 2D, thread structure to make the computations layer by layer. This enables us to use loop unrolling as well as utilize registers and shared memory efficiently. Our implementation is then compared on both the Pascal and Volta GPU architectures to previous GPU versions of Nekbone as well as a measured roofline. The results show that our implementation outperforms previous GPU Nekbone implementations by 6-10%. Compared to the measured roofline, we obtain 77-92% of the peak performance for both Nvidia P100 and V100 GPUs for inputs with 1024-4096 elements and polynomial degree 9.

  • 43.
    Karp, Martin
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Schlatter, Philipp
    KTH, School of Engineering Sciences (SCI), Engineering Mechanics, Fluid Mechanics and Engineering Acoustics.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Reducing Communication in the Conjugate Gradient Method: A Case Study on High-Order Finite Elements. 2022. In: Proceedings of the Platform for Advanced Scientific Computing Conference, PASC 2022, Association for Computing Machinery (ACM), 2022, article id 2. Conference paper (Refereed)
    Abstract [en]

    Currently, a major bottleneck for several scientific computations is communication, both communication between different processors, so-called horizontal communication, and vertical communication between different levels of the memory hierarchy. With this bottleneck in mind, we target a notoriously communication-bound solver at the core of many high-performance applications, namely the conjugate gradient method (CG). To reduce the communication we present lower bounds on the vertical data movement in CG and go on to make a CG solver with reduced data movement. Using our theoretical analysis we apply our CG solver on a high-performance discretization used in practice, the spectral element method (SEM). Guided by our analysis, we show that for the Poisson equation on modern GPUs we can improve the performance by 30% by both rematerializing the discrete system and by reformulating the system to work on unique degrees of freedom. In order to investigate how horizontal communication can be reduced, we compare CG to two communication-reducing techniques, namely communication-avoiding and pipelined CG. We strong scale up to 4096 CPU cores and showcase performance improvements of upwards of 70% for pipelined CG compared to standard CG when applied on SEM at scale. We show that in addition to improving the scaling capabilities of the solver, initial measurements indicate that the convergence of SEM is largely unaffected by pipelined CG.

  • 44.
    Karp, Martin
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Liu, Felix
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). Raysearch Laboratories..
    Stanly, Ronith
    KTH, School of Engineering Sciences (SCI), Engineering Mechanics, Fluid Mechanics and Engineering Acoustics.
    Rezaeiravesh, Saleh
    The University of Manchester, Manchester, United Kingdom.
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Schlatter, Philipp
    KTH, School of Engineering Sciences (SCI), Engineering Mechanics, Fluid Mechanics and Engineering Acoustics. Friedrich-Alexander-Universität (FAU) Erlangen-Nürnberg, Erlangen, Germany.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Uncertainty Quantification of Reduced-Precision Time Series in Turbulent Channel Flow. 2023. In: Proceedings of 2023 SC Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC Workshops 2023, Association for Computing Machinery (ACM), 2023, p. 387-390. Conference paper (Refereed)
    Abstract [en]

    With increased computational power through the use of arithmetic in low-precision, a relevant question is how lower precision affects simulation results, especially for chaotic systems where analytical round-off estimates are non-trivial to obtain. In this work, we consider how the uncertainty of the time series of a direct numerical simulation of turbulent channel flow at Re_τ = 180 is affected when restricted to a reduced-precision representation. We utilize a non-overlapping batch means estimator and find that the mean statistics can, in this case, be obtained with significantly fewer mantissa bits than conventional IEEE-754 double precision, but that the mean values are observed to be more sensitive in the middle of the channel than in the near-wall region. This indicates that using lower precision in the near-wall region, where the majority of the computational efforts are required, may benefit from low-precision floating point units found in upcoming computer hardware.

  • 45.
    Karp, Martin
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). Division of Computational Science and Technology, EECS, KTH Royal Institute of Technology, Stockholm, Sweden.
    Massaro, Daniele
    KTH, School of Engineering Sciences (SCI), Engineering Mechanics, Fluid Mechanics and Engineering Acoustics. SimEx/FLOW, Engineering Mechanics, KTH Royal Institute of Technology, Stockholm, Sweden.
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC. PDC Centre for High Performance Computing, EECS, KTH Royal Institute of Technology, Stockholm, Sweden.
    Hart, Alistair
    Hewlett Packard Enterprise (HPE), UK.
    Wahlgren, Jacob
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). Division of Computational Science and Technology, EECS, KTH Royal Institute of Technology, Stockholm, Sweden.
    Schlatter, Philipp
    KTH, School of Engineering Sciences (SCI), Engineering Mechanics, Fluid Mechanics and Engineering Acoustics. SimEx/FLOW, Engineering Mechanics, KTH Royal Institute of Technology, Stockholm, Sweden.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). Division of Computational Science and Technology, EECS, KTH Royal Institute of Technology, Stockholm, Sweden.
    Large-scale direct numerical simulations of turbulence using GPUs and modern Fortran. 2023. In: The international journal of high performance computing applications, ISSN 1094-3420, E-ISSN 1741-2846. Article in journal (Refereed)
    Abstract [en]

    We present our approach to making direct numerical simulations of turbulence with applications in sustainable shipping. We use modern Fortran and the spectral element method to leverage and scale on supercomputers powered by the Nvidia A100 and the recent AMD Instinct MI250X GPUs, while still providing support for user software developed in Fortran. We demonstrate the efficiency of our approach by performing the world’s first direct numerical simulation of the flow around a Flettner rotor at Re = 30,000 and its interaction with a turbulent boundary layer. We present a performance comparison between the AMD Instinct MI250X and Nvidia A100 GPUs for scalable computational fluid dynamics. Our results show that one MI250X offers performance on par with two A100 GPUs and has a similar power efficiency based on readings from on-chip energy sensors.

  • 46.
    Karp, Martin
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC. KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Kenter, Tobias
    Paderborn University.
    Plessl, Christian
    Paderborn University.
    Schlatter, Philipp
    KTH, School of Engineering Sciences (SCI), Engineering Mechanics, Fluid Mechanics and Engineering Acoustics. KTH, School of Engineering Sciences (SCI), Centres, Linné Flow Center, FLOW. KTH, Centres, SeRC - Swedish e-Science Research Centre.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). KTH, Centres, SeRC - Swedish e-Science Research Centre.
    Appendix to High-Performance Spectral Element Methods on Field-Programmable Gate Arrays. 2020. Other (Other academic)
    Abstract [en]

    In this Appendix we display some results we omitted from our article "High-Performance Spectral Element Methods on Field-Programmable Gate Arrays". In particular, we showcase the measured bandwidth for the FPGA we used (Stratix 10) as well as the performance of our accelerator at different stages of optimization. In addition, we illustrate more practical aspects of our performance/resource modeling.

    Improvements in computer systems have historically relied on two well-known observations: Moore's law and Dennard's scaling. Today, both these observations are ending, forcing computer users, researchers, and practitioners to abandon the comforts of general-purpose architectures in favor of emerging post-Moore systems. Among the most salient of these post-Moore systems is the Field-Programmable Gate Array (FPGA), which strikes a good balance between complexity and performance. In this paper, we study modern FPGAs' applicability for use in accelerating the Spectral Element Method (SEM) core to many computational fluid dynamics (CFD) applications. We design a custom SEM hardware accelerator that we empirically evaluate on the latest Stratix 10 SX-series FPGAs and position its performance (and power-efficiency) against state-of-the-art systems such as ARM ThunderX2, NVIDIA Pascal/Volta/Ampere Tesla-series cards, and general-purpose manycore CPUs. Finally, we develop a performance model for our SEM-accelerator, which we use to project the performance and role of future FPGAs to accelerate CFD applications, ultimately answering the question: what characteristics would a perfect FPGA for CFD applications have?

  • 47.
    Karp, Martin
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Kenter, Tobias
    Paderborn University.
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Plessl, Christian
    Paderborn University.
    Schlatter, Philipp
    KTH, School of Engineering Sciences (SCI), Engineering Mechanics, Fluid Mechanics and Engineering Acoustics.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    A High-Fidelity Flow Solver for Unstructured Meshes on Field-Programmable Gate Arrays: Design, Evaluation, and Future Challenges. 2022. In: HPCAsia2022: International Conference on High Performance Computing in Asia-Pacific Region, Association for Computing Machinery (ACM), 2022, p. 125-136. Conference paper (Refereed)
    Abstract [en]

    The impending termination of Moore’s law motivates the search for new forms of computing to continue the performance scaling we have grown accustomed to. Among the many emerging Post-Moore computing candidates, perhaps none is as salient as the Field-Programmable Gate Array (FPGA), which offers the means of specializing and customizing the hardware to the computation at hand.

    In this work, we design a custom FPGA-based accelerator for a computational fluid dynamics (CFD) code. Unlike prior work, which often focuses on accelerating small kernels, we target the entire Poisson solver on unstructured meshes based on the high-fidelity spectral element method (SEM) used in modern state-of-the-art CFD systems. We model our accelerator using an analytical performance model based on the I/O cost of the algorithm. We empirically evaluate our accelerator on a state-of-the-art Intel Stratix 10 FPGA in terms of performance and power consumption and contrast it against existing solutions on general-purpose processors (CPUs). Finally, we propose a data movement-reducing technique where we compute geometric factors on the fly, which yields significant (700+ Gflop/s) single-precision performance and a reduction in runtime of upwards of 2x for the local evaluation of the Laplace operator.

    We end the paper by discussing the challenges and opportunities of using reconfigurable architecture in the future, particularly in the light of emerging (not yet available) technologies.

  • 48.
    Karp, Martin
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Suarez, Estela
    Forschungszentrum Julich GmbH, Juelich Supercomputing Centre; Rheinische Friedrich-Wilhelms-Universität Bonn, Institut für Informatik.
    Meinke, Jan
    Forschungszentrum Julich GmbH, Juelich Supercomputing Centre.
    Andersson, Måns
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Schlatter, Philipp
    KTH, School of Engineering Sciences (SCI), Engineering Mechanics, Fluid Mechanics and Engineering Acoustics, Turbulent simulations laboratory.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Experience and Analysis of Scalable High-Fidelity Computational Fluid Dynamics on Modular Supercomputing Architectures. Manuscript (preprint) (Other academic)
    Abstract [en]

    The never-ending computational demand from simulations of turbulence makes computational fluid dynamics (CFD) a prime application use case for current and future exascale systems. High-order finite element methods, such as the spectral element method, have been gaining traction as they offer high performance on both multicore CPUs and modern GPU-based accelerators. In this work, we assess how high-fidelity CFD using the spectral element method can exploit the modular supercomputing architecture at scale through domain partitioning, where the computational domain is split between GPUs and CPUs. We investigate several different flow cases and computer systems based on the MSA. We observe that for our simulations, the communication overhead and load balancing issues incurred by incorporating different computing architectures are seldom worthwhile, especially when I/O is also considered, but when the simulation at hand requires more than the combined global memory on the GPUs, utilizing additional CPUs to increase the available memory can be fruitful. We support our results with a simple performance model to assess when running across modules might be beneficial. For a smaller supercomputer where the computation takes significant amounts of time on the CPU module, it can be beneficial to also use a GPU module to decrease the execution time significantly.

  • 49.
    Khan, Monsurul
    et al.
    Purdue Univ, Dept Mech Engn, Indiana, PA 47905 USA..
    More, Rishabh V.
    Purdue Univ, Dept Mech Engn, Indiana, PA 47905 USA.;MIT, Dept Mech Engn, Cambridge, MA 02139 USA..
    Banaei, Arash Alizad
    KTH, School of Engineering Sciences (SCI), Centres, Linné Flow Center, FLOW. KTH, Centres, SeRC - Swedish e-Science Research Centre. KTH, School of Engineering Sciences (SCI), Engineering Mechanics. KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Brandt, Luca
    KTH, School of Engineering Sciences (SCI), Centres, Linné Flow Center, FLOW. KTH, Centres, SeRC - Swedish e-Science Research Centre. KTH, School of Engineering Sciences (SCI), Engineering Mechanics, Fluid Mechanics and Engineering Acoustics.
    Ardekani, Arezoo M.
    Purdue Univ, Dept Mech Engn, Indiana, PA 47905 USA..
    Rheology of concentrated fiber suspensions with a load-dependent friction coefficient. 2023. In: Physical Review Fluids, E-ISSN 2469-990X, Vol. 8, no 4, article id 044301. Article in journal (Refereed)
    Abstract [en]

    We investigate the effects of fiber aspect ratio, roughness, flexibility, and volume fraction on the rheology of concentrated suspensions in a steady shear flow using direct numerical simulations. We model the fibers as inextensible continuous flexible slender bodies with the Euler-Bernoulli beam equation governing their dynamics suspended in an incompressible Newtonian fluid. The fiber dynamics and fluid flow coupling is achieved using the immersed boundary method. In addition, the fiber surface roughness might lead to interfiber contacts, resulting in normal and tangential forces between the fibers, which follow Coulomb's law of friction. The surface roughness is modeled as hemispherical protrusions on the fiber surfaces. Their deformation results in a normal load-dependent friction coefficient. Our simulations accurately predict the experimentally observed shear thinning in fiber suspensions. Furthermore, we find that the suspension viscosity η increases with increasing the volume fraction, roughness, fiber rigidity, and aspect ratio. The increase in η is the macroscopic manifestation of a similar increase in the microscopic contact contribution to the total stress with these parameters. In addition, we observe positive and negative first N1 and second N2 normal stress differences, respectively, with |N2| < |N1|, in agreement with previous experiments. Last, we propose a modified Maron-Pierce law to quantify the reduction in the jamming volume fraction by increasing the fiber aspect ratio and roughness. Our results and analysis establish the use of fiber surface tribology to tune the suspension flow behavior.

  • 50.
    Kjellsson Lindblom, Tor
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC. OsloMet Oslo Metropolitan Univ, Fac Technol Art & Design, NO-0130 Oslo, Norway..
    Forre, Morten
    Univ Bergen, Dept Phys & Technol, NO-5020 Bergen, Norway..
    Lindroth, Eva
    Stockholm Univ, Dept Phys, SE-10691 Stockholm, Sweden..
    Selsto, Solve
    OsloMet Oslo Metropolitan Univ, Fac Technol Art & Design, NO-0130 Oslo, Norway..
    Relativistic effects in photoionizing a circular Rydberg state in the optical regime. 2020. In: Physical Review A: covering atomic, molecular, and optical physics and quantum information, ISSN 2469-9926, E-ISSN 2469-9934, Vol. 102, no 6, article id 063108. Article in journal (Refereed)
    Abstract [en]

    We study the photoionization process of a hydrogen atom initially prepared in a circular Rydberg state. The atom is exposed to a two-cycle laser pulse with a central wavelength of 800 nm. Before the atom approaches saturation, at field intensities of the order of 10^17 W/cm^2, relativistic corrections to the ionization probability are clearly seen. The ionization is predominantly driven by the radiation pressure in the propagation direction of the laser field, not by the electric field. Direct comparisons with the full numerical solution of the time-dependent Dirac equation demonstrate quantitative agreement with a semirelativistic approximation, which is considerably easier to implement.
