kth.se Publications
1 - 50 of 54
  • 1.
    Adhi, Boma
    et al.
    Center for Computational Science (R-CCS), RIKEN, Japan.
    Cortes, Carlos
    Center for Computational Science (R-CCS), RIKEN, Japan.
    Sozzo, Emanuele Del
    Center for Computational Science (R-CCS), RIKEN, Japan.
    Ueno, Tomohiro
    Center for Computational Science (R-CCS), RIKEN, Japan.
    Tan, Yiyu
    Iwate University, Faculty of Science and Engineering, Japan.
    Kojima, Takuya
    Center for Computational Science (R-CCS), RIKEN, Japan; The University of Tokyo, Graduate School of Information Science and Technology, Japan.
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Sano, Kentaro
    Center for Computational Science (R-CCS), RIKEN, Japan.
    Less for more: Reducing intra-CGRA connectivity for higher performance and efficiency in HPC (2023). In: 2023 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2023, Institute of Electrical and Electronics Engineers (IEEE), 2023, p. 452-459. Conference paper (Refereed)
    Abstract [en]

    Coarse-Grained Reconfigurable Arrays (CGRAs) are a class of reconfigurable architectures that inherit the performance of domain-specific accelerators and the reconfigurability aspects of Field-Programmable Gate Arrays (FPGAs). Historically, CGRAs have been successfully used to accelerate embedded applications and are now considered to accelerate High-Performance Computing (HPC) applications in future supercomputers. However, embedded systems and supercomputers are two vastly different domains with different applications and constraints, and it is today not fully understood what CGRA design decisions adequately cater to the HPC market. One such unknown design decision is the interconnect that facilitates intra-CGRA communication. Our findings show that even the typical king-style mesh-like topology is often under-utilized by a typical HPC workload, leading to inefficiency. This research aims to explore the provisioning of the intra-CGRA interconnect for HPC-oriented workloads and, ultimately, to recoup the potential performance and efficiency lost by reducing the interconnect complexity. We propose several reduced interconnect topologies based on the observed usage statistics and evaluate the tradeoffs in hardware cost, routability of DFGs, and computational throughput.

  • 2.
    Adhi, Boma
    et al.
    RIKEN, Ctr Computat Sci R CCS, Wako, Saitama, Japan..
    Cortes, Carlos
    RIKEN, Ctr Computat Sci R CCS, Wako, Saitama, Japan..
    Tan, Yiyu
    RIKEN, Ctr Computat Sci R CCS, Wako, Saitama, Japan..
    Kojima, Takuya
    RIKEN, Ctr Computat Sci R CCS, Wako, Saitama, Japan.;Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan..
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Sano, Kentaro
    RIKEN, Ctr Computat Sci R CCS, Wako, Saitama, Japan..
    Exploration Framework for Synthesizable CGRAs Targeting HPC: Initial Design and Evaluation (2022). In: 2022 IEEE 36th International Parallel and Distributed Processing Symposium Workshops (IPDPSW 2022), Institute of Electrical and Electronics Engineers (IEEE), 2022, p. 639-646. Conference paper (Refereed)
    Abstract [en]

    Among the more salient accelerator technologies to continue performance scaling in High-Performance Computing (HPC) are Coarse-Grained Reconfigurable Arrays (CGRAs). However, what benefits CGRAs will bring to HPC workloads and how those benefits will be reaped is an open research question today. In this work, we propose a framework to explore the design space of CGRAs for HPC workloads, which includes a tool flow of compilation and simulation, a CGRA HDL library written in SystemVerilog, and a synthesizable CGRA design as a baseline. Using RTL simulation, we evaluate two well-known computation kernels with the baseline CGRA for multiple different architectural parameters. The simulation results demonstrate both correctness and usefulness of our exploration framework.

  • 3.
    Adhi, Boma
    et al.
    RIKEN, Ctr Computat Sci R CCS, Kobe, Hyogo, Japan..
    Cortes, Carlos
    RIKEN, Ctr Computat Sci R CCS, Kobe, Hyogo, Japan..
    Tan, Yiyu
    Iwate Univ, Dept Syst Innovat Engn, Sci & Engn, Morioka, Iwate, Japan..
    Kojima, Takuya
    RIKEN, Ctr Computat Sci R CCS, Kobe, Hyogo, Japan.;Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan..
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Sano, Kentaro
    RIKEN, Ctr Computat Sci R CCS, Kobe, Hyogo, Japan..
    The Cost of Flexibility: Embedded versus Discrete Routers in CGRAs for HPC (2022). In: 2022 IEEE International Conference on Cluster Computing (CLUSTER 2022), Institute of Electrical and Electronics Engineers (IEEE), 2022, p. 347-356. Conference paper (Refereed)
    Abstract [en]

    Coarse-Grained Reconfigurable Arrays (CGRAs) are a class of reconfigurable architectures that inherit the performance and usability properties of Central Processing Units (CPUs) and the reconfigurability aspects of Field-Programmable Gate Arrays (FPGAs). Historically, CGRAs have been successfully used to accelerate embedded applications and are today also being considered to accelerate High-Performance Computing (HPC) applications in future supercomputers. However, embedded systems and supercomputers are two vastly different domains with different applications and constraints, and it is today not fully understood what CGRA design decisions adequately cater to the HPC market. One such unknown design decision is regarding the interconnect that facilitates intra-CGRA communication. Today, intra-CGRA communication comes in two flavors: using routers closely embedded into the compute units or using discrete routers outside the compute units. The former trades flexibility for a reduction in hardware cost, while the latter has greater flexibility but is more resource-hungry. In this paper, we aspire to understand which of the two designs best suits the CGRA HPC segment. We extend our previous methodology, which consists of both a parameterized CGRA design and an OpenMP-capable compiler, to accommodate both types of routing designs, including verification tests using RTL simulation. Our results show that the discrete router design can facilitate better use of processing elements (PEs) compared to embedded routers and can achieve up to a 79.27% reduction in unnecessary PE occupancy for an aggressively unrolled stencil kernel on an 18 x 16 CGRA, at an (estimated) hardware resource overhead cost of 6.3x. This reduction in PE occupancy can be used, for example, to exploit instruction-level parallelism (ILP) through even more aggressive unrolling.

  • 4.
    Adhi, Boma
    et al.
    Center for Computational Science (R-CCS), RIKEN, Japan.
    Cortes, Carlos
    Center for Computational Science (R-CCS), RIKEN, Japan.
    Ueno, Tomohiro
    Center for Computational Science (R-CCS), RIKEN, Japan.
    Tan, Yiyu
    Iwate University, Department of Systems Innovation Engineering, Japan.
    Kojima, Takuya
    Graduate School of Information Science and Technology, The University of Tokyo, Japan.
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Sano, Kentaro
    Center for Computational Science (R-CCS), RIKEN, Japan.
    Exploring Inter-tile Connectivity for HPC-oriented CGRA with Lower Resource Usage (2022). In: FPT 2022: 21st International Conference on Field-Programmable Technology, Proceedings, Institute of Electrical and Electronics Engineers (IEEE), 2022. Conference paper (Refereed)
    Abstract [en]

    This research aims to explore the tradeoffs between routing flexibility and hardware resource usage, ultimately reducing the resource usage of our CGRA architecture while maintaining compute efficiency. We investigate statistics of connection usage among switch blocks for benchmark DFGs, propose several CGRA architectures with reduced connections, and evaluate their hardware cost, routability of DFGs, and computational throughput for benchmarks. We found that the topology with horizontal plus diagonal connections saves about 30% of the resource usage while maintaining virtually the same routing flexibility as the full-connectivity topology.

  • 5.
    Alexandru, Iordan
    et al.
    Norwegian University of Science and Technology, Trondheim.
    Podobas, Artur
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastructure, Software and Computer Systems, SCS.
    Natvig, Lasse
    Norwegian University of Science and Technology, Trondheim.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastructure, Software and Computer Systems, SCS.
    Investigating the Potential of Energy-savings Using a Fine-grained Task Based Programming Model on Multi-cores (2011). Conference paper (Refereed)
    Abstract [en]

    In this paper we study the relation between energy-efficiency and parallel execution when implemented with a fine-grained task-centric programming model. Using a simulation framework comprised of an architectural simulator and a power and area estimation tool, we have investigated the potential energy savings when employing parallelism on multi-core systems. In our experiments with multi-core systems of 2 - 8 cores, we employed frequency and voltage scaling in order to keep the relative performance of the systems constant and measured the energy-efficiency using the energy-delay product. Also, we compared the energy consumption of the parallel execution against the serial one. Our results show that, through judicious choice of load balancing parameters, significant improvements of around 200% in energy consumption can be achieved.

    Download full text (pdf)
    iordan-podobas-a4mmc-2011.pdf
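
    The energy-efficiency metric referred to above, the energy-delay product (EDP), together with the standard CMOS dynamic-power relation that motivates combining parallelism with voltage and frequency scaling, can be written as follows (a textbook formulation, not the paper's exact model):

        \mathrm{EDP} = E \cdot t, \qquad
        P_{\mathrm{dyn}} \approx \alpha C V^{2} f
        \;\Rightarrow\;
        E \approx \alpha C V^{2} \cdot N_{\mathrm{cycles}}

    Spreading a fixed amount of work over more cores running at lower frequency and voltage therefore reduces E roughly with V^2, while the scaling is chosen so that the delay t (the relative performance) stays constant, which is the experimental setup described in the abstract.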
  • 6.
    Andersson, Måns
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Natarajan Arul, Murugan
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Breaking Down the Parallel Performance of GROMACS, a High-Performance Molecular Dynamics Software (2023). In: PPAM 2022, Lecture Notes in Computer Science, vol. 13826, Springer Nature, 2023, p. 333-345. Conference paper (Refereed)
    Abstract [en]

    GROMACS is one of the most widely used HPC software packages using the Molecular Dynamics (MD) simulation technique. In this work, we quantify GROMACS parallel performance using different configurations, HPC systems, and FFT libraries (FFTW, Intel MKL FFT, and FFT PACK). We break down the cost of each GROMACS computational phase and identify non-scalable stages, such as MPI communication during the 3D FFT computation when using a large number of processes. We show that the Particle-Mesh Ewald phase and the 3D FFT calculation significantly impact the GROMACS performance. Finally, we discuss performance opportunities with a particular interest in developing GROMACS for the FFT calculations.

  • 7. Bonnichsen, L.
    et al.
    Podobas, Artur
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Using transactional memory to avoid blocking in OpenMP synchronization directives: Don't wait, speculate! (2015). In: 11th International Workshop on OpenMP, IWOMP 2015, Springer, 2015, p. 149-161. Conference paper (Refereed)
    Abstract [en]

    OpenMP applications with abundant parallelism are often characterized by their high performance. Unfortunately, OpenMP applications with a lot of synchronization or serialization points perform poorly because of blocking, i.e. the threads have to wait for each other. In this paper, we present methods based on hardware transactional memory (HTM) for executing OpenMP barrier, critical, and taskwait directives without blocking. Although HTM is still relatively new in the Intel and IBM architectures, we experimentally show a 73% performance improvement over traditional locking approaches, and 23% better performance than other HTM approaches on critical sections. Speculation over barriers can decrease execution time by up to 41%. We expect that future systems with HTM support and more cores will have a greater benefit from our approach as they are more likely to block.
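
    The lock-elision pattern that such HTM-based speculation builds on can be sketched in C++ with Intel RTM intrinsics as below; the retry policy, the fallback spinlock, and the shared-counter body are illustrative assumptions, not the paper's OpenMP runtime implementation.

        // Hedged sketch: speculatively execute a critical section as a hardware
        // transaction and fall back to a lock only after repeated aborts.
        // Build (assumed): g++ -O2 -mrtm -fopenmp htm_sketch.cpp
        #include <immintrin.h>   // _xbegin, _xend, _xabort, _XBEGIN_STARTED
        #include <atomic>
        #include <cstdio>

        static std::atomic<int> fallback_lock{0};   // 0 = free, 1 = held
        static long shared_counter = 0;

        static void lock_fallback()   { int e = 0; while (!fallback_lock.compare_exchange_weak(e, 1)) e = 0; }
        static void unlock_fallback() { fallback_lock.store(0); }

        template <typename F>
        void speculative_critical(F body, int max_retries = 3) {
            for (int attempt = 0; attempt < max_retries; ++attempt) {
                unsigned status = _xbegin();
                if (status == _XBEGIN_STARTED) {
                    // Reading the lock adds it to the read-set: if another thread
                    // takes the fallback path, this transaction aborts.
                    if (fallback_lock.load() != 0) _xabort(0xff);
                    body();
                    _xend();                         // commit: no blocking occurred
                    return;
                }
                // Aborted (conflict, capacity, ...): retry a few times, then give up.
            }
            lock_fallback();                         // non-speculative fallback
            body();
            unlock_fallback();
        }

        int main() {
            #pragma omp parallel for
            for (int i = 0; i < 100000; ++i)
                speculative_critical([] { ++shared_counter; });  // instead of omp critical
            std::printf("counter = %ld\n", shared_counter);
        }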

  • 8.
    Brown, Nick
    et al.
    Univ Edinburgh, EPCC, Edinburgh, Midlothian, Scotland..
    Nash, Rupert
    Univ Edinburgh, EPCC, Edinburgh, Midlothian, Scotland..
    Gibb, Gordon
    Univ Edinburgh, EPCC, Edinburgh, Midlothian, Scotland..
    Belikov, Evgenij
    Univ Edinburgh, EPCC, Edinburgh, Midlothian, Scotland..
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Chien, Wei Der
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Flatken, Markus
    German Aerosp Ctr DLR, Braunschweig, Germany..
    Gerndt, Andreas
    German Aerosp Ctr DLR, Braunschweig, Germany..
    Workflows to Driving High-Performance Interactive Supercomputing for Urgent Decision Making (2022). In: High Performance Computing, ISC High Performance 2022 International Workshops, ed. H. Anzt, A. Bienz, P. Luszczek, M. Baboulin, Springer Nature, 2022, Vol. 13387, p. 233-244. Conference paper (Refereed)
    Abstract [en]

    Interactive urgent computing is a small but growing user of supercomputing resources. However, there are numerous technical challenges that must be overcome to make supercomputers fully suited to the wide range of urgent workloads which could benefit from the computational power delivered by such instruments. An important question is how to connect the different components of an urgent workload, namely the users, the simulation codes, and external data sources, together in a structured and accessible manner. In this paper we explore the role of workflows both from the perspective of marshalling and controlling urgent workloads and at the individual HPC machine level, which ultimately requires two workflow systems. Using a space weather prediction urgent use-case, we explore the benefits that these two workflow systems provide, especially when one exploits the flexibility enabled by their interoperation.

  • 9.
    Brown, Nick
    et al.
    Univ Edinburgh, EPCC, Edinburgh, Midlothian, Scotland..
    Nash, Rupert
    Univ Edinburgh, EPCC, Edinburgh, Midlothian, Scotland..
    Poletti, Piero
    Bruno Kessler Fdn, Trento, Italy..
    Guzzetta, Giorgio
    Bruno Kessler Fdn, Trento, Italy..
    Manica, Mattia
    Bruno Kessler Fdn, Trento, Italy..
    Zardini, Agnese
    Bruno Kessler Fdn, Trento, Italy..
    Flatken, Markus
    German Aerosp Ctr DLR, Braunschweig, Germany..
    Vidal, Jules
    Sorbonne Univ, Paris, France..
    Gueunet, Charles
    Kitware, Lyon, France..
    Belikov, Evgenij
    Univ Edinburgh, EPCC, Edinburgh, Midlothian, Scotland..
    Tierny, Julien
    Sorbonne Univ, Paris, France..
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Chien, Wei Der
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Gerndt, Andreas
    German Aerosp Ctr DLR, Braunschweig, Germany..
    Utilising urgent computing to tackle the spread of mosquito-borne diseases (2021). In: Proceedings of UrgentHPC 2021: The Third International Workshop on HPC for Urgent Decision Making, Institute of Electrical and Electronics Engineers (IEEE), 2021, p. 36-44. Conference paper (Refereed)
    Abstract [en]

    It is estimated that around 80% of the world's population live in areas susceptible to at least one major vector-borne disease, and approximately 20% of global communicable diseases are spread by mosquitoes. Furthermore, the outbreaks of such diseases are becoming more common and widespread, with much of this driven in recent years by socio-demographic and climatic factors. These trends are causing significant worry to global health organisations, including the CDC and WHO, and so an important question is the role that technology can play in addressing them. In this work we describe the integration of an epidemiology model, which simulates the spread of mosquito-borne diseases, with the VESTEC urgent computing ecosystem. The intention of this work is to empower human health professionals to exploit this model and more easily explore the progression of mosquito-borne diseases. Traditionally the domain of a few research scientists, by leveraging state-of-the-art visualisation and analytics techniques, all supported by running the computational workloads on HPC machines in a seamless fashion, we demonstrate the significant advantages that such an integration can provide. Furthermore, we demonstrate the benefits of using an ecosystem such as VESTEC, which provides a framework for urgent computing, in supporting the easy adoption of these technologies by epidemiologists and disaster response professionals more widely.

  • 10.
    Chien, Steven W. D.
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Peng, I. B.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Tf-Darshan: Understanding Fine-grained I/O Performance in Machine Learning Workloads (2020). In: Proceedings - IEEE International Conference on Cluster Computing, ICCC, Institute of Electrical and Electronics Engineers Inc., 2020, p. 359-370. Conference paper (Refereed)
    Abstract [en]

    Machine Learning applications on HPC systems have been gaining popularity in recent years. The upcoming large scale systems will offer tremendous parallelism for training through GPUs. However, another heavy aspect of Machine Learning is I/O, and this can potentially be a performance bottleneck. TensorFlow, one of the most popular Deep-Learning platforms, now offers a new profiler interface and allows instrumentation of TensorFlow operations. However, the current profiler only enables analysis at the TensorFlow platform level and does not provide system-level information. In this paper, we extend TensorFlow Profiler and introduce tf-Darshan, both a profiler and tracer, that performs instrumentation through Darshan. We use the same Darshan shared instrumentation library and implement a runtime attachment without using a system preload. We can extract Darshan profiling data structures during TensorFlow execution to enable analysis through the TensorFlow profiler. We visualize the performance results through TensorBoard, the web-based TensorFlow visualization tool. At the same time, we do not alter Darshan's existing implementation. We illustrate tf-Darshan by performing two case studies on ImageNet image and Malware classification. We show that by guiding optimization using data from tf-Darshan, we increase POSIX I/O bandwidth by up to 19% by selecting data for staging on fast tier storage. We also show that Darshan has the potential of being used as a runtime library for profiling and providing information for future optimization.

  • 11.
    Chien, Steven W.D.
    et al.
    University of Edinburgh, United Kingdom.
    Sato, Kento
    RIKEN Center for Computational Science Japan.
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Software and Computer systems, SCS.
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Honda, Michio
    University of Edinburgh, United Kingdom.
    Improving Cloud Storage Network Bandwidth Utilization of Scientific Applications (2023). In: Proceedings of the 7th Asia-Pacific Workshop on Networking, APNET 2023, Association for Computing Machinery (ACM), 2023, p. 172-173. Conference paper (Refereed)
    Abstract [en]

    Cloud providers began to provide managed services to attract scientific applications, which have been traditionally executed on supercomputers. One example is AWS FSx for Lustre, a fully managed parallel file system (PFS) released in 2018. However, due to the nature of scientific applications, the frontend storage network bandwidth is left completely idle for the majority of its lifetime. Furthermore, the pricing model does not match the scalability requirement. We propose iFast, a novel host-side caching mechanism for scientific applications that improves storage bandwidth utilization and end-to-end application performance by overlapping compute and data writeback through inexpensive local storage. iFast supports the Message Passing Interface (MPI) library that is widely used by scientific applications and is implemented as a preloaded library. It requires no change to applications, the MPI library, or support from cloud operators. We demonstrate how iFast can accelerate the end-to-end time of a representative scientific application, Neko, by 13-40%.
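
    The preloaded-library mechanism described above can be illustrated with a minimal LD_PRELOAD shim around MPI_File_write_all, the interception point where a cache layer could stage data to local storage. The shim below only times and forwards the call; the build and run commands are assumptions, and none of this is iFast's actual code.

        #define _GNU_SOURCE 1
        #include <dlfcn.h>   // dlsym, RTLD_NEXT
        #include <mpi.h>
        #include <cstdio>

        // Signature of the real MPI routine being interposed.
        using write_all_fn = int (*)(MPI_File, const void *, int, MPI_Datatype, MPI_Status *);

        extern "C" int MPI_File_write_all(MPI_File fh, const void *buf, int count,
                                          MPI_Datatype datatype, MPI_Status *status) {
            static write_all_fn real_write_all =
                reinterpret_cast<write_all_fn>(dlsym(RTLD_NEXT, "MPI_File_write_all"));

            // A host-side cache would copy `buf` to inexpensive local storage here,
            // return immediately, and write back to the PFS in the background.
            double t0 = MPI_Wtime();
            int rc = real_write_all(fh, buf, count, datatype, status);
            double t1 = MPI_Wtime();
            std::fprintf(stderr, "[shim] MPI_File_write_all: %d elements in %.3f s\n",
                         count, t1 - t0);
            return rc;
        }
        // Build (assumed): mpicxx -shared -fPIC -o libshim.so shim.cpp -ldl
        // Run (assumed):   LD_PRELOAD=./libshim.so mpirun -n 4 ./app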

  • 12.
    Chien, Wei Der
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Nylund, Jonas
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Bengtsson, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Peng, I. B.
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    SputniPIC: An implicit particle-in-cell code for multi-GPU systems (2020). In: Proceedings - Symposium on Computer Architecture and High Performance Computing, IEEE Computer Society, 2020, p. 149-156. Conference paper (Refereed)
    Abstract [en]

    Large-scale simulations of plasmas are essential for advancing our understanding of fusion devices, space, and astrophysical systems. Particle-in-Cell (PIC) codes have demonstrated their success in simulating numerous plasma phenomena on HPC systems. Today, flagship supercomputers feature multiple GPUs per compute node to achieve unprecedented computing power at high power efficiency. PIC codes require new algorithm design and implementation for exploiting such accelerated platforms. In this work, we design and optimize a three-dimensional implicit PIC code, called sputniPIC, to run on a general multi-GPU compute node. We introduce a particle decomposition data layout, in contrast to domain decomposition on CPU-based implementations, to use particle batches for overlapping communication and computation on GPUs. sputniPIC also natively supports different precision representations to achieve speed up on hardware that supports reduced precision. We validate sputniPIC through the well-known GEM challenge and provide performance analysis. We test sputniPIC on three multi-GPU platforms and report a 200-800x performance improvement with respect to the sputniPIC CPU OpenMP version performance. We show that reduced precision could further improve performance by 45% to 80% on the three platforms. Because of these performance improvements, on a single node with multiple GPUs, sputniPIC enables large-scale three-dimensional PIC simulations that were only possible using clusters.
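
    One of the design points mentioned above, natively supporting different precision representations, can be illustrated by templating the particle storage and mover over the floating-point type. The free-streaming update below is a deliberately simplified placeholder, not sputniPIC's implicit, GPU-resident mover.

        #include <cstdio>
        #include <vector>

        template <typename Real>
        struct ParticleBatch {
            std::vector<Real> x, y, z;   // positions
            std::vector<Real> u, v, w;   // velocities
        };

        // The same mover source compiles for double or reduced (single) precision.
        template <typename Real>
        void push(ParticleBatch<Real>& p, Real dt) {
            for (std::size_t i = 0; i < p.x.size(); ++i) {
                p.x[i] += p.u[i] * dt;   // a PIC mover would also gather E/B fields here
                p.y[i] += p.v[i] * dt;
                p.z[i] += p.w[i] * dt;
            }
        }

        int main() {
            ParticleBatch<float>  sp{{0.f}, {0.f}, {0.f}, {1.f}, {0.f}, {0.f}};
            ParticleBatch<double> dp{{0.0}, {0.0}, {0.0}, {1.0}, {0.0}, {0.0}};
            push(sp, 0.1f);
            push(dp, 0.1);
            std::printf("x(single) = %f, x(double) = %f\n", sp.x[0], dp.x[0]);
        }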

  • 13.
    Chien, Wei Der
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Svedin, Martin
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Tkachuk, Andriy
    Seagate Systems UK.
    El Sayed, Salem
    Jülich Supercomputing Centre, Forschungszentrum Jülich.
    Herman, Pawel
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Umanesan, Ganesan
    Seagate Systems UK.
    Narasimhamurthy, Sai
    Seagate Systems UK.
    Markidis, Stefano
    KTH, Centres, SeRC - Swedish e-Science Research Centre. KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    NoaSci: A Numerical Object Array Library for I/O of Scientific Applications on Object Storage (2022). In: 2022 30th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, Institute of Electrical and Electronics Engineers (IEEE), 2022. Conference paper (Refereed)
    Abstract [en]

    Strong consistency and stateful workflows are seen as major factors limiting parallel I/O performance because of the need for locking and state management. While the POSIX-based I/O model dominates modern HPC storage infrastructure, emerging object storage technology can potentially improve I/O performance by eliminating these bottlenecks. Despite a wide deployment on the cloud, its adoption in HPC remains low. We argue one reason is the lack of a suitable programming interface for parallel I/O in scientific applications. In this work, we introduce NoaSci, a Numerical Object Array library for scientific applications. NoaSci supports different data formats (e.g. HDF5, binary), and focuses on supporting node-local burst buffers and object stores. We demonstrate for the first time how scientific applications can perform parallel I/O on Seagate's Motr object store through NoaSci. We evaluate NoaSci's preliminary performance using the iPIC3D space weather application and position it against existing I/O methods.

  • 14.
    Domke, Jens
    et al.
    RIKEN, CCS, Kobe, Hyogo, Japan.;Tokyo Inst Technol, Tokyo, Japan..
    Vatai, Emil
    RIKEN, CCS, Kobe, Hyogo, Japan.;Tokyo Inst Technol, Tokyo, Japan..
    Drozd, Aleksandr
    RIKEN, CCS, Kobe, Hyogo, Japan.;Tokyo Inst Technol, Tokyo, Japan..
    Chen, Peng
    Natl Inst Adv Ind Sci & Technol, Tokyo, Japan..
    Oyama, Yosuke
    Tokyo Inst Technol, Tokyo, Japan..
    Zhang, Lingqi
    Tokyo Inst Technol, Tokyo, Japan..
    Salaria, Shweta
    RIKEN, CCS, Kobe, Hyogo, Japan.;Tokyo Inst Technol, Tokyo, Japan..
    Mukunoki, Daichi
    RIKEN, CCS, Kobe, Hyogo, Japan..
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Wahib, Mohamed
    RIKEN, CCS, Kobe, Hyogo, Japan.;Natl Inst Adv Ind Sci & Technol, Tokyo, Japan..
    Matsuoka, Satoshi
    RIKEN, CCS, Kobe, Hyogo, Japan.;Tokyo Inst Technol, Tokyo, Japan..
    Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws? (2021). In: 2021 IEEE 35th International Parallel and Distributed Processing Symposium (IPDPS), Institute of Electrical and Electronics Engineers (IEEE), 2021, p. 1056-1065. Conference paper (Refereed)
    Abstract [en]

    Matrix engines or units, in different forms and affinities, are becoming a reality in modern processors; CPUs and otherwise. The current and dominant algorithmic approach to Deep Learning merits the commercial investments in these units, and deduced from the No.1 benchmark in supercomputing, namely High Performance Linpack, one would expect an awakened enthusiasm by the HPC community, too. Hence, our goal is to identify the practical added benefits for HPC and machine learning applications by having access to matrix engines. For this purpose, we perform an in-depth survey of software stacks, proxy applications and benchmarks, and historical batch job records. We provide a cost-benefit analysis of matrix engines, both asymptotically and in conjunction with state-of-the-art processors. While our empirical data will temper the enthusiasm, we also outline opportunities to misuse these dense matrix-multiplication engines if they come for free.
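
    The asymptotic side of such a cost-benefit analysis follows the familiar Amdahl argument (a generic bound, not the paper's exact model): if a fraction f of an application's runtime is dense matrix multiplication that a matrix engine accelerates by a factor s, the whole-application speedup is

        S(f, s) = \frac{1}{(1 - f) + f/s},
        \qquad
        \lim_{s \to \infty} S(f, s) = \frac{1}{1 - f}

    so, for example, f = 0.10 caps the achievable speedup at about 1.11x no matter how fast the engine is, which is why the fraction of GEMM-like work in real HPC workloads is the decisive quantity.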

  • 15.
    Domke, Jens
    et al.
    RIKEN Center for Computational Science, 7-1-26 Minatojima-minamimachi, Chuo-ku, Kobe, Hyogo 650-0047, Japan.
    Vatai, Emil
    RIKEN Center for Computational Science, 7-1-26 Minatojima-minamimachi, Chuo-ku, Kobe, Hyogo 650-0047, Japan.
    Gerofi, Balazs
    Intel Corporation, 2111 NE 25th Ave, Hillsboro, Oregon 97124, United States.
    Kodama, Yuetsu
    RIKEN Center for Computational Science, 7-1-26 Minatojima-minamimachi, Chuo-ku, Kobe, Hyogo 650-0047, Japan.
    Wahib, Mohamed
    RIKEN Center for Computational Science, 7-1-26 Minatojima-minamimachi, Chuo-ku, Kobe, Hyogo 650-0047, Japan.
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Software and Computer systems, SCS. KTH Royal Institute of Technology, Brinellvägen 8, 114 28 Stockholm, Sweden.
    Mittal, Sparsh
    Indian Institute of Technology, Roorkee - Haridwar Highway, Roorkee, Uttarakhand, India.
    Pericàs, Miquel
    Chalmers University of Technology, Chalmersplatsen 4, 412 96 Göteborg, Sweden.
    Zhang, Lingqi
    Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo, Japan.
    Chen, Peng
    National Institute of Advanced Industrial Science and Technology, 1-8-31 Midorigaoka, Ikeda, Osaka 563-0026, Japan.
    Drozd, Aleksandr
    RIKEN Center for Computational Science, 7-1-26 Minatojima-minamimachi, Chuo-ku, Kobe, Hyogo 650-0047, Japan.
    Matsuoka, Satoshi
    RIKEN Center for Computational Science, 7-1-26 Minatojima-minamimachi, Chuo-ku, Kobe, Hyogo 650-0047, Japan.
    At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads (2023). In: ACM Transactions on Architecture and Code Optimization (TACO), ISSN 1544-3566, E-ISSN 1544-3973, Vol. 20, no. 4, article id 57. Article in journal (Refereed)
    Abstract [en]

    Over the last three decades, innovations in the memory subsystem were primarily targeted at overcoming the data movement bottleneck. In this paper, we focus on a specific market trend in memory technology: 3D-stacked memory and caches. We investigate the impact of extending the on-chip memory capabilities in future HPC-focused processors, particularly by 3D-stacked SRAM. First, we propose a method oblivious to the memory subsystem to gauge the upper-bound in performance improvements when data movement costs are eliminated. Then, using the gem5 simulator, we model two variants of a hypothetical LARge Cache processor (LARC), fabricated in 1.5 nm and enriched with high-capacity 3D-stacked cache. With a volume of experiments involving a broad set of proxy-applications and benchmarks, we aim to reveal how HPC CPU performance will evolve, and conclude an average boost of 9.56× for cache-sensitive HPC applications, on a per-chip basis. Additionally, we exhaustively document our methodological exploration to motivate HPC centers to drive their own technological agenda through enhanced co-design.

  • 16.
    Dykes, Tim
    et al.
    HPE HPC/AI EMEA Research Lab.
    Foyer, Clément
    HPE HPC/AI EMEA Research Lab, HPC Research Group, Univ. of Bristol.
    Richardson, Harvey
    HPE HPC/AI EMEA Research Lab.
    Svedin, Martin
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Tate, Adrian
    Numerical Algorithms Group Ltd. (NAG).
    McIntosh-Smith, Simon
    HPC Research Group, Univ. of Bristol.
    Mamba: Portable Array-based Abstractions for Heterogeneous High-Performance Systems (2021). In: 2021 International Workshop on Performance, Portability and Productivity in HPC (P3HPC), Institute of Electrical and Electronics Engineers (IEEE), 2021. Conference paper (Refereed)
    Abstract [en]

    High performance computing architectures have become increasingly heterogeneous in recent times. This growing architectural variety presents a multi-faceted portability problem affecting applications, libraries, programming models, languages, compilers, run-times, and system software. Approaches for performance portability typically focus heavily on efficient usage of parallel compute architectures and less on data locality abstractions and complex memory systems, with minimal support afforded to effective memory management in traditional HPC languages such as C and Fortran. We present Mamba, a library to facilitate usage of heterogeneous memory systems by high performance application/library developers through high level array-based abstractions for memory management supported by a low-level generic memory API. We detail the library design and implementation, demonstrating generic memory allocation, data layout specification, array tiling and heterogeneous transport. We evaluate performance in the context of a typical matrix transposition, DNA sequencing benchmark, and an application use case for high-order spectral element based incompressible flow.

  • 17.
    Flatken, Markus
    et al.
    Institute for Software Technology (SC), Software for Space Systems and Interactive Visualization, German Aerospace Center (DLR), Braunschweig, Germany.
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Software and Computer systems, SCS.
    Chien, Wei Der
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Gerndt, Andreas
    Institute for Software Technology (SC), Software for Space Systems and Interactive Visualization, German Aerospace Center (DLR), Braunschweig, Germany.
    et al.,
    VESTEC: Visual Exploration and Sampling Toolkit for Extreme Computing (2023). In: IEEE Access, E-ISSN 2169-3536, Vol. 11, p. 87805-87834. Article in journal (Refereed)
    Abstract [en]

    Natural disasters and epidemics are unfortunate recurring events that lead to huge societal and economic loss. Recent advances in supercomputing can facilitate simulations of such scenarios in (or even ahead of) real-time, therefore supporting the design of adequate responses by public authorities. By incorporating high-velocity data from sensors and modern high-performance computing systems, ensembles of simulations and advanced analysis enable urgent decision-makers to better monitor the disaster and to employ necessary actions (e.g., to evacuate populated areas) for mitigating these events. Unfortunately, frameworks to support such versatile and complex workflows for urgent decision-making are only rarely available and often lack functionality. This paper gives an overview of the VESTEC project and framework, which unifies orchestration, simulation, in-situ data analysis, and visualization of natural disasters that can be driven by external sensor data or interactive intervention by the user. We show how the different components interact and work together in VESTEC and describe implementation details. To disseminate our experience, three different types of disasters are evaluated: a wildfire in La Jonquera (Spain), a mosquito-borne disease in two regions of Italy, and magnetic reconnection in the Earth's magnetosphere.

  • 18.
    He, Yifei
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Andersson, Måns
    KTH, School of Engineering Sciences (SCI), Engineering Mechanics.
    Markidis, Stefano
    KTH, Centres, SeRC - Swedish e-Science Research Centre. KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    FFTc: An MLIR Dialect for Developing HPC Fast Fourier Transform Libraries (2023). In: Euro-Par 2022: Parallel Processing Workshops, Euro-Par 2022 International Workshops, Glasgow, UK, August 22-26, 2022, Revised Selected Papers, 2023, p. 80-92. Conference paper (Refereed)
    Abstract [en]

    Discrete Fourier Transform (DFT) libraries are one of the most critical software components for scientific computing. Inspired by FFTW, a widely used library for DFT HPC calculations, we apply compiler technologies for the development of HPC Fourier transform libraries. In this work, we introduce FFTc, a domain-specific language, based on Multi-Level Intermediate Representation (MLIR), for expressing Fourier Transform algorithms. We present the initial design, implementation, and preliminary results of FFTc.
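
    The transform such libraries compute is the length-N discrete Fourier transform

        X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i \, n k / N}, \qquad k = 0, \dots, N-1

    which fast Fourier transform algorithms such as Cooley-Tukey evaluate in O(N log N) operations by recursively splitting the sum into smaller DFTs; FFTc, per the abstract, is a domain-specific language for expressing such algorithms on top of MLIR.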

  • 19.
    Huthmann, Jens
    et al.
    Riken Center for Computational Science, Japan.
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Sommer, Lukas
    Embedded Systems and Applications Group, TU Darmstadt, Germany.
    Koch, Andreas
    Embedded Systems and Applications Group, TU Darmstadt, Germany.
    Sano, Kentaro
    Riken Center for Computational Science, Japan.
    Extending High-Level Synthesis with High-Performance Computing Performance Visualization (2020). In: Proceedings - IEEE International Conference on Cluster Computing, ICCC, Institute of Electrical and Electronics Engineers (IEEE), 2020, p. 371-380. Conference paper (Refereed)
    Abstract [en]

    The recent maturity in High-Level Synthesis (HLS) has renewed the interest of using Field-Programmable Gate Arrays (FPGAs) to accelerate High-Performance Computing (HPC) applications. Today, several studies have shown performance- and power-benefits of using FPGAs compared to existing approaches for a number of application kernels with ample room for improvements. Unfortunately, modern HLS tools offer little support to gain clarity and insight regarding why a certain application behaves as it does on the FPGA, and most experts rely on intuition or abstract performance models. In this work, we hypothesize that existing profiling and visualization tools used in the HPC domain are also usable for understanding performance on FPGAs. We extend an existing HLS tool-chain to support Paraver, a state-of-the-art visualization and profiling tool well-known in HPC. We describe how each of the events and states are collected, and empirically quantify its hardware overhead. Finally, we practically apply our contribution to two different applications, demonstrating how the tool can be used to provide unique insights into application execution and how it can be used to guide optimizations.

  • 20.
    Huthmann, Jens
    et al.
    Riken Center for Computational Science, Kobe, Japan.
    Sommer, Lukas
    Embedded Systems and Applications Group, TU Darmstadt, Darmstadt, Germany.
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Koch, Andreas
    Embedded Systems and Applications Group, TU Darmstadt, Darmstadt, Germany.
    Sano, Kentaro
    Riken Center for Computational Science, Kobe, Japan.
    OpenMP device offloading to FPGAs using the Nymble infrastructure (2020). In: Lecture Notes in Computer Science, Springer Science and Business Media Deutschland GmbH, 2020, p. 265-279. Conference paper (Refereed)
    Abstract [en]

    Next to GPUs, FPGAs are an attractive target for OpenMP device offloading, as they allow implementing highly efficient, application-specific accelerators. However, prior approaches to support OpenMP device offloading for FPGAs have been limited by the interfaces provided by the FPGA vendors' HLS tools or by their integration with the OpenMP runtime, e.g., for data mapping. This work presents an approach to OpenMP device offloading for FPGAs based on the LLVM compiler infrastructure and the Nymble HLS compiler. The automatic compilation flow uses LLVM IR for HLS-specific optimizations and transformations and for the interaction with the Nymble HLS compiler. Parallel OpenMP constructs are automatically mapped to hardware threads executing simultaneously in the generated FPGA accelerator, and the accelerator is integrated into libomptarget to support data mapping. In a case study, we demonstrate the use of the compilation flow and evaluate its performance.
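
    For reference, the source-level construct being offloaded is ordinary OpenMP target code such as the following standard C++ example (OpenMP 4.5-style; the mapping of the parallel loop to hardware threads in the generated FPGA accelerator is performed by the Nymble-based flow and is not visible in the source):

        #include <cstdio>
        #include <vector>

        int main() {
            const int n = 1024;
            std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
            float *pa = a.data(), *pb = b.data(), *pc = c.data();

            // The target region is outlined by the compiler and offloaded to the
            // device; map clauses describe data movement to/from device memory.
            #pragma omp target teams distribute parallel for \
                map(to: pa[0:n], pb[0:n]) map(from: pc[0:n])
            for (int i = 0; i < n; ++i)
                pc[i] = pa[i] + pb[i];

            std::printf("c[0] = %f\n", c[0]);
        }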

  • 21.
    Jansson, Niclas
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Karp, Martin
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Software and Computer systems, SCS.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Schlatter, Philipp
    KTH, School of Engineering Sciences (SCI), Engineering Mechanics, Fluid Mechanics and Engineering Acoustics, Turbulent simulations laboratory.
    Neko: A modern, portable, and scalable framework for high-fidelity computational fluid dynamics (2024). In: Computers & Fluids, ISSN 0045-7930, E-ISSN 1879-0747, Vol. 275, article id 106243. Article in journal (Refereed)
    Abstract [en]

    Computational fluid dynamics (CFD), in particular applied to turbulent flows, is a research area with great engineering and fundamental physical interest. However, already at moderately high Reynolds numbers the computational cost becomes prohibitive as the range of active spatial and temporal scales quickly widens. Specifically, scale-resolving simulations, including large-eddy simulation (LES) and direct numerical simulation (DNS), thus need to rely on modern efficient numerical methods and corresponding software implementations. Recent trends and advancements, including more diverse and heterogeneous hardware in High-Performance Computing (HPC), are challenging software developers in their pursuit of good performance and numerical stability. The well-known maxim "software outlives hardware" may no longer necessarily hold true, and developers are today forced to re-factor their codebases to leverage these powerful new systems. In this paper, we present Neko, a new portable framework for high-order spectral element discretization, targeting turbulent flows in moderately complex geometries. Neko is fully available as open software. Unlike prior works, Neko adopts a modern object-oriented approach in Fortran 2008, allowing multi-tier abstractions of the solver stack and facilitating hardware backends ranging from general-purpose processors (CPUs) down to exotic vector processors and FPGAs. We show that Neko's performance and accuracy are comparable to NekRS, and thus on par with Nek5000's successor on modern CPU machines. Furthermore, we develop a performance model, which we use to discuss challenges and opportunities for high-order solvers on emerging hardware.
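
    A rough C++ rendering of the multi-tier backend abstraction described above (Neko itself uses Fortran 2008 abstract types; the class and function names below are invented for illustration, not Neko's API):

        #include <cstdio>
        #include <memory>
        #include <string>
        #include <vector>

        // Abstract interface for one performance-critical kernel of the solver stack.
        struct AxKernel {
            virtual ~AxKernel() = default;
            virtual void apply(const std::vector<double>& u, std::vector<double>& w) const = 0;
        };

        struct CpuAx : AxKernel {
            void apply(const std::vector<double>& u, std::vector<double>& w) const override {
                for (std::size_t i = 0; i < u.size(); ++i) w[i] = 2.0 * u[i];  // placeholder operator
            }
        };

        // A GPU or FPGA backend would subclass AxKernel in the same way and be
        // selected at runtime, leaving the high-level solver code unchanged.
        std::unique_ptr<AxKernel> make_backend(const std::string& name) {
            (void)name;                     // dispatch to CpuAx, CudaAx, ... here
            return std::make_unique<CpuAx>();
        }

        int main() {
            auto ax = make_backend("cpu");
            std::vector<double> u(8, 1.0), w(8, 0.0);
            ax->apply(u, w);
            std::printf("w[0] = %f\n", w[0]);
        }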

  • 22.
    Karp, Martin
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Schlatter, Philipp
    KTH, School of Engineering Sciences (SCI), Centres, Linné Flow Center, FLOW. KTH, Centres, SeRC - Swedish e-Science Research Centre. KTH, School of Engineering Sciences (SCI), Engineering Mechanics, Fluid Mechanics and Engineering Acoustics.
    Markidis, Stefano
    KTH, Centres, SeRC - Swedish e-Science Research Centre. KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Optimization of Tensor-product Operations in Nekbone on GPUs (2020). Conference paper (Refereed)
    Abstract [en]

    In the CFD solver Nek5000, the computation is dominated by the evaluation of small tensor operations. Nekbone is a proxy app for Nek5000 and has previously been ported to GPUs with a mixed OpenACC and CUDA approach. In this work, we continue this effort and optimize the main tensor-product operation in Nekbone further. Our optimization is done in CUDA and uses a different, 2D, thread structure to make the computations layer by layer. This enables us to use loop unrolling as well as utilize registers and shared memory efficiently. Our implementation is then compared on both the Pascal and Volta GPU architectures to previous GPU versions of Nekbone as well as a measured roofline. The results show that our implementation outperforms previous GPU Nekbone implementations by 6-10%. Compared to the measured roofline, we obtain 77-92% of the peak performance for both Nvidia P100 and V100 GPUs for inputs with 1024-4096 elements and polynomial degree 9.

    Download full text (pdf)
    fulltext
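
    The operation being optimized above is a batch of small dense tensor contractions per spectral element; a plain CPU reference form is sketched below (the paper's contribution is a CUDA kernel with a 2D thread layout exploiting registers and shared memory, which is not reproduced here).

        #include <cstdio>
        #include <vector>

        // One spectral element of size n^3: w(i,j,k) = sum_l D(i,l) * u(l,j,k),
        // i.e. the derivative along one direction as a small matrix product.
        void tensor_dx(const std::vector<double>& D,   // n x n derivative matrix
                       const std::vector<double>& u,   // n^3 element values
                       std::vector<double>& w, int n) {
            for (int k = 0; k < n; ++k)
                for (int j = 0; j < n; ++j)
                    for (int i = 0; i < n; ++i) {
                        double acc = 0.0;
                        for (int l = 0; l < n; ++l)
                            acc += D[i * n + l] * u[(k * n + j) * n + l];
                        w[(k * n + j) * n + i] = acc;
                    }
        }

        int main() {
            const int n = 10;   // polynomial degree 9, as in the experiments above
            std::vector<double> D(n * n, 1.0), u(n * n * n, 1.0), w(n * n * n, 0.0);
            tensor_dx(D, u, w, n);
            std::printf("w[0] = %f\n", w[0]);   // 10.0 for the all-ones inputs
        }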
  • 23.
    Karp, Martin
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Schlatter, Philipp
    KTH, School of Engineering Sciences (SCI), Engineering Mechanics, Fluid Mechanics and Engineering Acoustics.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Reducing Communication in the Conjugate Gradient Method: A Case Study on High-Order Finite Elements (2022). In: Proceedings of the Platform for Advanced Scientific Computing Conference, PASC 2022, Association for Computing Machinery (ACM), 2022, article id 2. Conference paper (Refereed)
    Abstract [en]

    Currently, a major bottleneck for several scientific computations is communication, both communication between different processors, so-called horizontal communication, and vertical communication between different levels of the memory hierarchy. With this bottleneck in mind, we target a notoriously communication-bound solver at the core of many high-performance applications, namely the conjugate gradient method (CG). To reduce the communication we present lower bounds on the vertical data movement in CG and go on to make a CG solver with reduced data movement. Using our theoretical analysis we apply our CG solver on a high-performance discretization used in practice, the spectral element method (SEM). Guided by our analysis, we show that for the Poisson equation on modern GPUs we can improve the performance by 30% by both rematerializing the discrete system and by reformulating the system to work on unique degrees of freedom. In order to investigate how horizontal communication can be reduced, we compare CG to two communication-reducing techniques, namely communication-avoiding and pipelined CG. We strong scale up to 4096 CPU cores and showcase performance improvements of upwards of 70% for pipelined CG compared to standard CG when applied on SEM at scale. We show that in addition to improving the scaling capabilities of the solver, initial measurements indicate that the convergence of SEM is largely unaffected by pipelined CG.
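
    For context, one iteration of the standard (unpreconditioned) conjugate gradient method reads

        \alpha_k = \frac{(r_k, r_k)}{(p_k, A p_k)}, \quad
        x_{k+1} = x_k + \alpha_k p_k, \quad
        r_{k+1} = r_k - \alpha_k A p_k, \quad
        \beta_k = \frac{(r_{k+1}, r_{k+1})}{(r_k, r_k)}, \quad
        p_{k+1} = r_{k+1} + \beta_k p_k

    so every iteration performs one operator application (here a matrix-free SEM evaluation) and two inner products, i.e. two global reductions; communication-avoiding and pipelined CG restructure these recurrences so that the reductions are batched or overlapped with the operator application, which is what the strong-scaling comparison above targets.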

  • 24.
    Karp, Martin
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC. KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Kenter, Tobias
    Paderborn University.
    Plessl, Christian
    Paderborn University.
    Schlatter, Philipp
    KTH, School of Engineering Sciences (SCI), Engineering Mechanics, Fluid Mechanics and Engineering Acoustics. KTH, School of Engineering Sciences (SCI), Centres, Linné Flow Center, FLOW. KTH, Centres, SeRC - Swedish e-Science Research Centre.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). KTH, Centres, SeRC - Swedish e-Science Research Centre.
    Appendix to High-Performance Spectral Element Methods on Field-Programmable Gate Arrays (2020). Other (Other academic)
    Abstract [en]

    In this Appendix we display some results we omitted from our article "High-Performance Spectral Element Methods on Field-Programmable Gate Arrays". In particular, we showcase the measured bandwidth for the FPGA we used (Stratix 10) as well as the performance of our accelerator at different stages of optimization. In addition, we illustrate more practical aspects of our performance/resource modeling.

    Improvements in computer systems have historically relied on two well-known observations: Moore's law and Dennard's scaling. Today, both these observations are ending, forcing computer users, researchers, and practitioners to abandon the comforts of general-purpose architectures in favor of emerging post-Moore systems. Among the most salient of these post-Moore systems is the Field-Programmable Gate Array (FPGA), which strikes a good balance between complexity and performance. In this paper, we study modern FPGAs' applicability for use in accelerating the Spectral Element Method (SEM) core to many computational fluid dynamics (CFD) applications. We design a custom SEM hardware accelerator that we empirically evaluate on the latest Stratix 10 SX-series FPGAs and position its performance (and power-efficiency) against state-of-the-art systems such as ARM ThunderX2, NVIDIA Pascal/Volta/Ampere Tesla-series cards, and general-purpose manycore CPUs. Finally, we develop a performance model for our SEM-accelerator, which we use to project the performance and role of future FPGAs to accelerate CFD applications, ultimately answering the question: what characteristics would a perfect FPGA for CFD applications have?

    Download full text (pdf)
    fulltext
  • 25.
    Karp, Martin
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Kenter, Tobias
    Plessl, Christian
    Schlatter, Philipp
    KTH, School of Engineering Sciences (SCI), Engineering Mechanics, Fluid Mechanics and Engineering Acoustics.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    High-Performance Spectral Element Methods on Field-Programmable Gate Arrays: Implementation, Evaluation, and Future Projection (2021). In: Proceedings of the 35th IEEE International Parallel & Distributed Processing Symposium, May 17-21, 2021, Portland, Oregon, USA, Institute of Electrical and Electronics Engineers (IEEE), 2021. Conference paper (Refereed)
    Abstract [en]

     Improvements in computer systems have historically relied on two well-known observations: Moore's law and Dennard's scaling. Today, both these observations are ending, forcing computer users, researchers, and practitioners to abandon the general-purpose architectures' comforts in favor of emerging post-Moore systems. Among the most salient of these post-Moore systems is the Field-Programmable Gate Array (FPGA), which strikes a convenient balance between complexity and performance. In this paper, we study modern FPGAs' applicability in accelerating the Spectral Element Method (SEM) core to many computational fluid dynamics (CFD) applications. We design a custom SEM hardware accelerator operating in double-precision that we empirically evaluate on the latest Stratix 10 GX-series FPGAs and position its performance (and power-efficiency) against state-of-the-art systems such as ARM ThunderX2, NVIDIA Pascal/Volta/Ampere Tesla-series cards, and general-purpose manycore CPUs. Finally, we develop a performance model for our SEM-accelerator, which we use to project future FPGAs' performance and role to accelerate CFD applications, ultimately answering the question: what characteristics would a perfect FPGA for CFD applications have? 

  • 26.
    Karp, Martin
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Kenter, Tobias
    Paderborn University.
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Plessl, Christian
    Paderborn University.
    Schlatter, Philipp
    KTH, School of Engineering Sciences (SCI), Engineering Mechanics, Fluid Mechanics and Engineering Acoustics.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    A High-Fidelity Flow Solver for Unstructured Meshes on Field-Programmable Gate Arrays: Design, Evaluation, and Future Challenges2022In: HPCAsia2022: International Conference on High Performance Computing in Asia-Pacific Region, Association for Computing Machinery (ACM) , 2022, p. 125-136Conference paper (Refereed)
    Abstract [en]

    The impending termination of Moore’s law motivates the search for new forms of computing to continue the performance scaling we have grown accustomed to. Among the many emerging Post-Moore computing candidates, perhaps none is as salient as the Field-Programmable Gate Array (FPGA), which offers the means of specializing and customizing the hardware to the computation at hand.

    In this work, we design a custom FPGA-based accelerator for a computational fluid dynamics (CFD) code. Unlike prior work – which often focuses on accelerating small kernels – we target the entire Poisson solver on unstructured meshes based on the high-fidelity spectral element method (SEM) used in modern state-of-the-art CFD systems. We model our accelerator using an analytical performance model based on the I/O cost of the algorithm. We empirically evaluate our accelerator on a state-of-the-art Intel Stratix 10 FPGA in terms of performance and power consumption and contrast it against existing solutions on general-purpose processors (CPUs). Finally, we propose a data movement-reducing technique where we compute geometric factors on the fly, which yields significant (700+ Gflop/s) single-precision performance and upwards of a 2x reduction in runtime for the local evaluation of the Laplace operator.

    We end the paper by discussing the challenges and opportunities of using reconfigurable architecture in the future, particularly in the light of emerging (not yet available) technologies.
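
    As background for the I/O-cost-based model mentioned above, the toy program below computes a generic roofline-style runtime bound, max(compute time, data-movement time). It is not the authors' analytical model; every machine and kernel number in it is a hypothetical placeholder, and it only illustrates why reducing bytes moved (for example by recomputing geometric factors on the fly) can shorten the memory-bound term.

        /* Generic roofline-style runtime estimate (illustrative only).
         * Build with: gcc roofline.c -o roofline */
        #include <stdio.h>

        int main(void) {
            /* Hypothetical accelerator characteristics */
            double peak_gflops = 1000.0;    /* sustained Gflop/s */
            double bandwidth_gbs = 512.0;   /* off-chip bandwidth, GB/s */

            /* Hypothetical kernel cost per solver iteration */
            double gflops_performed = 40.0; /* Gflop */
            double gbytes_moved = 12.0;     /* GB */

            double t_compute = gflops_performed / peak_gflops;
            double t_memory = gbytes_moved / bandwidth_gbs;
            double t_bound = t_compute > t_memory ? t_compute : t_memory;

            printf("compute bound: %.4f s, memory bound: %.4f s\n",
                   t_compute, t_memory);
            printf("roofline estimate: %.4f s (%s-bound)\n", t_bound,
                   t_compute > t_memory ? "compute" : "memory");

            /* Recomputing data (e.g., geometric factors) instead of loading it
             * lowers gbytes_moved, shrinking t_memory at the cost of t_compute. */
            return 0;
        }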

  • 27.
    Liu, Felix
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). RaySearch Laboratories.
    Jansson, Niclas
    KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for High Performance Computing, PDC.
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Fredriksson, Albin
    RaySearch Laboratories.
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Accelerating Radiation Therapy Dose Calculation with Nvidia GPUs2021In: IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Institute of Electrical and Electronics Engineers (IEEE) , 2021Conference paper (Refereed)
    Abstract [en]

    Radiation Treatment Planning (RTP) is the process of planning the appropriate external beam radiotherapy to combat cancer in human patients. RTP is a complex and compute-intensive task, which often takes a long time (several hours) to compute. Reducing this time allows for higher productivity at clinics and more sophisticated treatment planning, which can materialize in better treatments. The state-of-the-art in medical facilities uses general-purpose processors (CPUs) to perform many steps in the RTP process. In this paper, we explore the use of accelerators to reduce RTP calculation time. We focus on the step that calculates the dose using the Graphics Processing Unit (GPU), which we believe is an excellent candidate for this computation type. Next, we create a highly optimized implementation for a custom Sparse Matrix-Vector Multiplication (SpMV) that operates on numerical formats unavailable in state-of-the-art SpMV libraries (e.g., Ginkgo and cuSPARSE). We show that our implementation is several times faster than the baseline (up to 4x) and has a higher operational intensity than similar (but different) versions such as Ginkgo and cuSPARSE.
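
    For readers unfamiliar with the kernel being accelerated, here is a plain CPU reference for sparse matrix-vector multiplication in the common CSR format. This is a minimal sketch for illustration only; it is not the paper's custom GPU kernel or its special numerical format, and the matrix values are made up.

        /* Baseline CSR sparse matrix-vector multiply, y = A*x (illustrative).
         * Build with: gcc spmv_csr.c -o spmv_csr */
        #include <stdio.h>

        /* y[i] = sum of val[k] * x[col[k]] over the nonzeros of row i */
        static void spmv_csr(int nrows, const int *rowptr, const int *col,
                             const double *val, const double *x, double *y) {
            for (int i = 0; i < nrows; ++i) {
                double sum = 0.0;
                for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
                    sum += val[k] * x[col[k]];
                y[i] = sum;
            }
        }

        int main(void) {
            /* 3x3 example matrix:
               [2 0 1]
               [0 3 0]
               [4 0 5] */
            int rowptr[] = {0, 2, 3, 5};
            int col[]    = {0, 2, 1, 0, 2};
            double val[] = {2.0, 1.0, 3.0, 4.0, 5.0};
            double x[]   = {1.0, 1.0, 1.0};
            double y[3];

            spmv_csr(3, rowptr, col, val, x, y);
            printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]); /* expect 3, 3, 9 */
            return 0;
        }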

  • 28.
    Markidis, Stefano
    et al.
    KTH, Centres, SeRC - Swedish e-Science Research Centre. KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Peng, Ivy
    Lawrence Livermore Natl Lab, Livermore, CA 94550 USA..
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Jongsuebchoke, Itthinat
    KTH.
    Bengtsson, Gabriel
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Herman, Pawel
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Automatic Particle Trajectory Classification in Plasma Simulations2020In: 2020 IEEE/ACM workshop on machine learning in high performance computing environments (mlhpc 2020) and workshop on artificial intelligence and machine learning for scientific applications (ai4s 2020), Institute of Electrical and Electronics Engineers (IEEE) , 2020, p. 64-71Conference paper (Refereed)
    Abstract [en]

    Numerical simulations of plasma flows are crucial for advancing our understanding of microscopic processes that drive the global plasma dynamics in fusion devices, space, and astrophysical systems. Identifying and classifying particle trajectories allows us to determine specific on-going acceleration mechanisms, shedding light on essential plasma processes. Our overall goal is to provide a general workflow for exploring particle trajectory space and automatically classifying particle trajectories from plasma simulations in an unsupervised manner. We combine pre-processing techniques, such as Fast Fourier Transform (FFT), with Machine Learning methods, such as Principal Component Analysis (PCA), k-means clustering algorithms, and silhouette analysis. We demonstrate our workflow by classifying electron trajectories during a magnetic reconnection problem. Our method successfully recovers existing results from previous literature without a priori knowledge of the underlying system. Our workflow can be applied to analyzing particle trajectories in different phenomena, from magnetic reconnection and shocks to magnetospheric flows. The workflow has no dependence on any physics model and can identify particle trajectories and acceleration mechanisms that were not detected before.
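
    As a rough illustration of only the clustering stage of this workflow (the FFT pre-processing, PCA, and silhouette analysis are omitted), the sketch below runs a minimal k-means over hypothetical 2-D feature vectors; none of the data, the number of clusters, or the iteration count comes from the paper.

        /* Minimal k-means over made-up 2-D feature vectors (illustrative only).
         * Build with: gcc kmeans.c -o kmeans */
        #include <stdio.h>
        #include <float.h>

        #define N 6   /* number of feature vectors ("trajectories") */
        #define K 2   /* number of clusters */
        #define D 2   /* feature dimension (e.g., two leading components) */

        int main(void) {
            double pts[N][D] = { {0.0, 0.1}, {0.2, 0.0}, {0.1, 0.2},
                                 {5.0, 5.1}, {5.2, 4.9}, {4.9, 5.0} };
            double cent[K][D] = { {0.0, 0.0}, {5.0, 5.0} };  /* initial centroids */
            int label[N];

            for (int iter = 0; iter < 20; ++iter) {
                /* assignment step: nearest centroid by squared distance */
                for (int i = 0; i < N; ++i) {
                    double best = DBL_MAX;
                    for (int c = 0; c < K; ++c) {
                        double d2 = 0.0;
                        for (int j = 0; j < D; ++j) {
                            double diff = pts[i][j] - cent[c][j];
                            d2 += diff * diff;
                        }
                        if (d2 < best) { best = d2; label[i] = c; }
                    }
                }
                /* update step: each centroid becomes the mean of its points */
                for (int c = 0; c < K; ++c) {
                    double sum[D] = {0.0};
                    int count = 0;
                    for (int i = 0; i < N; ++i) {
                        if (label[i] != c) continue;
                        for (int j = 0; j < D; ++j) sum[j] += pts[i][j];
                        ++count;
                    }
                    if (count > 0)
                        for (int j = 0; j < D; ++j) cent[c][j] = sum[j] / count;
                }
            }
            for (int i = 0; i < N; ++i)
                printf("trajectory %d -> cluster %d\n", i, label[i]);
            return 0;
        }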

  • 29.
    Muddukrishna, Ananya
    et al.
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Jonsson, Peter A.
    Podobas, Artur
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Grain Graphs: OpenMP Performance Analysis Made Easy2016Conference paper (Refereed)
    Abstract [en]

    Average programmers struggle to solve performance problems in OpenMP programs with tasks and parallel for-loops. Existing performance analysis tools visualize OpenMP task performance from the runtime system's perspective where task execution is interleaved with other tasks in an unpredictable order. Problems with OpenMP parallel for-loops are similarly difficult to resolve since tools only visualize aggregate thread-level statistics such as load imbalance without zooming into a per-chunk granularity. The runtime system/threads oriented visualization provides poor support for understanding problems with task and chunk execution time, parallelism, and memory hierarchy utilization, forcing average programmers to rely on experts or use tedious trial-and-error tuning methods for performance. We present grain graphs, a new OpenMP performance analysis method that visualizes grains - computation performed by a task or a parallel for-loop chunk instance - and highlights problems such as low parallelism, work inflation and poor parallelization benefit at the grain level. We demonstrate that grain graphs can quickly reveal performance problems that are difficult to detect and characterize in fine detail using existing visualizations in standard OpenMP programs, simplifying OpenMP performance analysis. This enables average programmers to make portable optimizations for poorly performing OpenMP programs, reducing pressure on experts and removing the need for tedious trial-and-error tuning.
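
    To make the notion of a "grain" concrete, the sketch below is a small OpenMP program (not taken from the paper) in which every task instance and every parallel-for chunk is one grain of the kind a grain graph would visualize; the work() routine and all sizes are invented for illustration.

        /* Tasks and loop chunks as "grains" (illustrative only).
         * Build with: gcc -fopenmp grains.c -o grains */
        #include <stdio.h>
        #include <omp.h>

        static long work(long n) {          /* dummy computation */
            long s = 0;
            for (long i = 0; i < n; ++i) s += i % 7;
            return s;
        }

        int main(void) {
            long task_result[4], loop_result[16];

            #pragma omp parallel
            #pragma omp single
            for (int t = 0; t < 4; ++t) {
                /* each task instance is one grain */
                #pragma omp task firstprivate(t) shared(task_result)
                task_result[t] = work(1000000L * (t + 1));
            }   /* the barrier closing the region waits for all tasks */

            /* each chunk of 4 iterations is one grain */
            #pragma omp parallel for schedule(dynamic, 4)
            for (int i = 0; i < 16; ++i)
                loop_result[i] = work(500000L);

            printf("done: %ld %ld\n", task_result[0], loop_result[0]);
            return 0;
        }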

  • 30.
    Muddukrishna, Ananya
    et al.
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Podobas, Artur
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Vlassov, Vladimir
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Task Scheduling on Manycore Processors with Home Caches2013In: Euro-Par 2012 Workshops, 2013Conference paper (Refereed)
    Abstract [en]

    Modern manycore processors feature a highly scalable and software-configurable cache hierarchy. For performance, manycore programmers will not only have to efficiently utilize the large number of cores but also understand and configure the cache hierarchy to suit the application. Relief from this manycore programming nightmare can be provided by task-based programming models, where programmers parallelize using tasks and an architecture-specific runtime system maps tasks to cores and in addition configures the cache hierarchy. In this paper, we focus on the cache hierarchy of the Tilera TILEPro64 processor, which features a software-configurable coherence waypoint called the home cache. We first show the runtime system performance bottleneck of scheduling tasks oblivious to the nature of home caches. We then demonstrate a technique in which the runtime system controls the assignment of home caches to memory blocks and schedules tasks to minimize home cache access penalties. Test results of our technique have shown a significant execution time performance improvement on selected benchmarks, leading to the conclusion that by taking processor architecture features into account, task-based programming models can indeed provide continued performance and allow programmers to smoothly transition from the multicore to the manycore era.

  • 31.
    Natarajan Arul, Murugan
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science.
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Gadioli, Davide
    Politecn Milan, Dipartimento Elettron Infomaz & Bioingn, I-20133 Milan, Italy..
    Vitali, Emanuele
    Politecn Milan, Dipartimento Elettron Infomaz & Bioingn, I-20133 Milan, Italy..
    Palermo, Gianluca
    Politecn Milan, Dipartimento Elettron Infomaz & Bioingn, I-20133 Milan, Italy..
    Markidis, Stefano
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    A Review on Parallel Virtual Screening Softwares for High-Performance Computers2022In: Pharmaceuticals, E-ISSN 1424-8247, Vol. 15, no 1, p. 63-, article id 63Article, review/survey (Refereed)
    Abstract [en]

    Drug discovery is the most expensive, time-demanding, and challenging project in biopharmaceutical companies; it aims at the identification and optimization of lead compounds from large chemical libraries. The lead compounds should have high-affinity binding and specificity for a target associated with a disease and, in addition, favorable pharmacodynamic and pharmacokinetic properties (grouped as ADMET properties). Overall, drug discovery is a multivariable optimization problem and can be carried out on supercomputers using a reliable scoring function, which is a measure of the binding affinity or inhibition potential of a drug-like compound. The major problem is that the number of compounds in the chemical spaces is huge, making computational drug discovery very demanding. However, it is cheaper and less time-consuming when compared to experimental high-throughput screening. As the problem is to find the most stable (global) minima for numerous protein-ligand complexes (on the order of 10^6 to 10^12), the parallel implementation of in silico virtual screening can be exploited to ensure drug discovery in affordable time. In this review, we discuss such implementations of parallelization algorithms in virtual screening programs. The nature of different scoring functions and search algorithms is discussed, together with a performance analysis of several docking programs ported to high-performance computing architectures.

  • 32.
    Podobas, Artur
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Accelerating Parallel Computations with OpenMP-Driven System-on-Chip Generation for FPGAs2014In: Embedded Multicore/Manycore SoCs (MCSoc), 2014 IEEE 8th International Symposium on, IEEE conference proceedings, 2014, p. 149-156Conference paper (Refereed)
    Abstract [en]

    The task-based programming paradigm offers a portable way of writing parallel applications. However, it requires tedious tuning of the application for performance. We present a novel design flow where programmers can use application knowledge to easily generate a System-on-Chip (SoC) specialized in executing the application. Our design flow uses a compiler that automatically generates task-specific cores and packs them into a custom SoC. An SoC-specific runtime system schedules tasks on cores to accelerate application execution. The generated SoC shows up to 6000 times performance improvement in comparison to the Altera NiosII/s processor and up to 7 times compared to an AMD Opteron 6172 core. Our design flow helps programmers generate high-performance systems without requiring tuning and prior hardware design knowledge.
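
    As a hypothetical illustration of the kind of input such a design flow consumes, the OpenMP program below contains two distinct task types; in a flow like the one described, each task body could become a task-specific core. The code and kernels are invented for illustration and are not taken from the paper.

        /* Two task types in an OpenMP program (illustrative input sketch).
         * Build with: gcc -fopenmp soc_input.c -o soc_input */
        #include <stdio.h>
        #include <omp.h>

        #define N 1024

        int main(void) {
            static double a[N], b[N];
            double sum = 0.0;

            for (int i = 0; i < N; ++i) a[i] = i * 0.5;

            #pragma omp parallel
            #pragma omp single
            {
                /* task type 1: element-wise scaling */
                #pragma omp task shared(a, b)
                for (int i = 0; i < N; ++i) b[i] = 2.0 * a[i];

                #pragma omp taskwait

                /* task type 2: reduction over the scaled vector */
                #pragma omp task shared(b, sum)
                for (int i = 0; i < N; ++i) sum += b[i];

                #pragma omp taskwait
            }
            printf("sum = %g\n", sum);
            return 0;
        }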

    Download full text (pdf)
    fulltext
  • 33.
    Podobas, Artur
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Improving Performance and Quality-of-Service through the Task-Parallel Model​: Optimizations and Future Directions for OpenMP2015Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    With the failure of Dennard's scaling, which stated that shrinking transistors would be more power-efficient, computer hardware has today become very divergent. Initially the change only concerned the number of processors on a chip (multicores), but it has today further escalated into complex heterogeneous systems with non-intuitive properties -- properties that can improve performance and power consumption but also strain the programmer expected to develop on them.

    Answering these challenges is the OpenMP task-parallel model -- a programming model that simplifies writing parallel software. Our focus in the thesis has been to explore performance and quality-of-service directions of the OpenMP task-parallel model, particularly by taking architectural features into account.

    The first question tackled is: what capabilities do existing state-of-the-art runtime-systems have, and how do they perform? We empirically evaluated the performance of several modern task-parallel runtime-systems. Performance and power-consumption were measured through the use of benchmarks, and we show that the two primary causes of bottlenecks in modern runtime-systems lie in either the task management overheads or how tasks are being distributed across processors.

    Next, we consider quality-of-service improvements in task-parallel runtime-systems. Striving to improve execution performance, current state of the art runtime-systems seldom take dynamic architectural features such as temperature into account when deciding how work should be distributed across the processors, which can lead to overheating. We developed and evaluated two strategies for thermal-awareness in task-parallel runtime-systems. The first improves performance when the computer system is constrained by temperature while the second strategy strives to reduce temperature while meeting soft real-time objectives.

    We end the thesis by focusing on performance. Here we introduce our original contribution called BLYSK -- a prototype OpenMP framework created exclusively for performance research.

    We found that overheads in current runtime-systems can be expensive, which often lead to performance degradation. We introduce a novel way of preserving task-graphs throughout application runs: task-graphs are recorded, identified and optimized the first time an OpenMP application is executed and are later re-used in following executions, removing unnecessary overheads. Our proposed solution can nearly double the performance compared with other state of the art runtime-systems.

    Performance can also be improved through heterogeneity. Today, manufacturers are placing processors with different capabilities on the same chip. Because they are different, their power-consuming characteristics and performance differ. Heterogeneity adds another dimension to the multiprocessing problem: how should work be distributed across the heterogeneous processors? We evaluated the performance of existing, homogeneous scheduling algorithms and found them to be an ill match for heterogeneous systems. We proposed a novel scheduling algorithm that dynamically adjusts itself to the heterogeneous system in order to improve performance.

    The thesis ends with a high-level synthesis approach to improve performance in task-parallel applications. Rather than limiting ourselves to off-the-shelf processors -- which often contain a large amount of unused logic -- our approach is to automatically generate the processors ourselves. Our method allows us to generate application-specific hardware from the OpenMP task-parallel source code. Evaluated using FPGAs, our System-on-Chips outperformed other soft-cores such as the NiosII processor and were also comparable in performance with modern state-of-the-art processors such as the Xeon PHI and the AMD Opteron.

    Download full text (pdf)
    Thesis
  • 34.
    Podobas, Artur
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Performance-driven exploration using Task-based Parallel Programming Frameworks2013Licentiate thesis, comprehensive summary (Other academic)
    Download full text (pdf)
    podobas_lic_summary
  • 35.
    Podobas, Artur
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Q2Logic: A Coarse-Grained FPGA Overlay targeting Schrödinger Quantum Circuit Simulations2023In: 2023 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2023, Institute of Electrical and Electronics Engineers (IEEE) , 2023, p. 460-467Conference paper (Refereed)
    Abstract [en]

    Quantum computing is emerging as an important (but radical) technology that might take us beyond Moore's law for certain applications. Today, in parallel with improving quantum computers, computer scientists are relying heavily on quantum circuit simulators to develop algorithms. Most existing quantum circuit simulators run on general-purpose CPUs or GPUs. However, at the same time, quantum circuits themselves offer multiple opportunities for parallelization, some of which could map better to other architectures such as reconfigurable systems. In this early work, we created a quantum circuit simulator system called Q2Logic. Q2Logic is a coarse-grained reconfigurable architecture (CGRA) implemented as an overlay on Field-Programmable Gate Arrays (FPGAs), but specialized towards quantum simulations. We describe how Q2Logic has been created and reveal implementation details, limitations, and opportunities. We end the study by empirically comparing the performance of Q2Logic (running on an Intel Agilex FPGA) against the state-of-the-art framework SVSim (running on a modern processor), showing improvements in three large circuits (#qbit ≥ 27), where Q2Logic can be up to 7x faster.
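
    For context, a Schrödinger-style simulator holds the full statevector and updates pairs of amplitudes per gate. The sketch below applies one Hadamard gate to one qubit of a tiny statevector in plain C; it only shows the memory-access pattern such simulators parallelize and is in no way Q2Logic's implementation (gate choice, qubit count, and naming are all illustrative).

        /* Single-qubit gate on a statevector (illustrative only).
         * Build with: gcc statevec.c -o statevec -lm */
        #include <stdio.h>
        #include <complex.h>
        #include <math.h>

        #define NQUBITS 3
        #define DIM (1 << NQUBITS)

        /* apply a 2x2 gate g to qubit q of statevector s */
        static void apply_1q_gate(double complex s[DIM],
                                  const double complex g[2][2], int q) {
            int stride = 1 << q;
            for (int i = 0; i < DIM; ++i) {
                if (i & stride) continue;     /* visit each amplitude pair once */
                double complex a0 = s[i];
                double complex a1 = s[i | stride];
                s[i]          = g[0][0] * a0 + g[0][1] * a1;
                s[i | stride] = g[1][0] * a0 + g[1][1] * a1;
            }
        }

        int main(void) {
            double complex state[DIM] = {1.0};  /* start in |000> */
            const double h = 1.0 / sqrt(2.0);
            const double complex H[2][2] = { {h, h}, {h, -h} };

            apply_1q_gate(state, H, 0);         /* qubit 0 into superposition */
            for (int i = 0; i < DIM; ++i)
                printf("|%d%d%d> : %.3f%+.3fi\n",
                       (i >> 2) & 1, (i >> 1) & 1, i & 1,
                       creal(state[i]), cimag(state[i]));
            return 0;
        }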

  • 36.
    Podobas, Artur
    et al.
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    A Comparison of some recent Task-based Parallel Programming Models2010In: Proceedings of the 3rd Workshop on Programmability Issues for Multi-Core Computers, (MULTIPROG'2010), Jan 2010, Pisa, 2010Conference paper (Refereed)
    Abstract [en]

    The need for parallel programming models that are simple to use and at the same time efficient for current and future parallel platforms has led to recent attention to task-based models such as Cilk++, Intel TBB and the task concept in OpenMP version 3.0. The choice of model and implementation can have a major impact on the final performance and in order to understand some of the trade-offs we have made a quantitative study comparing four implementations of OpenMP (gcc, Intel icc, Sun studio and the research compiler Mercurium/nanos mcc), Cilk++ and Wool, a high-performance task-based library developed at SICS. We use microbenchmarks to characterize costs for task-creation and stealing and the Barcelona OpenMP Tasks Suite for characterizing application performance. By far Wool and Cilk++ have the lowest overhead in both spawning and stealing tasks. This is reflected in application performance when many tasks with small granularity are spawned, where Cilk++ and Wool, in particular, have the highest performance. For coarse-granularity applications, the OpenMP implementations have quite similar performance to the more light-weight Cilk++ and Wool, except for one application where mcc is superior thanks to a superior task scheduler. The OpenMP implementations are generally not yet ready for use when the task granularity becomes very small. There is no inherent reason for this, so we expect future implementations of OpenMP to focus on this issue.
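
    A classic microbenchmark for task-creation and stealing overheads is recursive Fibonacci, which spawns very many tiny tasks. The OpenMP version below is an illustrative stand-in, not the benchmark code used in the study; the compared frameworks express the same pattern with their own spawn/sync constructs.

        /* Fine-grained task microbenchmark (illustrative only).
         * Build with: gcc -fopenmp fib.c -o fib */
        #include <stdio.h>
        #include <omp.h>

        static long fib(int n) {
            long x, y;
            if (n < 2) return n;
            #pragma omp task shared(x) firstprivate(n)
            x = fib(n - 1);             /* spawned child task */
            y = fib(n - 2);             /* executed in place */
            #pragma omp taskwait
            return x + y;
        }

        int main(void) {
            long result;
            double t0 = omp_get_wtime();
            #pragma omp parallel
            #pragma omp single
            result = fib(30);
            printf("fib(30) = %ld in %.3f s\n", result, omp_get_wtime() - t0);
            return 0;
        }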

    Download full text (pdf)
    Podobas-Multiprog'2010
  • 37.
    Podobas, Artur
    et al.
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Architecture-aware Task-scheduling: A thermal approach2011In: http://faspp.ac.upc.edu/faspp11/, 2011Conference paper (Refereed)
    Abstract [en]

    Current task-centric many-core schedulers share a “naive” view of processor architecture; a view that does not care about its thermal, architectural, or power-consuming properties. Future processors will be more heterogeneous than what we see today, and following Moore’s law of transistor doubling, we foresee an increase in power consumption and thus temperature.

    Thermal stress can induce errors in processors, and so a common way to counter this is by slowing the processor down; something task-centric schedulers should strive to avoid. The Thermal-Task-Interleaving scheduling algorithm proposed in this paper takes both the application temperature behavior and architecture into account when making decisions. We show that for a mixed workload, our scheduler outperforms some of the standard, architecture-unaware scheduling solutions existing today.

    Download full text (pdf)
    fulltext
  • 38.
    Podobas, Artur
    et al.
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Cool-Cores: Thermal-aware Task Scheduling for OpenMP2010Conference paper (Refereed)
    Abstract [en]

    Temperature remains a limiting factor in current many-core chips. This work focuses on evaluating the temperature behaviour of different user-mode task schedulers. We model a CMP with 16 cores connected through a mesh interconnect with directory-based cache coherence. This type of system closely resembles some of the manycore architectures on the market. The algorithms we have investigated are two common OpenMP scheduling strategies: Breadth-First and Cilk. We also implemented two temperature-aware schedulers based on the Breadth-First and Cilk schedulers. We show that by enabling temperature-awareness in schedulers, the MTTF can drastically improve with insignificant execution performance losses.

    Download full text (pdf)
    mcc-2010-podobas-brorsson
  • 39.
    Podobas, Artur
    et al.
    KTH, School of Information and Communication Technology (ICT).
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT).
    Empowering OpenMP with Automatically Generated Hardware2016In: International Conference on Embedded Computer Systems: Architectures, MOdeling and Simulation, 2016Conference paper (Refereed)
    Abstract [en]

    OpenMP enables productive software development that targets shared-memory general-purpose systems. However, OpenMP compilers today have little support for future heterogeneous systems – systems that will more than likely contain Field Programmable Gate Arrays (FPGAs) to compensate for the lack of parallelism available in general-purpose systems. We have designed a high-level synthesis flow that automatically generates parallel hardware from unmodified OpenMP programs. The generated hardware is composed of accelerators tailored to act as hardware instances of the OpenMP task primitive. We drive decision making of complex details within accelerators through a constraint-programming model, minimizing the expected input from the (often) hardware-oblivious software developer. We evaluate our system and compare it to two state-of-the-art architectures – the Xeon PHI and the AMD Opteron – where we find our accelerators to perform on par with the two ASIC processors.

    Download full text (pdf)
    fulltext
  • 40.
    Podobas, Artur
    et al.
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    From software to parallel hardware through the OpenMP programming modelManuscript (preprint) (Other academic)
  • 41.
    Podobas, Artur
    et al.
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Faxén, Karl-Filip
    Swedish Institute of Computer Science.
    A comparative performance study of common and popular task-centric programming frameworks2013In: Concurrency and Computation, ISSN 1532-0626, E-ISSN 1532-0634Article in journal (Refereed)
    Abstract [en]

    Programmers today face a bewildering array of parallel programming models and tools, making it difficult to choose an appropriate one for each application. An increasingly popular programming model supporting structured parallel programming patterns in a portable and composable manner is the task-centric programming model. In this study, we compare several popular task-centric programming frameworks, including Cilk Plus, Threading Building Blocks, and various implementations of OpenMP 3.0. We have analyzed their performance on the Barcelona OpenMP Tasking Suite benchmark suite both on a 48-core AMD Opteron 6172 server and a 64-core TILEPro64 embedded many-core processor. Our results show that OpenMP offers the highest flexibility for programmers, but this flexibility comes at a cost. Frameworks supporting only a specific and more restrictive model, such as Cilk Plus and Threading Building Blocks, are generally more efficient both in terms of performance and energy consumption. However, Intel's implementation of OpenMP tasks performs the best and closest to the specialized run-time systems.

  • 42.
    Podobas, Artur
    et al.
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Faxén, Karl-Filip
    SICS.
    A Comparative Performance Study of Common and Popular Task-centric Programming FrameworksArticle in journal (Other academic)
  • 43.
    Podobas, Artur
    et al.
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Faxén, Karl-Filip
    Swedish Institute of Computer Science, SICS.
    A Quantitative Evaluation of popular Task-Centric Programming Models and Libraries2012Report (Other academic)
    Abstract [en]

    Programmers today face a bewildering array of parallel programming models and tools, making it difficult to choose an appropriate one for each application. The present study focuses on the task-centric approach and compares several popular systems, including Cilk Plus, TBB and various implementations of OpenMP 3.0. We analyse their performance on the BOTS benchmark suite both on a 48-core Magny Cours server and a 64-core TILEPro64 embedded manycore processor.

    Download full text (pdf)
    pdbas-tc-comp
  • 44.
    Podobas, Artur
    et al.
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Vlassov, Vladimir
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Exploring heterogeneous scheduling using the task-centric programming model2013In: Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349, Vol. 7640Article in journal (Refereed)
    Abstract [en]

    Computer architecture technology is moving towards more heterogeneous solutions, which will contain a number of processing units with different capabilities that may increase the performance of the system as a whole. However, with increased performance comes increased complexity; complexity that is now barely handled in homogeneous multiprocessing systems. The present study tries to solve a small piece of the heterogeneous puzzle; how can we exploit all system resources in a performance-effective and user-friendly way? Our proposed solution includes a run-time system capable of using a variety of different heterogeneous components while providing the user with the already familiar task-centric programming model interface. Furthermore, when dealing with non-uniform workloads, we show that traditional approaches based on centralized or work-stealing queue algorithms do not work well and propose a scheduling algorithm based on trend analysis to distribute work in a performance-effective way across resources.

    Download full text (pdf)
    fulltext
  • 45.
    Podobas, Artur
    et al.
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Vlassov, Vladimir
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    TurboBŁYSK: Scheduling for improved data-driven task performance with fast dependency resolution2014In: Using and Improving OpenMP for Devices, Tasks, and More: 10th International Workshop on OpenMP, IWOMP 2014, Salvador, Brazil, September 28-30, 2014. Proceedings, Springer, 2014, p. 45-57Conference paper (Refereed)
    Abstract [en]

    Data-driven task-parallelism is attracting growing interest and has now been added to OpenMP (4.0). This paradigm simplifies the writing of parallel applications, extracting parallelism, and facilitates the use of distributed memory architectures. While the programming model itself is becoming mature, a problem with current run-time scheduler implementations is that they require a very large task granularity in order to scale. This limitation is at odds with the idea of task-parallel programming, where programmers should be able to concentrate on exposing parallelism with little regard for the task granularity. To mitigate this limitation, we have designed and implemented TurboBŁYSK, a highly efficient run-time scheduler of tasks with explicit data-dependence annotations. We propose a novel mechanism based on pattern-saving that allows the scheduler to re-use previously resolved dependency patterns, based on programmer annotations, enabling programs to use even the smallest of tasks and scale well. We experimentally show that our techniques in TurboBŁYSK enable achieving nearly twice the peak performance compared with other run-time schedulers. Our techniques are not OpenMP specific and can be implemented in other task-parallel frameworks.
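
    The data-driven tasks in question carry explicit dependence annotations like the illustrative OpenMP example below (not taken from the paper); a run-time scheduler of this kind must resolve the depend clauses on the fly, and the pattern-saving idea amounts to reusing a previously resolved dependency graph when the same annotated pattern recurs.

        /* Data-driven task chain with OpenMP depend clauses (illustrative only).
         * Build with: gcc -fopenmp depend.c -o depend */
        #include <stdio.h>
        #include <omp.h>

        int main(void) {
            int a = 0, b = 0, c = 0;

            #pragma omp parallel
            #pragma omp single
            {
                #pragma omp task depend(out: a)
                a = 1;                               /* producer of a */

                #pragma omp task depend(in: a) depend(out: b)
                b = a + 1;                           /* waits for a */

                #pragma omp task depend(in: a) depend(out: c)
                c = a * 10;                          /* waits for a, runs
                                                        independently of b */

                #pragma omp task depend(in: b) depend(in: c)
                printf("a=%d b=%d c=%d\n", a, b, c); /* waits for b and c */
            }
            return 0;
        }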

    Download full text (pdf)
    fulltext
  • 46.
    Podobas, Artur
    et al.
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Communication: Services and Infrastucture, Software and Computer Systems, SCS.
    Vlassov, Vladimir
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    Chi Ching, Chi
    Technische Universität Berlin.
    Juurlink, Ben
    Technische Universität Berlin.
    Considering Quality-of-Service for Resource Reduction using OpenMP2014Conference paper (Refereed)
    Abstract [en]

    Not caring about resources means wasting them. Current task-based parallel models such as Cilk or OpenMP care only about execution performance regardless of the actual application resource needs; this can lead to over-consumption resulting in resource waste. We present a technique to overcome this resource unawareness by extending the programming model and run-time system to dynamically adapt the allocated resources to reflect the expected Quality-of-Service of the application.

    We show that by considering tasks' timing constraints and the expected quality-of-service in terms of real-time behavior, one can reduce the number of resources and temperature compared to a greedy work-stealing scheduler. Our implementation uses a feedback controller that continuously samples the application-experienced service and dynamically adjusts the number of resources to match the quality required by the application.
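
    A minimal sketch of the feedback idea, assuming a soft per-iteration deadline and using OpenMP's thread count as the controlled resource: measure the service time of each iteration and release or reclaim threads accordingly. The deadline, problem size, and adjustment rule are all invented for illustration; this is not the controller from the paper.

        /* Toy feedback loop adjusting thread count to a soft deadline
         * (illustrative only). Build with: gcc -fopenmp qos.c -o qos -lm */
        #include <stdio.h>
        #include <math.h>
        #include <omp.h>

        #define N (1 << 22)

        int main(void) {
            static double buf[N];
            double target = 0.05;              /* soft deadline per iteration, s */
            int threads = omp_get_max_threads();

            for (int iter = 0; iter < 10; ++iter) {
                double t0 = omp_get_wtime();

                #pragma omp parallel for num_threads(threads)
                for (int i = 0; i < N; ++i)    /* the "service" work */
                    buf[i] = sin(i * 1e-6) * cos(i * 2e-6);

                double elapsed = omp_get_wtime() - t0;

                /* release a thread when well under the deadline,
                 * reclaim one when the deadline is missed */
                if (elapsed < 0.8 * target && threads > 1)
                    threads--;
                else if (elapsed > target && threads < omp_get_max_threads())
                    threads++;

                printf("iter %d: %.4f s with %d thread(s)\n",
                       iter, elapsed, threads);
            }
            return 0;
        }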

    Download full text (pdf)
    fulltext
  • 47.
    Podobas, Artur
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Drozd, Alexandr
    Riken, Center for Computational Science, Japan.
    Deverux, Barry
    Queen's University, Belfast, United Kingdom.
    Schuman, Catherine
    University of Tennessee, United States.
    The First International Workshop on COmputing using EmeRging EXotic AI-Inspired Systems (CORtEX'22)2022In: Proceedings of the 36th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022, Institute of Electrical and Electronics Engineers (IEEE) , 2022, p. 1235-1236Conference paper (Other academic)
    Abstract [en]

    Presents the message from the conference chairs.

  • 48.
    Podobas, Artur
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST).
    Sano, Kentaro
    Riken Center for Computational Science, Japan.
    Anderson, Jason
    University of Toronto, Canada.
    The First International Workshop on Coarse-Grained Reconfigurable Architectures for High-Performance Computing (CGRA4HPC)2022In: Proceedings  IEEE International Parallel and Distributed Processing Symposium, IPDPS Workshops 2022, Institute of Electrical and Electronics Engineers (IEEE) , 2022, p. 625-626Conference paper (Other academic)
    Abstract [en]

    Welcome to the First International Workshop on Coarse-Grained Reconfigurable Architectures for High-Performance Computing (CGRA4HPC), held in conjunction with IPDPS 2022.

  • 49.
    Podobas, Artur
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). RIKEN, Ctr Computat Sci, Kobe, Hyogo 6500047, Japan..
    Sano, Kentaro
    RIKEN, Ctr Computat Sci, Kobe, Hyogo 6500047, Japan..
    Matsuoka, Satoshi
    RIKEN, Ctr Computat Sci, Kobe, Hyogo 6500047, Japan.;Tokyo Inst Technol, Dept Math & Comp Sci, Tokyo 1528550, Japan..
    A Survey on Coarse-Grained Reconfigurable Architectures From a Performance Perspective2020In: IEEE Access, E-ISSN 2169-3536, Vol. 8, p. 146719-146743Article in journal (Refereed)
    Abstract [en]

    With the end of both Dennard's scaling and Moore's law, computer users and researchers are aggressively exploring alternative forms of computing in order to continue the performance scaling that we have come to enjoy. Among the more salient and practical of the post-Moore alternatives are reconfigurable systems, with Coarse-Grained Reconfigurable Architectures (CGRAs) seemingly capable of striking a balance between performance and programmability. In this paper, we survey the landscape of CGRAs. We summarize nearly three decades of literature on the subject, with a particular focus on the premise behind the different CGRAs and how they have evolved. Next, we compile metrics of available CGRAs and analyze their performance properties in order to understand and discover knowledge gaps and opportunities for future CGRA research specialized towards High-Performance Computing (HPC). We find that there are ample opportunities for future research on CGRAs, in particular with respect to size, functionality, support for parallel programming models, and to evaluate more complex applications.

  • 50.
    Podobas, Artur
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). RIKEN Ctr Computat Sci R CCS, Kobe, Hyogo, Japan..
    Sano, Kentaro
    RIKEN Ctr Computat Sci R CCS, Kobe, Hyogo, Japan..
    Matsuoka, Satoshi
    RIKEN Ctr Computat Sci R CCS, Kobe, Hyogo, Japan.;Tokyo Inst Technol, Tokyo, Japan..
    A template-based framework for exploring coarse-grained reconfigurable architectures2020In: Proceedings 31st IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP) / [ed] Hannig, F Navaridas, J Koch, D Abdelhadi, A, Institute of Electrical and Electronics Engineers (IEEE) , 2020, p. 1-8Conference paper (Refereed)
    Abstract [en]

    Coarse-Grained Reconfigurable Architectures (CGRAs) are being considered as a complementary addition to modern High-Performance Computing (HPC) systems. These reconfigurable devices overcome many of the limitations of the (more popular) FPGA, by providing higher operating frequency, denser compute capacity, and lower power consumption. Today, CGRAs have been used in several embedded applications, including automobile, telecommunication, and mobile systems, but the literature on CGRAs in HPC is sparse and the field full of research opportunities. In this work, we introduce our CGRA simulator infrastructure for use in evaluating future HPC CGRA systems. Our CGRA simulator is built on synthesizable VHDL and is highly parametrizable, including support for connectivity, SIMD, data-type width, and heterogeneity. Unlike other related work, our framework supports co-integration with third-party memory simulators or evaluation of future memory architecture, which is crucial for reasoning about memory-bound applications. We demonstrate how our framework can be used to explore the performance of multiple different kernels, showing the impact of different configuration and design-space options.
