Publications (10 of 51)
He, Y., Podobas, A. & Markidis, S. (2024). Leveraging MLIR for Loop Vectorization and GPU Porting of FFT Libraries. In: Euro-Par 2023: Parallel Processing Workshops - Euro-Par 2023 International Workshops, Limassol, Cyprus, August 28 – September 1, 2023, Revised Selected Papers. Paper presented at International workshops held at the 29th International Conference on Parallel and Distributed Computing, Euro-Par 2023, Limassol, Cyprus, Aug 28 2023 - Sep 1 2023 (pp. 207-218). Springer Science and Business Media Deutschland GmbH, 14351
2024 (English). In: Euro-Par 2023: Parallel Processing Workshops - Euro-Par 2023 International Workshops, Limassol, Cyprus, August 28 – September 1, 2023, Revised Selected Papers, Springer Science and Business Media Deutschland GmbH, 2024, Vol. 14351, p. 207-218. Conference paper, Published paper (Refereed)
Abstract [en]

FFTc is a Domain-Specific Language (DSL) for designing and generating Fast Fourier Transform (FFT) libraries. What makes FFTc unique is that it leverages and extends Multi-Level Intermediate Representation (MLIR) dialects to optimize FFT code generation. In this work, we present FFTc extensions and improvements such as support for different data layouts for complex-valued arrays, sparsification to enable efficient vectorization, and seamless porting of FFT libraries to GPU systems. We show that, on CPUs, thanks to vectorization, the performance of the FFTc-generated FFT is comparable to that of FFTW, a state-of-the-art FFT library. We also present initial performance results for FFTc on Nvidia GPUs.

Place, publisher, year, edition, pages
Springer Science and Business Media Deutschland GmbH, 2024
Series
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), ISSN 0302-9743 ; 14351
Keywords
Automatic Loop Vectorization, FFTc, GPU Porting, LLVM, MLIR
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-346538 (URN); 10.1007/978-3-031-50684-0_16 (DOI); 2-s2.0-85192276218 (Scopus ID)
Conference
International workshops held at the 29th International Conference on Parallel and Distributed Computing, Euro-Par 2023, Limassol, Cyprus, Aug 28 2023 - Sep 1 2023
Note

QC 20240521

Available from: 2024-05-16 Created: 2024-05-16 Last updated: 2024-05-21. Bibliographically approved
Jansson, N., Karp, M., Podobas, A., Markidis, S. & Schlatter, P. (2024). Neko: A modern, portable, and scalable framework for high-fidelity computational fluid dynamics. Computers & Fluids, 275, 106243-106243, Article ID 106243.
2024 (English). In: Computers & Fluids, ISSN 0045-7930, E-ISSN 1879-0747, Vol. 275, p. 106243-106243, article id 106243. Article in journal (Refereed) Published
Abstract [en]

Computational fluid dynamics (CFD), in particular applied to turbulent flows, is a research area with great engineering and fundamental physical interest. However, already at moderately high Reynolds numbers the computational cost becomes prohibitive as the range of active spatial and temporal scales is quickly widening. Specifically, scale-resolving simulations, including large-eddy simulation (LES) and direct numerical simulation (DNS), thus need to rely on modern efficient numerical methods and corresponding software implementations. Recent trends and advancements, including more diverse and heterogeneous hardware in High-Performance Computing (HPC), are challenging software developers in their pursuit of good performance and numerical stability. The well-known maxim “software outlives hardware” may no longer necessarily hold true, and developers are today forced to refactor their codebases to leverage these powerful new systems. In this paper, we present Neko, a new portable framework for high-order spectral element discretization, targeting turbulent flows in moderately complex geometries. Neko is fully available as open software. Unlike prior works, Neko adopts a modern object-oriented approach in Fortran 2008, allowing multi-tier abstractions of the solver stack and facilitating hardware backends ranging from general-purpose processors (CPUs) down to exotic vector processors and FPGAs. We show that Neko’s performance and accuracy are comparable to NekRS, and thus on par with Nek5000’s successor on modern CPU machines. Furthermore, we develop a performance model, which we use to discuss challenges and opportunities for high-order solvers on emerging hardware.

Place, publisher, year, edition, pages
Elsevier BV, 2024
National Category
Fluid Mechanics and Acoustics; Computational Mathematics; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-344896 (URN); 10.1016/j.compfluid.2024.106243 (DOI); 2-s2.0-85189508362 (Scopus ID)
Funder
Swedish Research Council, 2019-04723; EU, Horizon 2020, 823691; EU, Horizon 2020, 801039
Note

QC 20240403

Available from: 2024-04-02 Created: 2024-04-02 Last updated: 2024-04-22. Bibliographically approved
Domke, J., Vatai, E., Gerofi, B., Kodama, Y., Wahib, M., Podobas, A., . . . Matsuoka, S. (2023). At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads. ACM Transactions on Architecture and Code Optimization (TACO), 20(4), Article ID 57.
2023 (English). In: ACM Transactions on Architecture and Code Optimization (TACO), ISSN 1544-3566, E-ISSN 1544-3973, Vol. 20, no 4, article id 57. Article in journal (Refereed) Published
Abstract [en]

Over the last three decades, innovations in the memory subsystem were primarily targeted at overcoming the data movement bottleneck. In this paper, we focus on a specific market trend in memory technology: 3D-stacked memory and caches. We investigate the impact of extending the on-chip memory capabilities in future HPC-focused processors, particularly by 3D-stacked SRAM. First, we propose a method oblivious to the memory subsystem to gauge the upper-bound in performance improvements when data movement costs are eliminated. Then, using the gem5 simulator, we model two variants of a hypothetical LARge Cache processor (LARC), fabricated in 1.5 nm and enriched with high-capacity 3D-stacked cache. With a volume of experiments involving a broad set of proxy-applications and benchmarks, we aim to reveal how HPC CPU performance will evolve, and conclude an average boost of 9.56× for cache-sensitive HPC applications, on a per-chip basis. Additionally, we exhaustively document our methodological exploration to motivate HPC centers to drive their own technological agenda through enhanced co-design.

Place, publisher, year, edition, pages
Association for Computing Machinery, 2023
Keywords
3D-stacked memory, Emerging architecture study, gem5 simulation, proxy-applications
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-342395 (URN); 10.1145/3629520 (DOI); 001153375300012 (); 2-s2.0-85181487217 (Scopus ID)
Note

QC 20240118

Available from: 2024-01-17 Created: 2024-01-17 Last updated: 2024-02-26. Bibliographically approved
Andersson, M., Natarajan Arul, M., Podobas, A. & Markidis, S. (2023). Breaking Down the Parallel Performance of GROMACS, a High-Performance Molecular Dynamics Software. In: PPAM 2022. Lecture Notes in Computer Science, vol 13826. Paper presented at PPAM 14th INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING AND APPLIED MATHEMATICS (pp. 333-345). Springer Nature
2023 (English). In: PPAM 2022. Lecture Notes in Computer Science, vol 13826, Springer Nature, 2023, p. 333-345. Conference paper, Published paper (Refereed)
Abstract [en]

GROMACS is one of the most widely used HPC software packages using the Molecular Dynamics (MD) simulation technique. In this work, we quantify GROMACS parallel performance using different configurations, HPC systems, and FFT libraries (FFTW, Intel MKL FFT, and FFTPACK). We break down the cost of each GROMACS computational phase and identify non-scalable stages, such as MPI communication during the 3D FFT computation when using a large number of processes. We show that the Particle-Mesh Ewald phase and the 3D FFT calculation significantly impact GROMACS performance. Finally, we discuss performance opportunities with a particular interest in developing GROMACS for the FFT calculations.

Place, publisher, year, edition, pages
Springer Nature, 2023
Series
Lecture Notes in Computer Science ; 13826
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-326454 (URN); 10.1007/978-3-031-30442-2_25 (DOI)
Conference
PPAM 14th INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING AND APPLIED MATHEMATICS
Note

QC 20230515

Available from: 2023-05-02 Created: 2023-05-02 Last updated: 2023-05-22. Bibliographically approved
He, Y., Podobas, A., Andersson, M. & Markidis, S. (2023). FFTc: An MLIR Dialect for Developing HPC Fast Fourier Transform Libraries. In: Euro-Par 2022: Parallel Processing Workshops: Euro-Par 2022 International Workshops, Glasgow, UK, August 22–26, 2022, Revised Selected Papers. Paper presented at Euro-Par 2022 International Workshops, Glasgow, UK, August 22–26, 2022 (pp. 80-92).
2023 (English). In: Euro-Par 2022: Parallel Processing Workshops: Euro-Par 2022 International Workshops, Glasgow, UK, August 22–26, 2022, Revised Selected Papers, 2023, p. 80-92. Conference paper, Published paper (Refereed)
Abstract [en]

Discrete Fourier Transform (DFT) libraries are one of the most critical software components for scientific computing. Inspired by FFTW, a widely used library for DFT HPC calculations, we apply compiler technologies for the development of HPC Fourier transform libraries. In this work, we introduce FFTc, a domain-specific language, based on Multi-Level Intermediate Representation (MLIR), for expressing Fourier Transform algorithms. We present the initial design, implementation, and preliminary results of FFTc.

National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-327027 (URN); 10.1007/978-3-031-31209-0_6 (DOI); 2-s2.0-85161396397 (Scopus ID)
Conference
Euro-Par 2022 International Workshops, Glasgow, UK, August 22–26, 2022
Note

QC 20231122

Available from: 2023-05-17 Created: 2023-05-17 Last updated: 2023-11-22. Bibliographically approved
Chien, S. W. .., Sato, K., Podobas, A., Jansson, N., Markidis, S. & Honda, M. (2023). Improving Cloud Storage Network Bandwidth Utilization of Scientific Applications. In: Proceedings of the 7th Asia-Pacific Workshop on Networking, APNET 2023. Paper presented at 7th Asia-Pacific Workshop on Networking, APNET 2023, Jun 29 - Jun 30 2023, Hong Kong, China (pp. 172-173). Association for Computing Machinery (ACM)
2023 (English). In: Proceedings of the 7th Asia-Pacific Workshop on Networking, APNET 2023, Association for Computing Machinery (ACM), 2023, p. 172-173. Conference paper, Published paper (Refereed)
Abstract [en]

Cloud providers have begun to offer managed services to attract scientific applications, which have traditionally been executed on supercomputers. One example is AWS FSx for Lustre, a fully managed parallel file system (PFS) released in 2018. However, due to the nature of scientific applications, the frontend storage network bandwidth is left completely idle for the majority of its lifetime. Furthermore, the pricing model does not match the scalability requirement. We propose iFast, a novel host-side caching mechanism for scientific applications that improves storage bandwidth utilization and end-to-end application performance by overlapping compute and data writeback through inexpensive local storage. iFast supports the Message Passing Interface (MPI) library that is widely used by scientific applications and is implemented as a preloaded library. It requires no change to applications or the MPI library, and no support from cloud operators. We demonstrate how iFast can reduce the end-to-end time of a representative scientific application, Neko, by 13-40%.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2023
National Category
Computer Systems; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-338993 (URN); 10.1145/3600061.3603122 (DOI); 001147804500029 (); 2-s2.0-85173833099 (Scopus ID)
Conference
7th Asia-Pacific Workshop on Networking, APNET 2023, Jun 29 - Jun 30 2023, Hong Kong, China
Note

Part of ISBN 9798400707827

QC 20231101

Available from: 2023-11-01 Created: 2023-11-01 Last updated: 2024-02-27. Bibliographically approved
Adhi, B., Cortes, C., Sozzo, E. D., Ueno, T., Tan, Y., Kojima, T., . . . Sano, K. (2023). Less for more: reducing intra-CGRA connectivity for higher performance and efficiency in HPC. In: 2023 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2023. Paper presented at 2023 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2023, St. Petersburg, United States of America, May 15 2023 - May 19 2023 (pp. 452-459). Institute of Electrical and Electronics Engineers (IEEE)
2023 (English). In: 2023 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2023, Institute of Electrical and Electronics Engineers (IEEE), 2023, p. 452-459. Conference paper, Published paper (Refereed)
Abstract [en]

Coarse-Grained Reconfigurable Arrays (CGRAs) are a class of reconfigurable architectures that inherit the performance of domain-specific accelerators and the reconfigurability aspects of Field-Programmable Gate Arrays (FPGAs). Historically, CGRAs have been successfully used to accelerate embedded applications and are now considered to accelerate High-Performance Computing (HPC) applications in future supercomputers. However, embedded systems and supercomputers are two vastly different domains with different applications and constraints, and it is today not fully understood what CGRA design decisions adequately cater to the HPC market. One such unknown design decision is regarding the interconnect that facilitates intra-CGRA communication. Our findings show that even the typical king-style mesh-like topology is often under-utilized with a typical HPC workload, leading to inefficiency. This research aims to explore the provisioning of intra-CGRA interconnect for HPC-oriented workloads and, ultimately, recoup the potential performance and efficiency lost by reducing the interconnect complexity. We propose several reduced interconnect topologies based on the usage statistics. We then evaluate the tradeoffs regarding hardware cost, routability of DFGs, and computational throughput.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Keywords
CGRA, Design space exploration, HPC, Routing architecture, RTL simulation
National Category
Embedded Systems
Identifiers
urn:nbn:se:kth:diva-336739 (URN); 10.1109/IPDPSW59300.2023.00077 (DOI); 001055030700056 (); 2-s2.0-85169299919 (Scopus ID)
Conference
2023 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2023, St. Petersburg, United States of America, May 15 2023 - May 19 2023
Note

Part of ISBN 9798350311990 

QC 20230919

Available from: 2023-09-19 Created: 2023-09-19 Last updated: 2023-10-02. Bibliographically approved
Podobas, A. (2023). Q2Logic: A Coarse-Grained FPGA Overlay targeting Schrödinger Quantum Circuit Simulations. In: 2023 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2023. Paper presented at 2023 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2023, St. Petersburg, United States of America, May 15 2023 - May 19 2023 (pp. 460-467). Institute of Electrical and Electronics Engineers (IEEE)
2023 (English). In: 2023 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2023, Institute of Electrical and Electronics Engineers (IEEE), 2023, p. 460-467. Conference paper, Published paper (Refereed)
Abstract [en]

Quantum computing is emerging as an important (but radical) technology that might take us beyond Moore's law for certain applications. Today, in parallel with improving quantum computers, computer scientists are relying heavily on quantum circuit simulators to develop algorithms. Most existing quantum circuit simulators run on general-purpose CPUs or GPUs. However, at the same time, quantum circuits themselves offer multiple opportunities for parallelization, some of which could map better to other architectures such as reconfigurable systems. In this early work, we created a quantum circuit simulator system called Q2Logic. Q2Logic is a coarse-grained reconfigurable architecture (CGRA) implemented as an overlay on Field-Programmable Gate Arrays (FPGAs), but specialized towards quantum simulations. We describe how Q2Logic has been created and reveal implementation details, limitations, and opportunities. We end the study by empirically comparing the performance of Q2Logic (running on an Intel Agilex FPGA) against the state-of-the-art framework SVSim (running on a modern processor), showing improvements in three large circuits (#qbit ≥ 27), where Q2Logic can be up to 7x faster.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Keywords
CGRA, FPGA, Overlay, Quantum Simulation, State Vector
National Category
Computer Engineering; Embedded Systems
Identifiers
urn:nbn:se:kth:diva-336740 (URN); 10.1109/IPDPSW59300.2023.00078 (DOI); 001055030700057 (); 2-s2.0-85169289891 (Scopus ID)
Conference
2023 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2023, St. Petersburg, United States of America, May 15 2023 - May 19 2023
Note

Part of ISBN 9798350311990

QC 20230919

Available from: 2023-09-19 Created: 2023-09-19 Last updated: 2023-10-02. Bibliographically approved
Flatken, M., Podobas, A., Chien, W. D., Markidis, S., Gerndt, A., et al. (2023). VESTEC: Visual Exploration and Sampling Toolkit for Extreme Computing. IEEE Access, 11, 87805-87834
2023 (English). In: IEEE Access, E-ISSN 2169-3536, Vol. 11, p. 87805-87834. Article in journal (Refereed) Published
Abstract [en]

Natural disasters and epidemics are unfortunate recurring events that lead to huge societal and economic loss. Recent advances in supercomputing can facilitate simulations of such scenarios in (or even ahead of) real-time, therefore supporting the design of adequate responses by public authorities. By incorporating high-velocity data from sensors and modern high-performance computing systems, ensembles of simulations and advanced analysis enable urgent decision-makers to better monitor the disaster and to take necessary actions (e.g., to evacuate populated areas) for mitigating these events. Unfortunately, frameworks to support such versatile and complex workflows for urgent decision-making are only rarely available and often lack functionality. This paper gives an overview of the VESTEC project and framework, which unifies orchestration, simulation, in-situ data analysis, and visualization of natural disasters that can be driven by external sensor data or interactive intervention by the user. We show how different components interact and work together in VESTEC and describe implementation details. To disseminate our experience, three different types of disasters are evaluated: a wildfire in La Jonquera (Spain), a mosquito-borne disease in two regions of Italy, and magnetic reconnection in the Earth's magnetosphere.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Keywords
decision making, ensemble simulation, high-performance computing, in-situ processing, interactive data processing, Scientific visualization, topological data analysis
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-338517 (URN); 10.1109/ACCESS.2023.3301177 (DOI); 001093869300001 (); 2-s2.0-85166747303 (Scopus ID)
Note

QC 20231114

Available from: 2023-11-14 Created: 2023-11-14 Last updated: 2024-02-29. Bibliographically approved
Karp, M., Podobas, A., Kenter, T., Jansson, N., Plessl, C., Schlatter, P. & Markidis, S. (2022). A High-Fidelity Flow Solver for Unstructured Meshes on Field-Programmable Gate Arrays: Design, Evaluation, and Future Challenges. In: HPCAsia2022: International Conference on High Performance Computing in Asia-Pacific Region. Paper presented at HPCAsia2022: International Conference on High Performance Computing in Asia-Pacific Region (pp. 125-136). Association for Computing Machinery (ACM)
2022 (English). In: HPCAsia2022: International Conference on High Performance Computing in Asia-Pacific Region, Association for Computing Machinery (ACM), 2022, p. 125-136. Conference paper, Published paper (Refereed)
Abstract [en]

The impending termination of Moore’s law motivates the search for new forms of computing to continue the performance scaling we have grown accustomed to. Among the many emerging Post-Moore computing candidates, perhaps none is as salient as the Field-Programmable Gate Array (FPGA), which offers the means of specializing and customizing the hardware to the computation at hand.

In this work, we design a custom FPGA-based accelerator for a computational fluid dynamics (CFD) code. Unlike prior work, which often focuses on accelerating small kernels, we target the entire Poisson solver on unstructured meshes based on the high-fidelity spectral element method (SEM) used in modern state-of-the-art CFD systems. We model our accelerator using an analytical performance model based on the I/O cost of the algorithm. We empirically evaluate our accelerator on a state-of-the-art Intel Stratix 10 FPGA in terms of performance and power consumption and contrast it against existing solutions on general-purpose processors (CPUs). Finally, we propose a data movement-reducing technique where we compute geometric factors on the fly, which yields significant (700+ Gflop/s) single-precision performance and upwards of a 2x reduction in runtime for the local evaluation of the Laplace operator.

We end the paper by discussing the challenges and opportunities of using reconfigurable architecture in the future, particularly in the light of emerging (not yet available) technologies.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2022
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-309190 (URN); 10.1145/3492805.3492808 (DOI); 2-s2.0-85122641610 (Scopus ID)
Conference
HPCAsia2022: International Conference on High Performance Computing in Asia-Pacific Region
Note

QC 20220223

Available from: 2022-02-22 Created: 2022-02-22 Last updated: 2024-04-22. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0001-5452-6794