kth.sePublications
Change search
Link to record
Permanent link

Direct link
Publications (10 of 12) Show all publications
Williams, J. J., Liu, F., Trilaksono, J., Tskhakaya, D., Costea, S., Kos, L., . . . Markidis, S. (2025). Accelerating Particle-in-Cell Monte Carlo Simulations with MPI, OpenMP/OpenACC and Asynchronous Multi-GPU Programming. Journal of Computational Science, 88, Article ID 102590.
Open this publication in new window or tab >>Accelerating Particle-in-Cell Monte Carlo Simulations with MPI, OpenMP/OpenACC and Asynchronous Multi-GPU Programming
Show others...
2025 (English)In: Journal of Computational Science, ISSN 1877-7503, E-ISSN 1877-7511, Vol. 88, article id 102590Article in journal (Refereed) Published
Abstract [en]

As fusion energy devices advance, plasma simulations play a critical role in fusion reactor design. Particle-in-Cell Monte Carlo simulations are essential for modelling plasma-material interactions and analysing power load distributions on tokamak divertors. Previous work introduced hybrid parallelization in BIT1 using MPI and OpenMP/OpenACC for shared-memory and multicore CPU processing. In this extended work, we integrate MPI with OpenMP and OpenACC, focusing on asynchronous multi-GPU programming with OpenMP Target Tasks using the "nowait" and "depend" clauses, and OpenACC Parallel with the "async(n)" clause. Our results show significant performance improvements: 16 MPI ranks plus OpenMP threads reduced simulation runtime by 53% on a petascale EuroHPC supercomputer, while the OpenACC multicore implementation achieved a 58% reduction compared to the MPI-only version. Scaling to 64 MPI ranks, OpenACC outperformed OpenMP, achieving a 24% improvement in the particle mover function. On the HPE Cray EX supercomputer, OpenMP and OpenACC consistently reduced simulation times, with a 37% reduction at 100 nodes. Results from MareNostrum 5, a pre-exascale EuroHPC supercomputer, highlight OpenACC's effectiveness, with the "async(n)" configuration delivering notable performance gains. However, OpenMP asynchronous configurations outperform OpenACC at larger node counts, particularly for extreme scaling runs. As BIT1 scales asynchronously to 128 GPUs, OpenMP asynchronous multi-GPU configurations outperformed OpenACC in runtime, demonstrating superior scalability, which continues up to 400 GPUs, further improving runtime. Speedup and parallel efficiency (PE) studies reveal OpenMP asynchronous multi-GPU achieving an 8.77x speedup (54.81% PE) and OpenACC achieving an 8.14x speedup (50.87% PE) on MareNostrum 5, surpassing the CPU-only version. At higher node counts, PE declined across all implementations due to communication and synchronization costs. However, the asynchronous multi-GPU versions maintained better PE, demonstrating the benefits of asynchronous multi-GPU execution in reducing scalability bottlenecks. While the CPU-only implementation is faster in some cases, OpenMP's asynchronous multi-GPU approach delivers better GPU performance through asynchronous data transfer and task dependencies, ensuring data consistency and avoiding race conditions. Using NVIDIA Nsight tools, we confirmed BIT1's overall efficiency for large-scale plasma simulations, leveraging current and future exascale supercomputing infrastructures. Asynchronous data transfers and dedicated GPU assignments to MPI ranks enhance performance, with OpenMP’s asynchronous multi-GPU implementation utilizing OpenMP Target Tasks with "nowait" and "depend" clauses outperforming other configurations. This makes OpenMP the preferred application programming interface when performance portability, high throughput, and efficient GPU utilization are critical. This enables BIT1 to fully exploit modern supercomputing architectures, advancing fusion energy research. MareNostrum 5 brings us closer to achieving exascale performance.

Place, publisher, year, edition, pages
Netherlands: Elsevier BV, 2025
Keywords
Hybrid Programming, OpenMP, Task-Based Parallelism, Dependency Management, OpenACC, Asynchronous Execution, Multi-GPU Offloading, Overlapping Kernels, Large-Scale PIC Simulations
National Category
Computer Systems Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-362742 (URN)10.1016/j.jocs.2025.102590 (DOI)2-s2.0-105003577843 (Scopus ID)
Funder
Swedish Research Council, 2022-06725KTH Royal Institute of Technology, 101093261
Note

QC 20250425

Available from: 2025-04-24 Created: 2025-04-24 Last updated: 2025-05-27Bibliographically approved
Andersson, M., Liu, F. & Markidis, S. (2024). Anderson Accelerated PMHSS for Complex-Symmetric Linear Systems. In: 2024 SIAM Conference on Parallel Processing for Scientific Computing, PP 2024: . Paper presented at 22nd SIAM Conference on Parallel Processing for Scientific Computing, PP 2024, Baltimore, United States of America, Mar 5 2024 - Mar 8 2024 (pp. 39-51). Society for Industrial and Applied Mathematics Publications
Open this publication in new window or tab >>Anderson Accelerated PMHSS for Complex-Symmetric Linear Systems
2024 (English)In: 2024 SIAM Conference on Parallel Processing for Scientific Computing, PP 2024, Society for Industrial and Applied Mathematics Publications , 2024, p. 39-51Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents the design and development of an Anderson Accelerated Preconditioned Modified Hermitian and Skew-Hermitian Splitting (AA-PMHSS) method for solving complex-symmetric linear systems with application to electromagnetics problems, such as wave scattering and eddy currents. While it has been shown that the Anderson acceleration of real linear systems is essentially equivalent to GMRES, we show here that the formulation using Anderson acceleration leads to a more performant method. We show relatively good robustness compared to existing preconditioned GMRES methods and significantly better performance due to the faster evaluation of the preconditioner. In particular, AA-PMHSS can be applied to solve problems and equations arising from complex-valued systems, such as time-harmonic eddy current simulations discretized with the Finite Element Method. We also evaluate three test systems present in previous literature. We show that the method is competitive with two types of preconditioned GMRES, which share the significant advantage of having a convergence rate that is independent of the discretization size.

Place, publisher, year, edition, pages
Society for Industrial and Applied Mathematics Publications, 2024
National Category
Computational Mathematics
Identifiers
urn:nbn:se:kth:diva-347317 (URN)001282154500004 ()2-s2.0-85194149187 (Scopus ID)
Conference
22nd SIAM Conference on Parallel Processing for Scientific Computing, PP 2024, Baltimore, United States of America, Mar 5 2024 - Mar 8 2024
Note

QC 20240612

Part of ISBN 978-171389347-9

Available from: 2024-06-10 Created: 2024-06-10 Last updated: 2024-09-10Bibliographically approved
Liu, F. (2024). High Performance Computing for the Optimization of Radiation Therapy Treatment Plans. (Doctoral dissertation). Stockholm: KTH Royal Institute of Technology
Open this publication in new window or tab >>High Performance Computing for the Optimization of Radiation Therapy Treatment Plans
2024 (English)Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Radiation therapy is a clinical field in which computer simulations play a crucial role. Before patients undergo radiation therapy, an individual treatment plan for each patient needs to be created based on the specifics of their case (a process often referred to as treatment planning). The main aspect of this is setting the control parameters for the treatment machine so that the radioactive dose delivered to the patient is as concentrated to the tumor volume as possible. The inverse problem of determining such appropriate control parameters is typically formulated as an optimization problem, which, considering the complexity of modern treatment machines, requires computerized algorithms to solve accurately. Solving this optimization problem can be a key computational bottleneck in the treatment planning workflow. In many cases, finding a suitable treatment plan is a trial-and-error process, requiring multiple solutions of the optimization problems with different weights and parameters.

This thesis proposes different methods to enable the use of high performance computing (HPC) hardware to accelerate the optimization process in radiation therapy. We deal with two main computational aspects of the optimization workflow, the calculation of dose, gradients and objective functions; as well as the optimization solver itself. For dose calculation during optimization, we propose a CUDA kernel for sparse matrix-vector products tuned to dose deposition matrices from proton therapy. For the evaluation of the objective function -- which is often constructed as a weighted sum -- we develop a method to distribute the calculation of the objective function and its gradient across computational nodes using message passing.

For the optimization solver itself we first propose a task-based parallel implementation for Cholesky factorization of banded matrices. This can be an important kernel in interior point methods (IPM) for optimization, depending on the structure of the optimization constraints. The final two papers in this thesis deal with the adoption of iterative linear algebra in IPMs to enable GPU acceleration of the optimization solve. We develop an IPM using a doubly augmented formulation of the KKT linear system and Jacobi preconditioned conjugate gradient method, which we show can solve problems to acceptable accuracy as well as benefit substantially from GPU acceleration. Benchmarks against the commercial treatment planning system RayStation shows that our solver can improve performance by up to 4.4 times on some realistic cases.

Abstract [sv]

Strålterapi är en klinisk gren där datorsimuleringar spelar en viktig roll. Innan patienters strålbehandlingar påbörjas behövs individuella behandlingsplaner skapas baserat på varje patientfalls specifika omständigheter (en process som ofta kallas dosplanering (eng. treament planning)). Huvudaspekten av denna process är att bestämma styrparametrar för behandlingsmaskinen så att stråldosen som levereras till patienten är så koncentrerad till tumören som möjligt. Det inversa problemet att bestämma sådana styrparametrar formuleras typiskt som ett optimeringsproblem, vilket, givet hur komplexa moderna behandlingsmaskiner är, kräver datoralgoritmer för att lösas precist. Att lösa detta optimeringsproblem kan vara en viktig flaskhals i dosplaneringsprocessen. I många fall är det en iterativ process att hitta en lämplig behandlingsplan som kräver att man testar och löser optimeringsproblemet med flera olika vikter och parametrar.

I denna avhandling studerar vi metoder för att möjliggöra användandet av hårdvara från högprestandaberäkningar för att accelerera optimeringsprocessen i strålterapi. Vi angriper två olika beräkningsaspekter i lösningen av optimeringsproblemet: beräkningen av stråldos, gradienter och målfunktioner, samt lösningen av optimeringsproblemet i sig. För dosberäkningen som krävs under optimeringen utvecklar vi en beräkningsmetod i CUDA för matris-vektor produkter med glesa matriser som är specifikt anpassad för dosmatriser (eng. dose deposition matrices) från protonterapi. För beräkningen av målfunktionen --- som i många fall är formulerad som en viktad summa --- utvecklar vi en metod för att distribuera beräkningen av målfunktionen och dess gradient över flera beräkningsnoder genom message passing.

För optimeringslösaren själv så utvecklar vi först en parallel implementation av Cholesky-faktorisering för bandade matriser med parallelisering baserad på tasks. Just den beräkningsalgoritmen kan vara viktig för många inrepunktsmetoder för optimering, baserat på strukturen av bivillkoren i optimeringsproblemet. De sista två publikationerna i avhandlingen studerar användandet av iterativa linjära lösare i inrepunktsmetoder för att möjliggöra GPU-accelerering av av optimeringslösaren. Vi utvecklar en inrepunksmetod med en dubbelaugmenterad formulering av KKT-systemet och en Jacobi-förkonditionerad CG-lösare (eng. conjugate gradient method). Vi demonstrerar att våran metod kan lösa optimeringsproblemen till tillräcklig noggranhet och dessutom ha stor nytta av GPU-acceleration. Prestandajämförelser med det kommersiella dosplaneringssystemet RayStation visar att våran lösare kan förbättra lösningstiden med upp till 4.4 gånger på några realistiska patientfall.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2024. p. 73
Series
TRITA-EECS-AVL ; 2024:35
Keywords
HPC, Radiation Therapy, Optimization, Numerical Linear Algebra, Interior Point Methods
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-345836 (URN)978-91-8040-907-0 (ISBN)
Public defence
2024-05-22, https://kth-se.zoom.us/j/69603435226, F3, Lindstedtsvägen 26, Stockholm, 14:00 (English)
Opponent
Supervisors
Note

QC 20140423

Available from: 2024-04-23 Created: 2024-04-22 Last updated: 2024-04-29Bibliographically approved
Liu, F., Fredriksson, A. & Markidis, S. (2024). Krylov Solvers for Interior Point Methods with Applications in Radiation Therapy and Support Vector Machines. In: Computational Science – ICCS 2024 - 24th International Conference, 2024, Proceedings: . Paper presented at 24th International Conference on Computational Science, ICCS 2024, Malaga, Spain, Jul 2 2024 - Jul 4 2024 (pp. 63-77). Springer Nature
Open this publication in new window or tab >>Krylov Solvers for Interior Point Methods with Applications in Radiation Therapy and Support Vector Machines
2024 (English)In: Computational Science – ICCS 2024 - 24th International Conference, 2024, Proceedings, Springer Nature , 2024, p. 63-77Conference paper, Published paper (Refereed)
Abstract [en]

Interior point methods are widely used for different types of mathematical optimization problems. Many implementations of interior point methods in use today rely on direct linear solvers to solve systems of equations in each iteration. The need to solve ever larger optimization problems more efficiently and the rise of hardware accelerators for general purpose computing has led to a large interest in using iterative linear solvers instead, with the major issue being inevitable ill-conditioning of the linear systems arising as the optimization progresses. We investigate the use of Krylov solvers for interior point methods in solving optimization problems from radiation therapy and support vector machines. We implement a prototype interior point method using a so called doubly augmented formulation of the Karush-Kuhn-Tucker linear system of equations, originally proposed by Forsgren and Gill, and evaluate its performance on real optimization problems from radiation therapy and support vector machines. Crucially, our implementation uses a preconditioned conjugate gradient method with Jacobi preconditioning internally. Our measurements of the conditioning of the linear systems indicate that the Jacobi preconditioner improves the conditioning of the systems to a degree that they can be solved iteratively, but there is room for further improvement in that regard. Furthermore, profiling of our prototype code shows that it is suitable for GPU acceleration, which may further improve its performance in practice. Overall, our results indicate that our method can find solutions of acceptable accuracy in reasonable time, even with a simple Jacobi preconditioner.

Place, publisher, year, edition, pages
Springer Nature, 2024
Keywords
Interior point method, Krylov solver, Radiation therapy, Support Vector Machines
National Category
Computational Mathematics
Identifiers
urn:nbn:se:kth:diva-351764 (URN)10.1007/978-3-031-63749-0_5 (DOI)001279316700005 ()2-s2.0-85199601918 (Scopus ID)
Conference
24th International Conference on Computational Science, ICCS 2024, Malaga, Spain, Jul 2 2024 - Jul 4 2024
Note

Part of ISBN [9783031637483]

QC 20240820

Available from: 2024-08-13 Created: 2024-08-13 Last updated: 2024-09-10Bibliographically approved
Williams, J. J., Liu, F., Tskhakaya, D., Costea, S., Podolnik, A. & Markidis, S. (2024). Optimizing BIT1, a Particle-in-Cell Monte Carlo Code, with OpenMP/OpenACC and GPU Acceleration. In: 24th International Conference on Computational Science, Málaga, Spain, July 2-4, Part I, LNCS 14832: . Paper presented at 24th International Conference on Computational Science, Málaga, Spain, July 2-4. Springer Nature
Open this publication in new window or tab >>Optimizing BIT1, a Particle-in-Cell Monte Carlo Code, with OpenMP/OpenACC and GPU Acceleration
Show others...
2024 (English)In: 24th International Conference on Computational Science, Málaga, Spain, July 2-4, Part I, LNCS 14832, Springer Nature, 2024Conference paper, Published paper (Refereed)
Abstract [en]

On the path toward developing the first fusion energy devices, plasma simulations have become indispensable tools for supporting the design and development of fusion machines. Among these critical simulation tools, BIT1 is an advanced Particle-in-Cell code with Monte Carlo collisions, specifically designed for modeling plasma-material interaction and, in particular, analyzing the power load distribution on tokamak divertors. The current implementation of BIT1 relies exclusively on MPI for parallel communication and lacks support for GPUs. In this work, we address these limitations by designing and implementing a hybrid, shared-memory version of BIT1 capable of utilizing GPUs. For shared-memory parallelization, we rely on OpenMP and OpenACC, using a task-based approach to mitigate load-imbalance issues in the particle mover. On an HPE Cray EX computing node, we observe an initial performance improvement of approximately 42%, with scalable performance showing an enhancement of about 38% when using 8 MPI ranks. Still relying on OpenMP and OpenACC, we introduce the first version of BIT1 capable of using GPUs. We investigate two different data movement strategies: unified memory and explicit data movement. Overall, we report BIT1 data transfer findings during each PIC cycle. Among BIT1 GPU implementations, we demonstrate performance improvement through concurrent GPU utilization, especially when MPI ranks are assigned to dedicated GPUs. Finally, we analyze the performance of the first BIT1 GPU porting with the NVIDIA Nsight tools to further our understanding of BIT1’s computational efficiency for large-scale plasma simulations, capable of exploiting current supercomputer infrastructures.

Place, publisher, year, edition, pages
Springer Nature, 2024
Keywords
OpenMP, Task-Based Parallelism, OpenACC, Hybrid Programming, GPU Offloading, Large-Scale PIC Simulations
National Category
Computer Sciences Computer Systems Fusion, Plasma and Space Physics
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-346931 (URN)10.1007/978-3-031-63749-0_22 (DOI)001279316700022 ()2-s2.0-85199628891 (Scopus ID)
Conference
24th International Conference on Computational Science, Málaga, Spain, July 2-4
Note

 Part of ISBN 9783031637483

QC 20240529

Available from: 2024-05-24 Created: 2024-05-24 Last updated: 2024-09-10Bibliographically approved
Liu, F., Andersson, M., Fredriksson, A. & Markidis, S. (2023). Distributed Objective Function Evaluation for Optimization of Radiation Therapy Treatment Plans. In: PPAM 2022. Lecture Notes in Computer Science, vol 13826.: . Paper presented at PPAM 2022: Parallel Processing and Applied Mathematics. Springer Nature
Open this publication in new window or tab >>Distributed Objective Function Evaluation for Optimization of Radiation Therapy Treatment Plans
2023 (English)In: PPAM 2022. Lecture Notes in Computer Science, vol 13826., Springer Nature , 2023Conference paper, Published paper (Refereed)
Abstract [en]

The modern workflow for radiation therapy treatment planning involves mathematical optimization to determine optimal treatment machine parameters for each patient case. The optimization problems can be computationally expensive, requiring iterative optimization algorithms to solve. In this work, we investigate a method for distributing the calculation of objective functions and gradients for radiation therapy optimization problems across computational nodes. We test our approach on the TROTS dataset--- which consists of optimization problems from real clinical patient cases---using the IPOPT optimization solver in a leader/follower type approach for parallelization. We show that our approach can utilize multiple computational nodes efficiently, with a speedup of approximately 2-3.5 times compared to the serial version.

Place, publisher, year, edition, pages
Springer Nature, 2023
National Category
Engineering and Technology
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-327020 (URN)10.1007/978-3-031-30442-2_29 (DOI)2-s2.0-85161412381 (Scopus ID)
Conference
PPAM 2022: Parallel Processing and Applied Mathematics
Note

QC 20230523

Available from: 2023-05-17 Created: 2023-05-17 Last updated: 2024-08-28Bibliographically approved
Liu, F., Fredriksson, A. & Markidis, S. (2023). Parallel Cholesky Factorization for Banded Matrices Using OpenMP Tasks. In: Euro-Par 2023: Parallel Processing - 29th International Conference on Parallel and Distributed Computing, Proceedings. Paper presented at 29th International European Conference on Parallel and Distributed Computing, Euro-Par 2023, Limassol, Cyprus, Aug 28 2023 - Sep 1 2023 (pp. 725-739). Springer Nature
Open this publication in new window or tab >>Parallel Cholesky Factorization for Banded Matrices Using OpenMP Tasks
2023 (English)In: Euro-Par 2023: Parallel Processing - 29th International Conference on Parallel and Distributed Computing, Proceedings, Springer Nature , 2023, p. 725-739Conference paper, Published paper (Refereed)
Abstract [en]

Cholesky factorization is a method for solving linear systems involving symmetric, positive-definite matrices, and can be an attractive choice in applications where a high degree of numerical stability is needed. One such application is mathematical optimization, where direct methods for solving linear systems are widely used and often a significant performance bottleneck. An example where this is the case, and the specific type of optimization problem motivating this work, is radiation therapy treatment planning, where mathematical optimization is used to create individual treatment plans for patients. To address this bottleneck, we propose a task-based multi-threaded method for Cholesky factorization of banded matrices with medium-sized bands. We implement our algorithm using OpenMP tasks and compare our performance with state-of-the-art libraries such as Intel MKL. Our performance measurements show a performance that is on par or better than Intel MKL (up to ∼ 26 % on a single CPU socket) for a wide range of matrix bandwidths on two different Intel CPU systems.

Place, publisher, year, edition, pages
Springer Nature, 2023
Keywords
Cholesky factorization, Linear Solver, Task-Based Parallelism
National Category
Computer Systems Computational Mathematics
Identifiers
urn:nbn:se:kth:diva-337877 (URN)10.1007/978-3-031-39698-4_49 (DOI)2-s2.0-85171536146 (Scopus ID)
Conference
29th International European Conference on Parallel and Distributed Computing, Euro-Par 2023, Limassol, Cyprus, Aug 28 2023 - Sep 1 2023
Note

Part of ISBN 9783031396977

QC 20231010

Available from: 2023-10-10 Created: 2023-10-10 Last updated: 2024-04-22Bibliographically approved
Karp, M., Liu, F., Stanly, R., Rezaeiravesh, S., Jansson, N., Schlatter, P. & Markidis, S. (2023). Uncertainty Quantification of Reduced-Precision Time Series in Turbulent Channel Flow. In: Proceedings of 2023 SC Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC Workshops 2023: . Paper presented at 2023 International Conference on High Performance Computing, Network, Storage, and Analysis, SC Workshops 2023, Denver, United States of America, Nov 12 2023 - Nov 17 2023 (pp. 387-390). Association for Computing Machinery (ACM)
Open this publication in new window or tab >>Uncertainty Quantification of Reduced-Precision Time Series in Turbulent Channel Flow
Show others...
2023 (English)In: Proceedings of 2023 SC Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, SC Workshops 2023, Association for Computing Machinery (ACM) , 2023, p. 387-390Conference paper, Published paper (Refereed)
Abstract [en]

With increased computational power through the use of arithmetic in low-precision, a relevant question is how lower precision affects simulation results, especially for chaotic systems where analytical round-off estimates are non-trivial to obtain. In this work, we consider how the uncertainty of the time series of a direct numerical simulation of turbulent channel flow at Ret = 180 is affected when restricted to a reduced-precision representation. We utilize a non-overlapping batch means estimator and find that the mean statistics can, in this case, be obtained with significantly fewer mantissa bits than conventional IEEE-754 double precision, but that the mean values are observed to be more sensitive in the middle of the channel than in the near-wall region. This indicates that using lower precision in the near-wall region, where the majority of the computational efforts are required, may benefit from low-precision floating point units found in upcoming computer hardware.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2023
National Category
Fluid Mechanics
Identifiers
urn:nbn:se:kth:diva-341470 (URN)10.1145/3624062.3624105 (DOI)2-s2.0-85178155242 (Scopus ID)
Conference
2023 International Conference on High Performance Computing, Network, Storage, and Analysis, SC Workshops 2023, Denver, United States of America, Nov 12 2023 - Nov 17 2023
Note

QC 20240109

Part of ISBN 979-840070785-8

Available from: 2024-01-09 Created: 2024-01-09 Last updated: 2025-02-09Bibliographically approved
Liu, F., Fredriksson, A. & Markidis, S. (2022). A survey of HPC algorithms and frameworks for large-scale gradient-based nonlinear optimization. Journal of Supercomputing, 78(16), 17513-17542
Open this publication in new window or tab >>A survey of HPC algorithms and frameworks for large-scale gradient-based nonlinear optimization
2022 (English)In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 78, no 16, p. 17513-17542Article in journal (Refereed) Published
Abstract [en]

Large-scale numerical optimization problems arise from many fields and have applications in both industrial and academic contexts. Finding solutions to such optimization problems efficiently requires algorithms that are able to leverage the increasing parallelism available in modern computing hardware. In this paper, we review previous work on parallelizing algorithms for nonlinear optimization. To introduce the topic, the paper starts by giving an accessible introduction to nonlinear optimization and high-performance computing. This is followed by a survey of previous work on parallelization and utilization of high-performance computing hardware for nonlinear optimization algorithms. Finally, we present a number of optimization software libraries and how they are able to utilize parallel computing today. This study can serve as an introduction point for researchers interested in nonlinear optimization or high-performance computing, as well as provide ideas and inspiration for future work combining these topics. 

Place, publisher, year, edition, pages
Springer Nature, 2022
Keywords
HPC, Interior-point method, Nonlinear optimization, Parallel computing, Nonlinear programming, Computing hardware, Gradient based, Large-scales, Non-linear optimization, Numerical optimizations, Optimization problems, Parallel com- puting, Performance computing, Surveys
National Category
Control Engineering
Identifiers
urn:nbn:se:kth:diva-323843 (URN)10.1007/s11227-022-04555-8 (DOI)000799541400003 ()2-s2.0-85130240510 (Scopus ID)
Note

QC 20230220

Available from: 2023-02-20 Created: 2023-02-20 Last updated: 2024-04-22Bibliographically approved
Liu, F., Jansson, N., Podobas, A., Fredriksson, A. & Markidis, S. (2021). Accelerating Radiation Therapy Dose Calculation with Nvidia GPUs. In: IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW): . Paper presented at 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 17-21 June 2021, Portland, OR, USA. Institute of Electrical and Electronics Engineers (IEEE)
Open this publication in new window or tab >>Accelerating Radiation Therapy Dose Calculation with Nvidia GPUs
Show others...
2021 (English)In: IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Institute of Electrical and Electronics Engineers (IEEE) , 2021Conference paper, Published paper (Refereed)
Abstract [en]

Radiation Treatment Planning (RTP) is the process of planning the appropriate external beam radiotherapy to combat cancer in human patients. RTP is a complex and compute-intensive task, which often takes a long time (several hours) to compute. Reducing this time allows for higher productivity at clinics and more sophisticated treatment planning, which can materialize in better treatments. The state-of-the-art in medical facilities uses general-purpose processors (CPUs) to perform many steps in the RTP process. In this paper, we explore the use of accelerators to reduce RTP calculating time. We focus on the step that calculates the dose using the Graphics Processing Unit (GPU), which we believe is an excellent candidate for this computation type. Next, we create a highly optimized implementation for a custom Sparse Matrix-Vector Multiplication (SpMV) that operates on numerical formats unavailable in state-of-the-art SpMV libraries (e.g., Ginkgo and cuSPARSE). We show that our implementation is several times faster than the baseline (up-to 4x) and has a higher operational intensity than similar (but different) versions such as Ginkgo and cuSPARSE.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2021
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-296824 (URN)10.1109/IPDPSW52791.2021.00076 (DOI)000689576200059 ()2-s2.0-85114391525 (Scopus ID)
Conference
2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 17-21 June 2021, Portland, OR, USA
Note

Part of proceedings: ISBN 978-1-6654-3577-2

QC 20230117

Available from: 2021-06-10 Created: 2021-06-10 Last updated: 2024-04-22Bibliographically approved
Organisations
Identifiers
ORCID iD: ORCID iD iconorcid.org/0000-0001-6865-9379

Search in DiVA

Show all publications