Publications (7 of 7)
Wiesenberger, M., Einkemmer, L., Held, M., Gutierrez-Milla, A., Sáez, X. & Iakymchuk, R. (2019). Reproducibility, accuracy and performance of the FELTOR code and library on parallel computer architectures. Computer Physics Communications, 238, 145-156
2019 (English). In: Computer Physics Communications, ISSN 0010-4655, E-ISSN 1879-2944, Vol. 238, p. 145-156. Article in journal (Refereed). Published
Abstract [en]

FELTOR is a modular and free scientific software package. It allows developing platform-independent code that runs on a variety of parallel computer architectures ranging from laptop CPUs to multi-GPU distributed-memory systems. FELTOR consists of both a numerical library and a collection of application codes built on top of the library. Its main targets are two- and three-dimensional drift- and gyro-fluid simulations with discontinuous Galerkin methods as the main numerical discretization technique. We observe that numerical simulations of a recently developed gyro-fluid model produce non-deterministic results in parallel computations. First, we show how we restore accuracy and bitwise reproducibility algorithmically and programmatically. In particular, we adopt an implementation of the exactly rounded dot product based on long accumulators, which avoids accuracy losses especially in parallel applications. However, reproducibility and accuracy alone fail to indicate correct simulation behavior. In fact, in the physical model slightly different initial conditions lead to vastly different end states. This behavior translates to its numerical representation. Pointwise convergence, even in principle, becomes impossible for long simulation times. We briefly discuss alternative methods to ensure the correctness of results, such as the convergence of reduced physical quantities of interest, ensemble simulations, invariants, or reduced simulation times. In a second part, we explore important performance tuning considerations. We identify latency and memory bandwidth as the main performance indicators of our routines. Based on these, we propose a parallel performance model that predicts the execution time of algorithms implemented in FELTOR and test our model on a selection of parallel hardware architectures. We are able to predict the execution time with a relative error of less than 25% for problem sizes between 10⁻¹ and 10³ MB. Finally, we find that the product of latency and bandwidth gives a minimum array size per compute node to achieve a scaling efficiency above 50% (both strong and weak).
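The abstract's latency-bandwidth model and its 50%-efficiency rule can be written out explicitly. The formulas below are a reconstruction from the abstract alone (the symbols t_lat for latency and B for bandwidth are mine), not the paper's exact model:

```latex
% Latency-bandwidth execution-time model (reconstructed; notation mine).
% T(n): predicted time for an array of n bytes per node.
T(n) \approx t_{\mathrm{lat}} + \frac{n}{B}
% Efficiency relative to the pure bandwidth-bound limit n/B:
E(n) = \frac{n/B}{T(n)} = \frac{n}{n + t_{\mathrm{lat}}\,B}
\;\ge\; \tfrac12
\;\Longleftrightarrow\;
n \;\ge\; t_{\mathrm{lat}}\,B,
% i.e. the latency-bandwidth product is the minimum array size per node
% for a scaling efficiency above 50%, matching the abstract's rule.
```

On the reproducibility side, the paper adopts an exactly rounded dot product based on long accumulators. As a lighter-weight illustration of the same accuracy problem, the sketch below shows a compensated dot product built from error-free transformations (the Dot2 algorithm of Ogita, Rump and Oishi). This is explicitly not the paper's long-accumulator algorithm: it roughly doubles the working precision, but unlike a long accumulator it is still not bitwise reproducible when the summation order changes.

```c
/* Compensated dot product (Dot2 of Ogita, Rump & Oishi). Illustrates the
 * accuracy-restoring idea only; the paper instead uses long accumulators,
 * which additionally give bitwise-reproducible, exactly rounded results.
 * Compile with FMA support, e.g.: gcc -O2 -mfma dot2.c -lm */
#include <math.h>
#include <stddef.h>

/* Error-free sum (Knuth's TwoSum): s + *e == a + b exactly. */
static double two_sum(double a, double b, double *e) {
    double s = a + b;
    double t = s - a;
    *e = (a - (s - t)) + (b - t);
    return s;
}

/* Dot product evaluated as if in roughly twice the working precision. */
double dot2(const double *x, const double *y, size_t n) {
    double s = 0.0, c = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double p  = x[i] * y[i];
        double ep = fma(x[i], y[i], -p);  /* exact rounding error of the product */
        double es;
        s = two_sum(s, p, &es);           /* exact rounding error of the sum */
        c += es + ep;                     /* accumulate both error terms */
    }
    return s + c;                         /* compensated result */
}
```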

Place, publisher, year, edition, pages
Elsevier, 2019
Keywords
Feltor, GPU, High-performance computing, Performance, Reproducibility, Xeon Phi
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-246445 (URN), 10.1016/j.cpc.2018.12.006 (DOI), 000462802800012 (ISI), 2-s2.0-85059328481 (Scopus ID)
Note

QC 20190326

Available from: 2019-03-26. Created: 2019-03-26. Last updated: 2019-04-29. Bibliographically approved
Thoman, P., Dichev, K., Heller, T., Iakymchuk, R., Aguilar, X., Hasanov, K., . . . Nikolopoulos, D. S. (2018). A taxonomy of task-based parallel programming technologies for high-performance computing. Journal of Supercomputing, 74(4), 1422-1434
2018 (English). In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 74, no 4, p. 1422-1434. Article in journal (Refereed). Published
Abstract [en]

Task-based programming models for shared memory, such as Cilk Plus and OpenMP 3, are well established and documented. However, with the increase in parallel, many-core, and heterogeneous systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, although dozens of different task-based systems exist today and are actively used for parallel and high-performance computing (HPC), no comprehensive overview or classification of task-based technologies for HPC exists. In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms. We demonstrate the usefulness of our taxonomy by classifying state-of-the-art task-based environments in use today.

Place, publisher, year, edition, pages
Springer, 2018
Keywords
High-performance computing, Task-based parallelism, Taxonomy, API, Runtime system, Scheduler, Monitoring framework, Fault tolerance
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-226199 (URN), 10.1007/s11227-018-2238-4 (DOI), 000428284000002 (ISI), 2-s2.0-85041817729 (Scopus ID)
Note

QC 20180518

Available from: 2018-05-18. Created: 2018-05-18. Last updated: 2018-05-18. Bibliographically approved
Thoman, P., Hasanov, K., Dichev, K., Iakymchuk, R., Aguilar, X., Gschwandtner, P., . . . Fahringer, T. (2018). A Taxonomy of Task-Based Technologies for High-Performance Computing. In: Wyrzykowski, R., Dongarra, J., Deelman, E. & Karczewski, K. (Eds.), Parallel Processing and Applied Mathematics (PPAM 2017), Part II. Paper presented at the 12th International Conference on Parallel Processing and Applied Mathematics (PPAM), Sep 10-13, 2017, Lublin, Poland (pp. 264-274). Springer International Publishing AG
2018 (English). In: Parallel Processing and Applied Mathematics (PPAM 2017), Part II / [ed] Wyrzykowski, R., Dongarra, J., Deelman, E. & Karczewski, K., Springer International Publishing AG, 2018, p. 264-274. Conference paper, Published paper (Refereed)
Abstract [en]

Task-based programming models for shared memory, such as Cilk Plus and OpenMP 3, are well established and documented. However, with the increase in heterogeneous, many-core, and parallel systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, although dozens of different task-based systems exist today and are actively used for parallel and high-performance computing, no comprehensive overview or classification of task-based technologies for HPC exists. In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms. We demonstrate the usefulness of our taxonomy by classifying state-of-the-art task-based environments in use today.

Place, publisher, year, edition, pages
Springer International Publishing AG, 2018
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 10778
Keywords
Task-based parallelism, Taxonomy, API, Runtime system, Scheduler, Monitoring framework, Fault tolerance
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-245025 (URN), 10.1007/978-3-319-78054-2_25 (DOI), 000458563900025 (ISI), 2-s2.0-85044764286 (Scopus ID), 978-3-319-78054-2 (ISBN)
Conference
12th International Conference on Parallel Processing and Applied Mathematics (PPAM), Sep 10-13, 2017, Lublin, Poland
Note

QC 20190305

Available from: 2019-03-05. Created: 2019-03-05. Last updated: 2019-03-05. Bibliographically approved
Al Ahad, M. A., Simmendinger, C., Iakymchuk, R., Laure, E. & Markidis, S. (2018). Efficient Algorithms for Collective Operations with Notified Communication in Shared Windows. In: Proceedings of PAW-ATM18: 2018 IEEE/ACM Parallel Applications Workshop, Alternatives to MPI (PAW-ATM). Paper presented at the 2018 IEEE/ACM Parallel Applications Workshop, Alternatives to MPI (PAW-ATM) (pp. 1-10). IEEE
2018 (English). In: Proceedings of PAW-ATM18: 2018 IEEE/ACM Parallel Applications Workshop, Alternatives to MPI (PAW-ATM), IEEE, 2018, p. 1-10. Conference paper, Published paper (Refereed)
Abstract [en]

Collective operations are commonly used in various parts of scientific applications. Especially in strong scaling scenarios, collective operations can negatively impact the overall application performance: while the load per rank decreases with increasing core counts, the time spent in, e.g., barrier operations increases logarithmically with the core count. In this article, we develop novel algorithmic solutions for collective operations such as Allreduce and Allgather(V) by leveraging notified communication in shared windows. To this end, we have developed an extension of GASPI which enables all ranks participating in a shared window to observe the entire notified communication targeted at the window. By exploiting the benefits of this extension, we deliver high-performing implementations of Allreduce and Allgather(V) on Intel and Cray clusters. These implementations achieve 2x-4x performance improvements compared to the best-performing MPI implementations for various data distributions.
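For context, notified communication in GASPI couples a one-sided write with a notification the target can wait on, so data transfer and synchronization happen in a single call. The sketch below shows this standard GPI-2 pattern on which the paper builds; the paper's extension, which lets every rank sharing a window observe notifications targeted at that window, is not part of the public API and is not reproduced here. Segment IDs, offsets, and sizes are illustrative placeholders.

```c
/* Minimal sketch of GASPI notified communication (standard GPI-2 API).
 * The paper's shared-window extension builds on this pattern but is not
 * shown here. Segment IDs, offsets, and sizes are illustrative. */
#include <GASPI.h>

void send_block_with_notification(gaspi_rank_t target) {
    const gaspi_segment_id_t seg = 0;       /* assumed pre-created segment */
    const gaspi_notification_id_t notif = 0;

    /* One-sided write of 4 KiB, followed by a notification the target
     * can wait on: transfer and synchronization in one call. */
    gaspi_write_notify(seg, 0,              /* local segment, local offset  */
                       target, seg, 0,      /* remote rank, segment, offset */
                       4096,                /* bytes to transfer            */
                       notif, 1,            /* notification id, value (>0)  */
                       0, GASPI_BLOCK);     /* queue, timeout               */
}

void wait_for_block(void) {
    const gaspi_segment_id_t seg = 0;
    gaspi_notification_id_t first;
    gaspi_notification_t val;

    /* Block until notification 0 on this segment arrives, then reset it. */
    gaspi_notify_waitsome(seg, 0, 1, &first, GASPI_BLOCK);
    gaspi_notify_reset(seg, first, &val);
    /* ...the written data is now valid and can be reduced locally... */
}
```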

Place, publisher, year, edition, pages
IEEE, 2018
Keywords
Collectives, Allreduce, Allgather, AllgatherV, MPI, PGAS, GASPI, shared windows, shared notifications
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-249835 (URN), 10.1109/PAW-ATM.2018.00006 (DOI), 000462965600001 (ISI), 2-s2.0-85063078028 (Scopus ID)
Conference
2018 IEEE/ACM Parallel Applications Workshop, Alternatives to MPI (PAW-ATM)
Note

QC 20190423

Available from: 2019-04-23. Created: 2019-04-23. Last updated: 2019-04-23. Bibliographically approved
Akhmetova, D., Cebamanos, L., Iakymchuk, R., Rotaru, T., Rahn, M., Markidis, S., . . . Simmendinger, C. (2018). Interoperability of GASPI and MPI in large scale scientific applications. In: 12th International Conference on Parallel Processing and Applied Mathematics, PPAM 2017. Paper presented at the 12th International Conference on Parallel Processing and Applied Mathematics (PPAM 2017), 10-13 September 2017, Lublin, Poland (pp. 277-287). Springer Verlag
2018 (English). In: 12th International Conference on Parallel Processing and Applied Mathematics, PPAM 2017, Springer Verlag, 2018, p. 277-287. Conference paper, Published paper (Refereed)
Abstract [en]

One of the main hurdles to a broad adoption of PGAS approaches is the prevalence of MPI, which, as a de facto standard, appears in the code base of many applications. To take advantage of PGAS APIs like GASPI without a major change to the code base, interoperability between MPI and PGAS approaches needs to be ensured. In this article, we address this challenge by presenting our study and preliminary performance results on interoperating GASPI and MPI in the performance-critical parts of the Ludwig and iPIC3D applications. In addition, we outline a strategy for better coupling of the two APIs.
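A minimal sketch of the interoperability pattern under study, assuming a GPI-2 build with MPI support so that gaspi_proc_init() can attach to an MPI-launched job: MPI keeps startup and global collectives, while the performance-critical halo exchange moves to GASPI notified communication. Segment sizes, offsets, and the neighbor pattern are illustrative placeholders, not taken from Ludwig or iPIC3D.

```c
/* Hedged sketch of basic GASPI/MPI interoperability: MPI owns process
 * startup and collectives; the hot halo exchange uses GASPI one-sided
 * notified communication. Layout and neighbor logic are illustrative. */
#include <mpi.h>
#include <GASPI.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);            /* MPI launches the processes      */
    gaspi_proc_init(GASPI_BLOCK);      /* GASPI attaches to the same job  */

    gaspi_rank_t rank, nprocs;
    gaspi_proc_rank(&rank);
    gaspi_proc_num(&nprocs);

    /* Halo buffers live in a GASPI segment so neighbors can write them. */
    const gaspi_segment_id_t seg = 0;
    gaspi_segment_create(seg, 2 * 4096, GASPI_GROUP_ALL,
                         GASPI_BLOCK, GASPI_MEM_INITIALIZED);

    /* Performance-critical part: one-sided notified halo exchange. */
    gaspi_rank_t right = (rank + 1) % nprocs;
    gaspi_write_notify(seg, 0, right, seg, 4096, 4096,
                       0, 1, 0, GASPI_BLOCK);

    gaspi_notification_id_t id; gaspi_notification_t val;
    gaspi_notify_waitsome(seg, 0, 1, &id, GASPI_BLOCK);
    gaspi_notify_reset(seg, id, &val);

    /* Non-critical parts stay in MPI, e.g. a global residual reduction. */
    double local = 1.0, global;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    gaspi_proc_term(GASPI_BLOCK);
    MPI_Finalize();
    return 0;
}
```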

Place, publisher, year, edition, pages
Springer Verlag, 2018
Keywords
GASPI, Halo exchange, Interoperability, iPIC3D, Ludwig, MPI, Artificial intelligence, Computer science, Computers, De facto standard, Preliminary performance results, Scientific applications
National Category
Mathematics
Identifiers
urn:nbn:se:kth:diva-227469 (URN), 10.1007/978-3-319-78054-2_26 (DOI), 000458563900026 (ISI), 2-s2.0-85044787063 (Scopus ID), 9783319780535 (ISBN)
Conference
12th International Conference on Parallel Processing and Applied Mathematics (PPAM 2017), 10-13 September 2017, Lublin, Poland
Note

QC 20180521

Available from: 2018-05-21. Created: 2018-05-21. Last updated: 2019-03-05. Bibliographically approved
Iakymchuk, R., Shakhno, S. M. & Yarmola, H. P. (2017). Convergence analysis of a two-step modification of the Gauss-Newton method and its applications. Journal of Numerical and Applied Mathematics, 3(126), 61-74
2017 (English). In: Journal of Numerical and Applied Mathematics, ISSN 0868-6912, Vol. 3, no 126, p. 61-74. Article in journal (Refereed). Published
Abstract [en]

We investigate the convergence of a two-step modification of the Gauss-Newton method applying the generalized Lipschitz condition for the first- and second-order derivatives. The convergence order as well as the convergence radius of the method are studied and the uniqueness ball of the solution of the nonlinear least squares problem is examined. Finally, we carry out numerical experiments on a set of well-known test problems.
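For background, the classical Gauss-Newton iteration for the nonlinear least squares problem min_x (1/2)||F(x)||², together with a schematic two-step variant that reuses the linearization for a corrector step. The paper's exact scheme and its convergence analysis are given in the article; the second display below only indicates the general two-step structure:

```latex
% Classical Gauss-Newton step, with J(x) the Jacobian of the residual F:
x_{k+1} = x_k - \bigl( J(x_k)^{\top} J(x_k) \bigr)^{-1} J(x_k)^{\top} F(x_k).
% Schematic two-step structure (the paper's scheme may differ): a
% corrector step reuses the already factorized normal matrix,
y_k     = x_k - \bigl( J(x_k)^{\top} J(x_k) \bigr)^{-1} J(x_k)^{\top} F(x_k), \\
x_{k+1} = y_k - \bigl( J(x_k)^{\top} J(x_k) \bigr)^{-1} J(x_k)^{\top} F(y_k),
% trading one extra residual evaluation per iteration for a higher
% convergence order under the generalized Lipschitz conditions studied.
```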

Place, publisher, year, edition, pages
Ivan Franko National University of Lviv, 2017
Keywords
Least squares problem, Gauss-Newton method, Lipschitz conditions with L average, radius of convergence, uniqueness ball
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-223841 (URN), 000425042700005 (ISI)
Note

QC 20180306

Available from: 2018-03-06. Created: 2018-03-06. Last updated: 2018-03-06. Bibliographically approved
Markidis, S., Peng, I., Iakymchuk, R., Laure, E., Kestor, G. & Gioiosa, R. (2016). A performance characterization of streaming computing on supercomputers. In: Procedia Computer Science: . Paper presented at International Conference on Computational Science, ICCS 2016, 6 June 2016 through 8 June 2016 (pp. 98-107). Elsevier
2016 (English). In: Procedia Computer Science, Elsevier, 2016, p. 98-107. Conference paper, Published paper (Refereed)
Abstract [en]

Streaming computing models allow for on-the-fly processing of large data sets. With the increased demand for processing large amounts of data in a reasonable period of time, streaming models are more and more used on supercomputers to solve data-intensive problems. Because supercomputers have mainly been used for compute-intensive workloads, supercomputer performance metrics focus on the number of floating-point operations per unit time and cannot fully characterize the performance of a streaming application on supercomputers. We introduce the injection and processing rates as the main metrics to characterize the performance of streaming computing on supercomputers. We analyze the dynamics of these quantities in a modified STREAM benchmark developed atop an MPI streaming library in a series of different configurations. We show that after a brief transient the injection and processing rates converge to sustained rates. We also demonstrate that streaming computing performance strongly depends on the number of connections between data producers and consumers and on the processing task granularity.
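A plausible formalization of the two metrics, reconstructed from the abstract (the notation is mine, not the paper's): over a measurement window of length Δt,

```latex
% Illustrative definitions (notation mine, reconstructed from the abstract):
R_{\mathrm{inj}}(t)  = \frac{\text{data injected by producers in } [t, t+\Delta t]}{\Delta t},
\qquad
R_{\mathrm{proc}}(t) = \frac{\text{data processed by consumers in } [t, t+\Delta t]}{\Delta t}.
% The abstract reports that after a brief transient both converge to
% sustained values. A configuration is sustainable only if the sustained
% processing rate matches or exceeds the sustained injection rate;
% otherwise unprocessed data accumulates without bound.
```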

Place, publisher, year, edition, pages
Elsevier, 2016
Keywords
Big data, Data-driven applications, High-performance computing, Streaming computing, Data handling, Supercomputers, Computing performance, High performance computing, Performance characterization, Performance metrics, Processing rates, Streaming applications, Task granularity
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-195477 (URN), 10.1016/j.procs.2016.05.301 (DOI), 2-s2.0-84978536252 (Scopus ID)
Conference
International Conference on Computational Science, ICCS 2016, 6 June 2016 through 8 June 2016
Note

Funding Details: 671500, EC, European Commission

QC 20161125

Available from: 2016-11-25. Created: 2016-11-03. Last updated: 2018-01-13. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0003-2414-700X
