Publications (10 of 16)
Thoman, P., Dichev, K., Heller, T., Iakymchuk, R., Aguilar, X., Hasanov, K., . . . Nikolopoulos, D. S. (2018). A taxonomy of task-based parallel programming technologies for high-performance computing. Journal of Supercomputing, 74(4), 1422-1434
A taxonomy of task-based parallel programming technologies for high-performance computing
2018 (English) In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 74, no 4, p. 1422-1434. Article in journal (Refereed), Published
Abstract [en]

Task-based programming models for shared memory, such as Cilk Plus and OpenMP 3, are well established and documented. However, with the increase in parallel, many-core, and heterogeneous systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, despite the fact that dozens of different task-based systems exist today and are actively used for parallel and high-performance computing (HPC), no comprehensive overview or classification of task-based technologies for HPC exists. In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms. We demonstrate the usefulness of our taxonomy by classifying state-of-the-art task-based environments in use today.

Place, publisher, year, edition, pages
Springer, 2018
Keywords
High-performance computing, Task-based parallelism, Taxonomy, API, Runtime system, Scheduler, Monitoring framework, Fault tolerance
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-226199 (URN), 10.1007/s11227-018-2238-4 (DOI), 000428284000002 (ISI), 2-s2.0-85041817729 (Scopus ID)
Note

QC 20180518

Available from: 2018-05-18. Created: 2018-05-18. Last updated: 2019-08-20. Bibliographically approved.
Aguilar, X., Fürlinger, K. & Laure, E. (2016). Online MPI trace compression using event flow graphs and wavelets. In: Procedia Computer Science. Paper presented at the International Conference on Computational Science, ICCS 2016, 6-8 June 2016 (pp. 1497-1506). Elsevier
Online MPI trace compression using event flow graphs and wavelets
2016 (English) In: Procedia Computer Science, Elsevier, 2016, p. 1497-1506. Conference paper, Published paper (Refereed)
Abstract [en]

Performance analysis of scientific parallel applications is essential to use High Performance Computing (HPC) infrastructures efficiently. Nevertheless, collecting detailed data from large-scale parallel programs and long-running applications is infeasible due to the huge amount of performance information generated. Even though there are no technological constraints on storing terabytes of performance data, the constant flushing of such data to disk introduces a massive overhead into the application that makes the performance measurements worthless. This paper explores the use of event flow graphs together with wavelet analysis and EZW encoding to provide MPI event traces that are orders of magnitude smaller while preserving accurate information on timestamped events. Our mechanism compresses the performance data online while the application runs, thus reducing the pressure put on the I/O system due to buffer flushing. As a result, we achieve lower application perturbation, reduced performance data output, and the possibility to monitor longer application runs.
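
The EZW-over-wavelets pipeline described above is not reproduced here, but the core idea of wavelet-based compression of timing data can be sketched in a few lines: transform a per-rank timing series with a Haar wavelet, then keep only the coefficients above a threshold (EZW would additionally entropy-code the survivors). The Python snippet below is an illustrative sketch under those simplifying assumptions, not the authors' implementation; the timing values are invented.

import numpy as np

def haar_transform(signal):
    # Full Haar wavelet decomposition of a length-2^k signal.
    coeffs, approx = [], np.asarray(signal, dtype=float)
    while len(approx) > 1:
        pairs = approx.reshape(-1, 2)
        coeffs.append((pairs[:, 0] - pairs[:, 1]) / np.sqrt(2))   # detail coefficients
        approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)         # running approximation
    coeffs.append(approx)
    return coeffs

def threshold(coeffs, cutoff):
    # Drop small coefficients; this is the lossy step that shrinks the data.
    return [np.where(np.abs(c) >= cutoff, c, 0.0) for c in coeffs]

# Hypothetical per-iteration MPI timings (microseconds) for one rank, length a power of two.
timings = np.array([120, 118, 121, 119, 350, 122, 120, 117], dtype=float)
kept = threshold(haar_transform(timings), cutoff=5.0)
print("non-zero coefficients stored:", sum(int(np.count_nonzero(c)) for c in kept), "of", timings.size)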

Place, publisher, year, edition, pages
Elsevier, 2016
Keywords
Event flow graphs, EZW coding, MPI performance monitoring, Trace compression, Wavelets, Application programs, Graphic methods, Wavelet analysis, Event-flow graph, Performance monitoring, Flow graphs
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-194600 (URN), 10.1016/j.procs.2016.05.471 (DOI), 2-s2.0-84978517854 (Scopus ID)
Conference
International Conference on Computational Science, ICCS 2016, 6 June 2016 through 8 June 2016
Note

Conference Paper. QC 20161102

Available from: 2016-11-02. Created: 2016-10-31. Last updated: 2016-11-02. Bibliographically approved.
Aguilar, X., Fürlinger, K. & Laure, E. (2015). Automatic On-Line Detection of MPI Application Structure with Event Flow Graphs. In: Euro-Par 2015: Parallel Processing. Paper presented at the 21st International Conference on Parallel and Distributed Computing (Euro-Par), August 24-28, 2015, Vienna, Austria (pp. 70-81). Springer Berlin/Heidelberg
Automatic On-Line Detection of MPI Application Structure with Event Flow Graphs
2015 (English) In: Euro-Par 2015: Parallel Processing, Springer Berlin/Heidelberg, 2015, p. 70-81. Conference paper, Published paper (Refereed)
Abstract [en]

The deployment of larger and larger HPC systems challenges the scalability of both applications and analysis tools. Performance analysis toolsets provide users with means to spot bottlenecks in their applications by either collecting aggregated statistics or generating lossless time-stamped traces. While obtaining detailed trace information is the best method to examine the behavior of an application in detail, it is infeasible at extreme scales due to the huge volume of data generated. In this context, knowing the application structure, and particularly the nesting of loops in iterative applications, is of great importance as it allows, among other things, reducing the amount of data collected by focusing on important sections of the code. In this paper, we demonstrate how the loop nesting structure of an MPI application can be extracted on-line from its event flow graph without the need for any explicit source code instrumentation. We show how this knowledge of the application structure can be used to compute post-mortem statistics as well as to reduce the amount of redundant data collected. To that end, we present a usage scenario where this structure information is utilized on-line (while the application runs) to intelligently collect fine-grained data for only a few iterations of an application, considerably reducing the amount of data gathered.
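
As a rough illustration of the structure-detection idea (loops in the code show up as cycles in a rank's event flow graph), the Python sketch below finds loop head nodes as the targets of back edges in a depth-first search over a toy graph. The event names and graph are invented for the example; this is not the authors' implementation.

# Toy event flow graph of one MPI rank: node -> ordered successors.
graph = {
    "MPI_Init":     ["MPI_Bcast"],
    "MPI_Bcast":    ["MPI_Send"],
    "MPI_Send":     ["MPI_Recv"],
    "MPI_Recv":     ["MPI_Send", "MPI_Finalize"],   # back edge to MPI_Send forms a cycle
    "MPI_Finalize": [],
}

def find_loop_heads(graph, root):
    # Targets of DFS back edges correspond to loop entry points.
    loop_heads, visited, on_stack = set(), {root}, {root}
    stack = [(root, iter(graph[root]))]
    while stack:
        node, children = stack[-1]
        child = next(children, None)
        if child is None:
            stack.pop(); on_stack.discard(node)
        elif child in on_stack:
            loop_heads.add(child)                    # cycle detected: child heads a loop
        elif child not in visited:
            visited.add(child); on_stack.add(child)
            stack.append((child, iter(graph[child])))
    return loop_heads

print(find_loop_heads(graph, "MPI_Init"))            # {'MPI_Send'}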

Place, publisher, year, edition, pages
Springer Berlin/Heidelberg, 2015
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 9233
Keywords
Application structure detection, Flow graph analysis, Performance monitoring, Online analysis, Automatic loop detection
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-177430 (URN), 10.1007/978-3-662-48096-0_6 (DOI), 000363786800006 (ISI), 2-s2.0-84944047051 (Scopus ID), 978-3-662-48096-0 (ISBN), 978-3-662-48095-3 (ISBN)
Conference
21st International Conference on Parallel and Distributed Computing (Euro-Par), August 24-28, 2015, Vienna, Austria
Note

QC 20151124

Available from: 2015-11-24. Created: 2015-11-20. Last updated: 2018-01-10. Bibliographically approved.
Aguilar, X. (2015). Towards Scalable Performance Analysis of MPI Parallel Applications. (Licentiate dissertation). Stockholm: KTH Royal Institute of Technology
Towards Scalable Performance Analysis of MPI Parallel Applications
2015 (English)Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

A considerable fraction of scientific discovery nowadays relies on computer simulations. High Performance Computing (HPC) provides scientists with the means to simulate processes ranging from climate modeling to protein folding. However, achieving good application performance and making optimal use of HPC resources is a heroic task due to the complexity of parallel software. Therefore, performance tools and runtime systems that help users execute applications optimally are of utmost importance in the landscape of HPC. In this thesis, we explore different techniques to tackle the challenges of collecting, storing, and using fine-grained performance data. First, we investigate the automatic use of real-time performance data in order to run applications optimally. To that end, we present a prototype of an adaptive task-based runtime system that uses real-time performance data for task scheduling. This runtime system has a performance monitoring component that provides real-time access to the performance behavior of an application while it runs. The implementation of this monitoring component is presented and evaluated within this thesis. Second, we explore lossless compression approaches for MPI monitoring. One of the main problems that performance tools face is the huge amount of fine-grained data that can be generated from an instrumented application. Collecting fine-grained data from a program is the best method to uncover the root causes of performance bottlenecks; however, it is infeasible with extremely parallel applications or applications with long execution times. On the other hand, collecting coarse-grained data is scalable but sometimes not enough to discern the root cause of a performance problem. Thus, we propose a new method for performance monitoring of MPI programs using event flow graphs. Event flow graphs provide very low overhead in terms of execution time and storage size, and can be used to reconstruct fine-grained trace files of application events ordered in time.
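
The adaptive runtime mentioned in the abstract is not specified here, so the snippet below is only a schematic Python sketch of the general idea it describes: feed measured task durations back into scheduling decisions, always dispatching the next task to the least-loaded worker. All names and numbers are invented for illustration; this is not the thesis prototype.

import heapq
import random

def schedule(measured_costs, n_workers):
    # Greedy list scheduling driven by monitored task durations:
    # each task goes to the worker with the smallest accumulated busy time.
    heap = [(0.0, w) for w in range(n_workers)]      # (busy time, worker id)
    heapq.heapify(heap)
    placement = {}
    for task, cost in measured_costs.items():
        busy, worker = heapq.heappop(heap)
        placement[task] = worker
        heapq.heappush(heap, (busy + cost, worker))  # feedback from the monitoring component
    return placement

measured = {f"task{i}": random.uniform(0.5, 2.0) for i in range(8)}   # pretend measurements
print(schedule(measured, n_workers=3))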

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2015. p. viii, 39
Series
TRITA-CSC-A, ISSN 1653-5723 ; 2015:05
Keywords
parallel computing, performance monitoring, performance tools, event flow graphs
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-165043 (URN), 978-91-7595-518-6 (ISBN)
Presentation
2015-05-20, The Visualization Studio, room 4451, Lindstedtsvägen 5, KTH, Stockholm, 10:00 (English)
Note

QC 20150508

Available from: 2015-05-08. Created: 2015-04-21. Last updated: 2015-05-08. Bibliographically approved.
Aguilar, X., Fürlinger, K. & Laure, E. (2015). Visual MPI Performance Analysis using Event Flow Graphs. Paper presented at the International Conference on Computational Science, ICCS 2015: Computational Science at the Gates of Nature. Procedia Computer Science, 51, 1353-1362
Visual MPI Performance Analysis using Event Flow Graphs
2015 (English) In: Procedia Computer Science, ISSN 1877-0509, E-ISSN 1877-0509, Vol. 51, p. 1353-1362. Article in journal (Refereed), Published
Abstract [en]

Event flow graphs used in the context of performance monitoring combine the scalability and low overhead of profiling methods with the lossless information recording of tracing tools. In other words, they capture statistics on the performance behavior of parallel applications while preserving the temporal ordering of events. Event flow graphs require significantly less storage than regular event traces and can still be used to recover the full ordered sequence of events performed by the application. In this paper, we explore the use of event flow graphs in the context of visual performance analysis. We show that the graphs can be used to quickly spot performance problems, helping to better understand the behavior of an application. We demonstrate our performance analysis approach with MiniFE, a mini-application that mimics the key performance aspects of finite-element applications in High Performance Computing (HPC).
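
One way to picture how such visual analysis could look (purely illustrative; not the tooling used in the paper) is to dump a per-rank event flow graph, with transition counts on the edges, to Graphviz DOT and render it: heavily weighted cycles then stand out as the application's main loops. The graph below is a made-up example.

# Toy per-rank event flow graph: (from_event, to_event) -> number of transitions.
edges = {
    ("MPI_Bcast", "MPI_Send"): 1,
    ("MPI_Send", "MPI_Recv"): 1000,
    ("MPI_Recv", "MPI_Send"): 999,     # heavy cycle: the main communication loop
    ("MPI_Recv", "MPI_Reduce"): 1,
}

def to_dot(edges):
    lines = ["digraph event_flow_graph {"]
    for (src, dst), count in edges.items():
        lines.append(f'  "{src}" -> "{dst}" [label="{count}"];')
    lines.append("}")
    return "\n".join(lines)

with open("efg.dot", "w") as f:        # render with: dot -Tpdf efg.dot -o efg.pdf
    f.write(to_dot(edges))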

Place, publisher, year, edition, pages
Elsevier, 2015
Keywords
visual performance analysis, event flow graphs, loop detection, MPI monitoring
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-168701 (URN), 10.1016/j.procs.2015.05.322 (DOI)
Conference
International Conference on Computational Science, ICCS 2015: Computational Science at the Gates of Nature
Note

QC 20150617

Available from: 2015-06-08. Created: 2015-06-08. Last updated: 2017-12-04. Bibliographically approved.
Aguilar, X., Fürlinger, K. & Laure, E. (2014). MPI Trace Compression Using Event Flow Graphs. Paper presented at Euro-Par 2014 Parallel Processing (pp. 1-12).
MPI Trace Compression Using Event Flow Graphs
2014 (English) Conference paper, Published paper (Refereed)
Abstract [en]

Understanding how parallel applications behave is crucial for using high-performance computing (HPC) resources efficiently. However, the task of performance analysis is becoming increasingly difficult due to the growing complexity of scientific codes and the size of machines. Even though many tools have been developed over the past years to help in this task, current approaches either offer only an overview of the application, discarding temporal information, or generate huge trace files that are often difficult to handle.

In this paper we propose the use of event flow graphs for monitoring MPI applications, a new and different approach that balances the low overhead of profiling tools with the abundance of information available from tracers. Event flow graphs are captured with very low overhead, require orders of magnitude less storage than standard trace files, and can still recover the full sequence of events in the application. We test this new approach with the NERSC-8/Trinity Benchmark suite and achieve compression ratios up to 119x.
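
A minimal Python sketch of the compression idea (a toy reimplementation for illustration, not the paper's tool) is shown below: consecutive event transitions are folded into a graph whose outgoing edges are run-length encoded in program order, so a long iterative trace collapses to a handful of edges yet the exact event sequence can still be replayed.

from collections import defaultdict

def build_efg(trace):
    # node -> ordered, run-length-encoded list of its outgoing transitions.
    graph = defaultdict(list)
    for cur, nxt in zip(trace, trace[1:]):
        succs = graph[cur]
        if succs and succs[-1][0] == nxt:
            succs[-1][1] += 1
        else:
            succs.append([nxt, 1])
    return trace[0], dict(graph)

def reconstruct(first, graph):
    # Replay each node's outgoing transitions in order to recover the full trace.
    position = {node: [0, 0] for node in graph}      # node -> [entry index, repetitions used]
    out, node = [first], first
    while node in graph and position[node][0] < len(graph[node]):
        idx, used = position[node]
        succ, reps = graph[node][idx]
        position[node] = [idx + 1, 0] if used + 1 == reps else [idx, used + 1]
        out.append(succ)
        node = succ
    return out

trace = ["MPI_Init"] + ["MPI_Send", "MPI_Recv"] * 1000 + ["MPI_Finalize"]
first, graph = build_efg(trace)
assert reconstruct(first, graph) == trace
print(len(trace), "events stored as", sum(len(v) for v in graph.values()), "graph edges")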

Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 8632
Keywords
MPI event flow graphs, trace compression, trace reconstruction, performance monitoring
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-165042 (URN), 2-s2.0-84958532986 (Scopus ID)
Conference
Euro-Par 2014 Parallel Processing
Note

QC 20150423. QC 20160314

Available from: 2015-04-21. Created: 2015-04-21. Last updated: 2017-04-28. Bibliographically approved.
Aguilar, X., Laure, E. & Fürlinger, K. (2014). Online Performance Data Introspection with IPM. In: Proceedings of the 15th IEEE International Conference on High Performance Computing and Communications (HPCC 2013). Paper presented at the 15th IEEE International Conference on High Performance Computing and Communications (HPCC 2013), Zhangjiajie, China, November 13-15, 2013 (pp. 728-734). IEEE Computer Society
Online Performance Data Introspection with IPM
2014 (English) In: Proceedings of the 15th IEEE International Conference on High Performance Computing and Communications (HPCC 2013), IEEE Computer Society, 2014, p. 728-734. Conference paper, Published paper (Refereed)
Abstract [en]

Exascale systems will be heterogeneous architectures with multiple levels of concurrency and energy constraints. In such a complex scenario, performance monitoring and runtime systems play a major role in obtaining good application performance and scalability. Furthermore, online access to performance data becomes a necessity to decide how to schedule resources and orchestrate computational elements: processes, threads, tasks, etc. We present the Performance Introspection API, an extension of the IPM tool that provides online runtime access to performance data from an application while it runs. We describe its design and implementation and show its overhead on several test benchmarks. We also present a real test case using the Performance Introspection API in conjunction with processor frequency scaling to reduce power consumption.
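
The actual Performance Introspection API is a C interface extending IPM, and its calls are not reproduced here. Purely as a hypothetical illustration of the usage pattern the paper describes (query live performance data, then act on it, for example by scaling the processor frequency), consider the Python sketch below; every function name in it is invented.

import random
import time

def query_mpi_time_fraction():
    # Hypothetical stand-in for an introspection call reporting how much of the
    # recent interval this process spent inside MPI (here: a random placeholder).
    return random.uniform(0.0, 1.0)

def set_cpu_frequency(level):
    # Hypothetical stand-in for a DVFS call.
    print("scaling CPU frequency to", level)

def monitor(iterations=5, interval_s=0.1):
    for _ in range(iterations):
        mpi_fraction = query_mpi_time_fraction()
        # A rank that mostly waits in MPI gains little from a high clock rate,
        # so lower the frequency to save power; otherwise run at full speed.
        set_cpu_frequency("low" if mpi_fraction > 0.5 else "high")
        time.sleep(interval_s)

monitor()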

Place, publisher, year, edition, pages
IEEE Computer Society, 2014
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-136212 (URN), 10.1109/HPCC.and.EUC.2013.107 (DOI), 2-s2.0-84903964607 (Scopus ID), 978-076955088-6 (ISBN)
Conference
The 15th IEEE International Conference on High Performance Computing and Communications (HPCC 2013), Zhangjiajie, China, November 13-15, 2013.
Note

QC 20140602

Available from: 2013-12-04. Created: 2013-12-04. Last updated: 2015-05-08. Bibliographically approved.
Markidis, S., Schliephake, M., Aguilar, X., Henty, D., Richardson, H., Hart, A., . . . Laure, E. (2013). Paving the path to exascale computing with CRESTA development environment. Paper presented at the Exascale Software and Applications Conference.
Paving the path to exascale computing with CRESTA development environment
2013 (English) Conference paper, Oral presentation with published abstract (Other academic)
Abstract [en]

The development and implementation of efficient computer codes for exascale supercomputers will require combined advancement of all development environment components: compilers, automatic tuning frameworks, run-time systems, debuggers, and performance monitoring and analysis tools. The exascale era poses unprecedented challenges. Because the presence of accelerators is more and more common among the fastest supercomputers and will play a role in exascale computing, compilers will need to support hybrid computer architectures and generate efficient code that hides the complexity of programming accelerators. Hand optimization of the code will be very difficult on exascale machines and will be increasingly assisted by automatic tuners. Application tuning will focus more on the parallel aspects of the computation because of the large amount of available parallelism. The application workload will be distributed over millions of processes, and implementing ad-hoc strategies directly in the application will probably be infeasible, whereas an adaptive run-time system will provide automatic load balancing. Debuggers and performance monitoring tools will have to deal with millions of processes and with huge amounts of data from application and hardware counters, but they will still be required to minimize overhead and retain scalability. In this talk, we present how the development environment of the CRESTA exascale EC project meets all these challenges by advancing the state of the art in the field.

An investigation of compiler support for hybrid GPU programming, the design concepts, and the main characteristics of the alpha prototype implementation of the CRESTA development environment components for exascale computing are presented. A performance study of OpenACC compiler directives has been carried out, showing very promising results and indicating OpenACC as a viable approach for programming hybrid exascale supercomputers. A new Domain-Specific Language (DSL) has been defined for the expression of parallel auto-tuning at very large scale. The focus is on extending the auto-tuning approach into the parallel domain to enable tuning of communication-related aspects of applications. A new adaptive run-time system has been designed to schedule processes depending on resource availability, on the workload, and on run-time analysis of application performance. The Allinea DDT debugger and the Dresden University of Technology MUST MPI correctness checker are being extended to provide a unified interface, to improve scalability, and to include new disruptive technology based on statistical analysis of the run-time behavior of the application for anomaly detection. The new exascale prototypes of the Dresden University of Technology Vampir, VampirTrace, and Score-P performance monitoring and analysis tools have been released. The new features include the possibility of applying filtering techniques before loading performance data to drastically reduce memory needs during performance analysis. The initial evaluation study of the development environment is targeted at the CRESTA project applications to determine how the development environment could be coupled into a production suite for exascale computing.

National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-139548 (URN)
Conference
Exascale Software and Applications Conference
Note

QC 20140624

Available from: 2014-01-15. Created: 2014-01-15. Last updated: 2018-01-11. Bibliographically approved.
Aguilar, X., Schliephake, M., Vahtras, O., Gimenez, J. & Laure, E. (2013). Scalability analysis of Dalton, a molecular structure program. Future Generation Computer Systems, 29(8), 2197-2204
Scalability analysis of Dalton, a molecular structure program
2013 (English) In: Future Generation Computer Systems, ISSN 0167-739X, E-ISSN 1872-7115, Vol. 29, no 8, p. 2197-2204. Article in journal (Refereed), Published
Abstract [en]

Dalton is a molecular electronic structure program featuring common methods of computational chemistry based on pure quantum mechanics (QM) as well as hybrid quantum mechanics/molecular mechanics (QM/MM). It is specialized in, and has a leading position in, the calculation of molecular properties, with a large worldwide user community (over 2000 licenses issued). In this paper, we present a performance characterization and optimization of Dalton. We also propose a solution that prevents Dalton's master/worker design from becoming a performance bottleneck at larger process counts. With these improvements we obtain speedups of 4x, increasing the parallel efficiency of the code and enabling it to run on a much larger number of cores.
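
Why a single master limits scalability can be seen with a toy analytic model (illustrative numbers only, unrelated to Dalton's actual measurements): the compute part shrinks with the number of workers, but the per-task dispatch work stays serialized in the master, so speedup saturates.

def runtime(n_tasks, n_workers, t_task=1.0, t_dispatch=0.002):
    # Parallel compute time plus the dispatch time serialized in the master.
    return n_tasks * t_task / n_workers + n_tasks * t_dispatch

serial = runtime(10_000, 1)
for p in (16, 64, 256, 1024):
    print(f"{p:5d} workers: speedup {serial / runtime(10_000, p):7.1f}")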

Keywords
Performance analysis, Optimization, Scalability
National Category
Computer Systems
Research subject
SRA - E-Science (SeRC)
Identifiers
urn:nbn:se:kth:diva-136200 (URN), 10.1016/j.future.2013.04.013 (DOI), 000326613400028 (ISI), 2-s2.0-84886093468 (Scopus ID)
Funder
Swedish e‐Science Research Center
Note

QC 20131216

Available from: 2013-12-04. Created: 2013-12-04. Last updated: 2017-12-06. Bibliographically approved.
Pons, C., Jimenez-Gonzalez, D., Gonzalez-Alvarez, C., Servat, H., Cabrera-Benitez, D., Aguilar, X. & Fernandez-Recio, J. (2012). Cell-Dock: high-performance protein-protein docking. Bioinformatics, 28(18), 2394-2396
Cell-Dock: high-performance protein-protein docking
2012 (English) In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 28, no 18, p. 2394-2396. Article in journal (Refereed), Published
Abstract [en]

The application of docking to large-scale experiments or the explicit treatment of protein flexibility are part of the new challenges in structural bioinformatics that will require large computer resources and more efficient algorithms. Highly optimized fast Fourier transform (FFT) approaches are broadly used in docking programs, but their optimal code implementation leaves hardware acceleration as the only option to significantly reduce the computational cost of these tools. In this work, we present Cell-Dock, an FFT-based docking algorithm adapted to the Cell BE processor. We show that Cell-Dock runs faster than FTDock, with maximum speedups above 200x, while achieving results of similar quality.
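
FFT-based rigid docking of the kind Cell-Dock accelerates scores every relative translation of two discretized grids through a correlation computed with forward and inverse FFTs. A small NumPy sketch of that scoring step on toy random grids (not the Cell-Dock code, which targets the Cell BE processor) is given below.

import numpy as np

def fft_correlation_scores(receptor_grid, ligand_grid):
    # Correlation theorem: scores for all translations = IFFT(FFT(R) * conj(FFT(L))).
    R = np.fft.fftn(receptor_grid)
    L = np.fft.fftn(ligand_grid, s=receptor_grid.shape)
    return np.real(np.fft.ifftn(R * np.conj(L)))

rng = np.random.default_rng(0)
receptor = rng.random((32, 32, 32))    # toy stand-in for a discretized receptor
ligand = rng.random((32, 32, 32))      # toy stand-in for a discretized ligand
scores = fft_correlation_scores(receptor, ligand)
best = np.unravel_index(np.argmax(scores), scores.shape)
print("best translation:", best, "score:", round(float(scores[best]), 2))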

National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:kth:diva-103367 (URN), 10.1093/bioinformatics/bts454 (DOI), 000308532300065 (ISI), 2-s2.0-84866454213 (Scopus ID)
Funder
Swedish e‐Science Research Center
Note

QC 20121016

Available from: 2012-10-16. Created: 2012-10-11. Last updated: 2018-01-12. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0001-9693-6265
