Publications (10 of 85)
Oz, I., Bhatti, M. K., Popov, K. & Brorsson, M. (2019). Regression-Based Prediction for Task-Based Program Performance. Journal of Circuits, Systems and Computers, 28(4), Article ID 1950060.
2019 (English). In: Journal of Circuits, Systems and Computers, ISSN 0218-1266, Vol. 28, no. 4, article id 1950060. Article in journal (Refereed). Published.
Abstract [en]

As multicore systems evolve by increasing the number of parallel execution units, parallel programming models have been released to exploit parallelism in applications. The task-based programming model uses task abstractions to specify parallel tasks and schedules tasks onto processors at runtime. To maximize efficiency and performance, one must identify the appropriate runtime configuration and how processor cores should be shared among tasks. Exploring the design space of all possible scheduling and runtime options, especially for large input data, is infeasible and calls for statistical modeling. Regression-based modeling determines the effects of multiple factors on a response variable and makes predictions based on statistical analysis. In this work, we propose a regression-based modeling approach to predict task-based program performance for different scheduling parameters with variable data size. We execute a set of task-based programs while varying the runtime parameters, and conduct a systematic measurement of the factors influencing execution time. Our approach uses executions with different configurations for a set of input data, and derives different regression models to predict execution time for larger input data. Our results show that the regression models provide accurate predictions for validation inputs, with a mean error rate as low as 6.3%, and 14% on average across four task-based programs.
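
A minimal sketch of this kind of regression-based predictor, not the paper's actual model: fit a linear model on log-transformed measurements from small inputs, then extrapolate to a larger input. The features (core count, input size) and all data points below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: (number of cores, input size); target: measured execution time (s).
# These training runs are made-up numbers standing in for real measurements.
X_train = np.array([[2, 1000], [4, 1000], [8, 1000],
                    [2, 2000], [4, 2000], [8, 2000]])
y_train = np.array([4.1, 2.2, 1.3, 8.3, 4.5, 2.6])

# Log-transform both sides so the linear model captures multiplicative scaling.
model = LinearRegression()
model.fit(np.log(X_train), np.log(y_train))

# Predict execution time for a larger input than any seen in training.
X_new = np.array([[8, 8000]])
predicted_time = np.exp(model.predict(np.log(X_new)))
print(f"predicted execution time: {predicted_time[0]:.2f} s")
```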

Place, publisher, year, edition, pages
World Scientific Publishing Co Pte Ltd, 2019
Keywords
Performance prediction, task-based programs, regression
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-249799 (URN), 10.1142/S0218126619500609 (DOI), 000462969800009 (ISI), 2-s2.0-85049081368 (Scopus ID)
Note

QC 20190424

Available from: 2019-04-24. Created: 2019-04-24. Last updated: 2019-04-24. Bibliographically approved.
Bhatti, M. K., Oz, I., Amin, S., Mushtaq, M., Farooq, U., Popov, K. & Brorsson, M. (2018). Locality-aware task scheduling for homogeneous parallel computing systems. Computing, 100(6), 557-595
2018 (English). In: Computing, ISSN 0010-485X, E-ISSN 1436-5057, Vol. 100, no. 6, p. 557-595. Article in journal (Refereed). Published.
Abstract [en]

In systems with a complex many-core cache hierarchy, exploiting data locality can significantly reduce the execution time and energy consumption of parallel applications. Locality can be exploited at various hardware and software layers. For instance, by implementing private and shared caches in a multi-level fashion, recent hardware designs are already optimised for locality. However, these hardware optimizations are wasted if software scheduling does not arrange execution to exploit the locality available in the programs themselves. Since programs for parallel systems consist of tasks executed simultaneously, task scheduling becomes crucial for performance in multi-level cache architectures. This paper presents a heuristic algorithm for homogeneous multi-core systems called locality-aware task scheduling (LeTS). The LeTS heuristic is a work-conserving algorithm that takes into account both locality and load balancing in order to reduce the execution time of target applications. The working principle of LeTS is based on two distinctive phases, namely the working task group formation phase (WTG-FP) and the working task group ordering phase (WTG-OP). The WTG-FP forms groups of tasks in order to capture data reuse across tasks, while the WTG-OP determines an optimal order of execution for task groups that minimizes the reuse distance of shared data between tasks. We have performed experiments using randomly generated task graphs by varying three major performance parameters, namely (1) the communication-to-computation ratio (CCR), between 0.1 and 1.0, (2) application size, i.e., task graphs comprising 50, 100, and 300 tasks per graph, and (3) the number of cores, with 2-, 4-, 8-, and 16-core execution scenarios. We have also performed experiments using selected real-world applications. The LeTS heuristic reduces the overall execution time of applications by exploiting inter-task data locality. Results show that LeTS outperforms state-of-the-art algorithms in amortizing inter-task communication cost.
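
A simplified sketch of the two-phase idea described above, not the LeTS algorithm itself: tasks are grouped by the data blocks they touch (cf. WTG-FP), then the groups are ordered greedily so that consecutive groups share as much data as possible (cf. WTG-OP). All task and data names are made up.

```python
from collections import defaultdict

tasks = {  # task -> set of data blocks it reads/writes (illustrative)
    "t1": {"A", "B"}, "t2": {"A"}, "t3": {"C"},
    "t4": {"B", "C"}, "t5": {"C", "D"},
}

# Phase 1 (cf. WTG-FP): group tasks that share at least one data block,
# capturing data reuse across tasks.
groups = []
for task, data in tasks.items():
    for g in groups:
        if data & g["data"]:
            g["tasks"].append(task)
            g["data"] |= data
            break
    else:
        groups.append({"tasks": [task], "data": set(data)})

# Phase 2 (cf. WTG-OP): order groups greedily by data overlap with the
# previously scheduled group, shortening the reuse distance of shared blocks.
ordered = [groups.pop(0)]
while groups:
    prev = ordered[-1]["data"]
    groups.sort(key=lambda g: len(g["data"] & prev), reverse=True)
    ordered.append(groups.pop(0))

for i, g in enumerate(ordered):
    print(f"group {i}: {g['tasks']} touching {sorted(g['data'])}")
```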

Place, publisher, year, edition, pages
Springer, 2018
Keywords
Runtime resource management, Parallel computing, Multicore scheduling, Homogeneous systems, Directed acyclic graph (DAG), Embedded systems
National Category
Other Physics Topics
Identifiers
urn:nbn:se:kth:diva-230491 (URN), 10.1007/s00607-017-0581-6 (DOI), 000432601500001 (ISI), 2-s2.0-85032798462 (Scopus ID)
Note

QC 20180614

Available from: 2018-06-14. Created: 2018-06-14. Last updated: 2018-06-14. Bibliographically approved.
Du, M., Sassioui, R., Varisteas, G., State, R., Brorsson, M. & Cherkaoui, O. (2017). Improving real-time bidding using a constrained Markov decision process. In: 13th International Conference on Advanced Data Mining and Applications, ADMA 2017. Paper presented at 13th International Conference on Advanced Data Mining and Applications, ADMA 2017, Singapore, 5 November 2017 through 6 November 2017 (pp. 711-726). Springer, 10604
2017 (English). In: 13th International Conference on Advanced Data Mining and Applications, ADMA 2017, Springer, 2017, Vol. 10604, p. 711-726. Conference paper, Published paper (Refereed).
Abstract [en]

Online advertising is increasingly switching to real-time bidding on advertisement inventory, in which ad slots are sold through real-time auctions as users visit websites or use mobile apps. To compete with unknown bidders in such a highly stochastic environment, each bidder is required to estimate the value of each impression and to set a competitive bid price. Previous bidding algorithms have done so without considering the constraint of budget limits, which we address in this paper. We model the bidding process within a Constrained Markov Decision Process based reinforcement learning framework. Our model uses the predicted click-through rate as the state, the bid price as the action, and ad clicks as the reward. We propose a bidding function that outperforms state-of-the-art bidding functions in terms of the number of clicks when the budget limit is low. We further simulate different bidding functions competing in the same environment and report the performance of the bidding strategies when required to adapt to a dynamic environment.
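
A toy sketch of budget-constrained bidding in the spirit of the abstract, not the paper's learned CMDP policy: the state is the predicted click-through rate (pCTR), the action is a bid price, and the remaining budget throttles bids so the campaign survives the full period. All numbers are illustrative assumptions.

```python
def bid(predicted_ctr: float, base_value: float,
        budget_left: float, budget_total: float) -> float:
    """Return a bid price for one impression.

    base_value: the value of a click to the advertiser (assumed constant).
    The budget ratio scales bids down as the budget is depleted.
    """
    budget_ratio = budget_left / budget_total
    return predicted_ctr * base_value * budget_ratio

# Simulate a stream of impressions under a hard budget limit.
budget_total = budget_left = 100.0
for pctr in [0.01, 0.05, 0.002, 0.08, 0.03]:
    price = bid(pctr, base_value=500.0, budget_left=budget_left,
                budget_total=budget_total)
    price = min(price, budget_left)   # never bid more than we have
    budget_left -= price              # assume we win a first-price auction
    print(f"pCTR={pctr:.3f} bid={price:.2f} budget_left={budget_left:.2f}")
```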

Place, publisher, year, edition, pages
Springer, 2017
Series
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), ISSN 0302-9743 ; 10604
Keywords
Display Advertising, Markov Decision Process, Real-time bidding, Reinforcement Learning
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-218314 (URN), 10.1007/978-3-319-69179-4_50 (DOI), 000449973300050 (ISI), 2-s2.0-85033689734 (Scopus ID), 9783319691787 (ISBN)
Conference
13th International Conference on Advanced Data Mining and Applications, ADMA 2017, Singapore, 5 November 2017 through 6 November 2017
Note

QC 20171127

Available from: 2017-11-27. Created: 2017-11-27. Last updated: 2018-11-26. Bibliographically approved.
Muddukrishna, A., Jonsson, P. A., Podobas, A. & Brorsson, M. (2016). Grain Graphs: OpenMP Performance Analysis Made Easy. Paper presented at the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP'16). Association for Computing Machinery (ACM)
2016 (English). Conference paper, Published paper (Refereed).
Abstract [en]

Average programmers struggle to solve performance problems in OpenMP programs with tasks and parallel for-loops. Existing performance analysis tools visualize OpenMP task performance from the runtime system's perspective, where task execution is interleaved with other tasks in an unpredictable order. Problems with OpenMP parallel for-loops are similarly difficult to resolve, since tools only visualize aggregate thread-level statistics such as load imbalance without drilling down to per-chunk granularity. This runtime- and thread-oriented visualization provides poor support for understanding problems with task and chunk execution time, parallelism, and memory hierarchy utilization, forcing average programmers to rely on experts or use tedious trial-and-error tuning methods for performance. We present grain graphs, a new OpenMP performance analysis method that visualizes grains, the computation performed by a single task or parallel for-loop chunk instance, and highlights problems such as low parallelism, work inflation, and poor parallelization benefit at the grain level. We demonstrate that grain graphs can quickly reveal performance problems that are difficult to detect and characterize in fine detail using existing visualizations in standard OpenMP programs, simplifying OpenMP performance analysis. This enables average programmers to make portable optimizations for poorly performing OpenMP programs, reducing pressure on experts and removing the need for tedious trial-and-error tuning.
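
A small sketch of the grain idea only: treat each task or chunk instance as a "grain" with begin/end timestamps, then flag grains that run with little concurrent work. The trace below is made up; real grain graphs are built from OpenMP runtime instrumentation, which this does not reproduce.

```python
grains = [  # (name, begin_time, end_time) in ms; illustrative trace
    ("task_a", 0, 40), ("task_b", 5, 25), ("chunk_0", 10, 30),
    ("task_c", 45, 90),  # runs alone: a low-parallelism region
]

def concurrency(grain, all_grains):
    """Count grains whose lifetime overlaps this grain's lifetime."""
    _, b, e = grain
    return sum(1 for (_, b2, e2) in all_grains if b2 < e and e2 > b)

for g in grains:
    c = concurrency(g, grains)
    flag = "  <-- low parallelism" if c == 1 else ""
    print(f"{g[0]}: duration={g[2] - g[1]}ms, concurrent grains={c}{flag}")
```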

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2016
Keywords
OpenMP, Performance Analysis, Parallel Programming
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-179668 (URN), 10.1145/2851141.2851156 (DOI), 000393580200029 (ISI), 2-s2.0-84963732767 (Scopus ID)
Conference
21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP'16)
Note

QC 20170313

Available from: 2015-12-18. Created: 2015-12-18. Last updated: 2017-03-13. Bibliographically approved.
Muddukrishna, A., Jonsson, P. A. & Brorsson, M. (2015). Characterizing task-based OpenMP programs. PLoS ONE, 10(4), e0123545
2015 (English). In: PLoS ONE, ISSN 1932-6203, E-ISSN 1932-6203, Vol. 10, no. 4, p. e0123545. Article in journal (Refereed). Published.
Abstract [en]

Programmers struggle to understand the performance of task-based OpenMP programs since profiling tools only report thread-based performance. Performance tuning also requires task-based performance information in order to balance per-task memory hierarchy utilization against exposed task parallelism. We provide a cost-effective method to extract detailed task-based performance information from OpenMP programs. We demonstrate the utility of our method by quickly diagnosing performance problems and characterizing the exposed task parallelism and per-task instruction profiles of benchmarks in the widely used Barcelona OpenMP Tasks Suite. Using our method, programmers can tune performance faster and understand performance tradeoffs more effectively than with existing tools.

Keywords
Scheduling Strategies, Performance Analysis, Benchmark
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-141201 (URN), 10.1371/journal.pone.0123545 (DOI), 000352590300104 (ISI), 25860023 (PubMedID), 2-s2.0-84929498034 (Scopus ID)
Note

QC 20150623. Updated from "Manuscript" to "Article".

Available from: 2014-02-12. Created: 2014-02-12. Last updated: 2017-12-06. Bibliographically approved.
Issa, S., Romano, P. & Brorsson, M. (2015). Green-CM: Energy efficient contention management for Transactional Memory. In: 44th International Conference on Parallel Processing (ICPP), September 2015. Paper presented at the International Conference on Parallel Processing (ICPP). IEEE
2015 (English). In: 44th International Conference on Parallel Processing (ICPP), September 2015, IEEE, 2015. Conference paper, Published paper (Refereed).
Abstract [en]

Transactional memory (TM) is emerging as an attractive synchronization mechanism for concurrent computing. In this work we aim at filling a relevant gap in the TM literature by investigating the issue of energy efficiency for one crucial building block of TM systems: contention management. Green-CM, the solution proposed in this paper, is the first contention management scheme explicitly designed to jointly optimize both performance and energy consumption. To this end, Green-CM combines three key mechanisms: i) it leverages a novel asymmetric design, which combines different backoff policies in order to take advantage of dynamic frequency and voltage scaling; ii) it introduces an energy-efficient design of the backoff mechanism, which combines spin-based and sleep-based implementations; iii) it makes extensive use of self-tuning mechanisms to pursue optimal efficiency across highly heterogeneous workloads. We evaluate Green-CM from both the energy and performance perspectives, and show that it can enhance efficiency by up to 2.35 times with respect to state-of-the-art contention managers, with an average gain of more than 60% when using 64 threads.
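
A sketch of the spin-vs-sleep trade-off mentioned in point ii), not Green-CM's actual implementation: short backoffs busy-wait (low latency, high power draw), while long backoffs sleep (higher latency, but the idle core lets dynamic voltage and frequency scaling kick in). The threshold value is an illustrative assumption.

```python
import time

SPIN_THRESHOLD_S = 50e-6  # assumed point where sleeping beats spinning

def backoff(attempt: int, base_s: float = 1e-6) -> None:
    """Exponential backoff that spins when short and sleeps when long."""
    delay = base_s * (2 ** attempt)
    if delay < SPIN_THRESHOLD_S:
        end = time.perf_counter() + delay
        while time.perf_counter() < end:
            pass                      # busy-wait: keeps the core hot
    else:
        time.sleep(delay)             # yield: saves energy, enables DVFS

for attempt in range(8):
    t0 = time.perf_counter()
    backoff(attempt)
    print(f"attempt {attempt}: waited {time.perf_counter() - t0:.6f} s")
```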

Place, publisher, year, edition, pages
IEEE, 2015
Keywords
Transactional Memory, Energy Efficiency
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-185325 (URN), 10.1109/ICPP.2015.64 (DOI), 000379202700056 (ISI), 2-s2.0-84976511926 (Scopus ID)
Conference
International Conference on Parallel Processing (ICPP)
Note

QC 20160421

Available from: 2016-04-16. Created: 2016-04-16. Last updated: 2016-08-12. Bibliographically approved.
Awan, A. J., Brorsson, M., Vlassov, V. & Ayguade, E. (2015). How Data Volume Affects Spark Based Data Analytics on a Scale-up Server. In: Big Data Benchmarks, Performance Optimization, and Emerging Hardware: 6th Workshop, BPOE 2015, Kohala, HI, USA, August 31 - September 4, 2015. Revised Selected Papers. Paper presented at 6th International Workshop on Bigdata Benchmarks, Performance Optimization and Emerging Hardware (BpoE), held in conjunction with 41st International Conference on Very Large Data Bases (VLDB), Kohala, HI, USA, August 31 - September 4, 2015 (pp. 81-92). Springer, 9495
2015 (English). In: Big Data Benchmarks, Performance Optimization, and Emerging Hardware: 6th Workshop, BPOE 2015, Kohala, HI, USA, August 31 - September 4, 2015. Revised Selected Papers, Springer, 2015, Vol. 9495, p. 81-92. Conference paper, Published paper (Refereed).
Abstract [en]

The sheer increase in data volume over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark is gaining popularity for exhibiting superior scale-out performance on commodity machines, the impact of data volume on the performance of Spark-based data analytics in a scale-up configuration is not well understood. We present a deep-dive analysis of Spark-based applications on a large scale-up server machine. Our analysis reveals that Spark-based data analytics are DRAM-bound and do not benefit from using more than 12 cores for an executor. As input data size grows, application performance degrades significantly due to a substantial increase in wait time during I/O operations and garbage collection, despite a 10% better instruction retirement rate (due to lower L1 cache misses and higher core utilization). We match the memory behaviour with the garbage collector, improving application performance by 1.6x to 3x.

Place, publisher, year, edition, pages
Springer, 2015
Series
Lecture Notes in Computer Science
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-181325 (URN), 10.1007/978-3-319-29006-5_7 (DOI), 2-s2.0-84958073801 (Scopus ID), 978-3-319-29005-8 (ISBN)
Conference
6th International Workshop on Bigdata Benchmarks, Performance Optimization and Emerging Hardware (BpoE), held in conjunction with 41st International Conference on Very Large Data Bases (VLDB),Kohala, HI, USA, August 31 - September 4, 2015
Note

QC 20160224

Available from: 2016-02-01. Created: 2016-02-01. Last updated: 2016-04-25. Bibliographically approved.
Muddukrishna, A., Jonsson, P. A. & Brorsson, M. (2015). Locality-aware Task Scheduling and Data Distribution for OpenMP Programs on NUMA Systems and Manycore Processors. Scientific Programming, Article ID 981759.
2015 (English). In: Scientific Programming, ISSN 1058-9244, E-ISSN 1875-919X, article id 981759. Article in journal (Refereed). Published.
Abstract [en]

Performance degradation due to nonuniform data access latencies has worsened on NUMA systems and can now be felt on-chip in manycore processors. Distributing data across NUMA nodes and on manycore processors is necessary to reduce the impact of nonuniform latencies. However, techniques for distributing data are error-prone and fragile and require low-level architectural knowledge. Existing task scheduling policies favor quick load balancing at the expense of locality and ignore NUMA node access latencies while scheduling. Locality-aware scheduling, in conjunction with or as a replacement for existing scheduling, is necessary to minimize NUMA effects and sustain performance. We present a data distribution and locality-aware scheduling technique for task-based OpenMP programs executing on NUMA systems and manycore processors. Our technique relieves the programmer from thinking about NUMA architecture details by delegating data distribution to the runtime system, and uses task data dependence information to guide the scheduling of OpenMP tasks so as to reduce data stall times. We demonstrate our technique on a four-socket AMD Opteron machine with eight NUMA nodes and on the TILEPro64 processor, and we find that data distribution and locality-aware task scheduling improve performance by up to 69% for scientific benchmarks compared to default policies, while providing an architecture-oblivious approach for programmers.
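
A toy sketch of dependence-guided placement in the spirit of the abstract, not the paper's runtime system: each task is scheduled on the NUMA node that already holds most of the data it depends on. The node and data-block assignments are made up.

```python
from collections import Counter

data_home = {"A": 0, "B": 0, "C": 1, "D": 1}  # data block -> NUMA node

tasks = {  # task -> data blocks in its depend clauses (illustrative)
    "t1": ["A", "B"], "t2": ["C"], "t3": ["B", "C", "D"],
}

for task, deps in tasks.items():
    # Pick the node holding the majority of the task's dependent data,
    # reducing remote-memory accesses and thus data stall time.
    node = Counter(data_home[d] for d in deps).most_common(1)[0][0]
    print(f"{task} -> NUMA node {node} (depends on {deps})")
```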

Place, publisher, year, edition, pages
Hindawi Publishing Corporation, 2015
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-166580 (URN), 10.1155/2015/981759 (DOI), 000364899300001 (ISI), 2-s2.0-84947272497 (Scopus ID)
Note

QC 20150615

Available from: 2015-05-11. Created: 2015-05-11. Last updated: 2018-01-11. Bibliographically approved.
Varisteas, G. & Brorsson, M. (2015). Palirria: accurate on-line parallelism estimation for adaptive work-stealing. Concurrency and Computation
2015 (English). In: Concurrency and Computation, ISSN 1532-0626, E-ISSN 1532-0634. Article in journal (Refereed). Published.
Abstract [en]

We present Palirria, a self-adapting work-stealing scheduling method for nested fork/join parallelism that estimates the number of utilizable workers and self-adapts accordingly. The estimation mechanism is optimized for accuracy, minimizing the requested resources without degrading performance. We implemented Palirria for both the Linux and Barrelfish operating systems and evaluated it on two platforms: a 48-core Non-Uniform Memory Access (NUMA) multiprocessor and a simulated 32-core system. Compared with the state of the art, we observed higher accuracy in estimating resource requirements, leading to improved resource utilization and performance on par with or better than executing with fixed resource allotments.
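
A crude sketch of the self-adaptation idea, not Palirria's actual estimation mechanism: grow the worker count while steal attempts mostly succeed (plenty of surplus tasks) and shrink it when they mostly fail. The thresholds are illustrative assumptions.

```python
def adapt_workers(current: int, steal_attempts: int, steal_successes: int,
                  max_workers: int) -> int:
    """Return the worker count to use in the next scheduling interval."""
    if steal_attempts == 0:
        return current
    success_rate = steal_successes / steal_attempts
    if success_rate > 0.8 and current < max_workers:
        return current + 1   # abundant parallelism: request another worker
    if success_rate < 0.2 and current > 1:
        return current - 1   # steals keep failing: release a worker
    return current

workers = 4
for attempts, successes in [(100, 95), (100, 90), (100, 10), (100, 50)]:
    workers = adapt_workers(workers, attempts, successes, max_workers=48)
    print(f"steal success rate {successes / attempts:.0%} -> {workers} workers")
```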

Place, publisher, year, edition, pages
John Wiley & Sons, 2015
Keywords
multicore, parallel, scheduler, workload, runtime, task, adaptive, resource management, load balancing, work-stealing
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-175460 (URN), 10.1002/cpe.3630 (DOI), 000369832200020 (ISI), 2-s2.0-84956796768 (Scopus ID)
Note

QC 20160307

Available from: 2015-10-15. Created: 2015-10-15. Last updated: 2017-12-01. Bibliographically approved.
Javed Awan, A., Brorsson, M., Vlassov, V. & Ayguade, E. (2015). Performance Characterization of In-Memory Data Analytics on a Modern Cloud Server. In: Proceedings - 2015 IEEE 5th International Conference on Big Data and Cloud Computing, BDCloud 2015. Paper presented at Big Data and Cloud Computing (BDCloud), 2015 IEEE Fifth International Conference on, Dalian, China, 26-28 Aug. 2015 (pp. 1-8). IEEE Computer Society, Article ID 7310708.
2015 (English). In: Proceedings - 2015 IEEE 5th International Conference on Big Data and Cloud Computing, BDCloud 2015, IEEE Computer Society, 2015, p. 1-8, article id 7310708. Conference paper, Published paper (Refereed).
Abstract [en]

In the last decade, data analytics has rapidly progressed from traditional disk-based processing to modern in-memory processing. However, little effort has been devoted to enhancing performance at the micro-architecture level. This paper characterizes the performance of in-memory data analytics using the Apache Spark framework. We use a single-node NUMA machine and identify the bottlenecks hampering the scalability of workloads. We also quantify the inefficiencies at the micro-architecture level for various data analysis workloads. Through empirical evaluation, we show that Spark workloads do not scale linearly beyond twelve threads, due to work time inflation and thread-level load imbalance. Further, at the micro-architecture level, we observe memory-bound latency to be the major cause of work time inflation.

Place, publisher, year, edition, pages
IEEE Computer Society, 2015
Keywords
cloud chambers, cloud computing, data analysis, resource allocation, storage management, Apache Spark framework, Spark workload, data analysis workload, disk-based processing, in-memory data analytics, in-memory processing, memory bound latency, microarchitecture level performance, modern cloud server, performance characterization, single node NUMA machine, thread level load imbalance, work time inflation, workload scalability, Benchmark testing, Big data, Data analysis, Instruction sets, Scalability, Servers, Sparks, Data Analytics, NUMA, Spark Performance, Workload Characterization
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-179403 (URN), 10.1109/BDCloud.2015.37 (DOI), 000380444200001 (ISI), 2-s2.0-84962757128 (Scopus ID), 978-1-4673-7182-7 (ISBN)
Conference
Big Data and Cloud Computing (BDCloud), 2015 IEEE Fifth International Conference on, Dalian, China, 26-28 Aug. 2015
Note

QC 20160118 QC 20160922

Available from: 2015-12-16. Created: 2015-12-16. Last updated: 2016-09-22. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0002-9637-2065