Publications (10 of 92)
Oz, I., Bhatti, M. K., Popov, K. & Brorsson, M. (2019). Regression-Based Prediction for Task-Based Program Performance. Journal of Circuits, Systems and Computers, 28(4), Article ID 1950060.
Regression-Based Prediction for Task-Based Program Performance
2019 (English) In: Journal of Circuits, Systems and Computers, ISSN 0218-1266, Vol. 28, no 4, article id 1950060. Article in journal (Refereed) Published
Abstract [en]

As multicore systems evolve by increasing the number of parallel execution units, parallel programming models have been released to exploit parallelism in applications. The task-based programming model uses task abstractions to specify parallel tasks and schedules tasks onto processors at runtime. To increase efficiency and obtain the highest performance, it is necessary to identify which runtime configuration is needed and how processor cores should be shared among tasks. Exploring the design space for all possible scheduling and runtime options, especially for large input data, becomes infeasible and calls for statistical modeling. Regression-based modeling determines the effects of multiple factors on a response variable and makes predictions based on statistical analysis. In this work, we propose a regression-based modeling approach to predict task-based program performance for different scheduling parameters with variable data size. We execute a set of task-based programs while varying the runtime parameters, and conduct a systematic measurement of the factors influencing execution time. Our approach uses executions with different configurations for a set of input data, and derives regression models to predict execution time for larger input data. Our results show that the regression models provide accurate predictions for validation inputs, with a mean error rate as low as 6.3% and 14% on average across four task-based programs.
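
As a rough illustration of the general approach described in the abstract, the sketch below fits a regression model on hypothetical (runtime configuration, input size) measurements of execution time and extrapolates to a larger input. The feature set, model family, and numbers are illustrative assumptions, not the paper's.

```python
# Illustrative sketch only -- not the paper's actual feature set, model, or data.
# Fit a regression model that maps (thread count, scheduler id, input size)
# to execution time, then extrapolate to a larger input than seen in training.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical training measurements: columns = threads, scheduler id, input size (millions)
X_train = np.array([
    [2, 0, 1], [4, 0, 1], [8, 0, 1],
    [2, 1, 2], [4, 1, 2], [8, 1, 2],
], dtype=float)
y_train = np.array([12.0, 6.5, 3.8, 25.0, 13.2, 7.9])  # execution time in seconds

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_train, y_train)

# Predict execution time for a much larger input size (8 million elements).
X_new = np.array([[8, 1, 8]], dtype=float)
print("predicted execution time (s):", model.predict(X_new)[0])
```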

Place, publisher, year, edition, pages
World Scientific Publishing Co Pte Ltd, 2019
Keywords
Performance prediction, task-based programs, regression
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-249799 (URN), 10.1142/S0218126619500609 (DOI), 000462969800009 (ISI), 2-s2.0-85049081368 (Scopus ID)
Note

QC 20190424

Available from: 2019-04-24 Created: 2019-04-24 Last updated: 2022-06-26. Bibliographically approved
Du, M., Hammerschmidt, C., Varisteas, G., State, R., Brorsson, M. & Zhang, Z. (2019). Time series modeling of market price in real-time bidding. In: ESANN 2019 - Proceedings, 27th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Paper presented at 27th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2019, 24 April 2019 through 26 April 2019 (pp. 643-648). ESANN
Time series modeling of market price in real-time bidding
2019 (English) In: ESANN 2019 - Proceedings, 27th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN, 2019, p. 643-648. Conference paper, Published paper (Refereed)
Abstract [en]

Real-Time Bidding (RTB) is one of the most popular online advertisement selling mechanisms. Modeling the highly dynamic bidding environment is crucial for making good bids. Market prices of auctions fluctuate heavily within short time spans. State-of-the-art methods neglect the temporal dependencies of bidders' behaviors. In this paper, the bid requests are aggregated by time and the mean market price per aggregated segment is modeled as a time series. We show that the Long Short-Term Memory (LSTM) neural network outperforms state-of-the-art univariate time series models by capturing the nonlinear temporal dependencies in the market price. We further improve the prediction performance by adding a summary of exogenous features from bid requests.
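
A minimal sketch of the kind of model described in the abstract: a univariate LSTM that predicts the next segment's mean market price from a window of previous segments. The data, window length, and network size here are placeholder assumptions (using TensorFlow/Keras), not the paper's setup.

```python
# Illustrative sketch only -- not the paper's model or data.
# Aggregate bids into fixed time segments, then train an LSTM to predict
# the next segment's mean market price from a window of previous segments.
import numpy as np
import tensorflow as tf

prices = np.random.rand(500).astype("float32")  # placeholder mean price per segment
window = 20

X = np.stack([prices[i:i + window] for i in range(len(prices) - window)])
y = prices[window:]
X = X[..., None]  # shape: (samples, window, 1 feature)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(window, 1)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

next_price = model.predict(X[-1:], verbose=0)[0, 0]
print("predicted mean market price for next segment:", next_price)
```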

Place, publisher, year, edition, pages
ESANN, 2019
Keywords
Commerce, Machine learning, Time series, Dynamic biddings, Market price, Online advertisements, State of the art, State-of-the-art methods, Time series modeling, Time span, Univariate time series models, Long short-term memory
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering; Economics; Computer Systems
Identifiers
urn:nbn:se:kth:diva-301579 (URN), 2-s2.0-85071306494 (Scopus ID)
Conference
27th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2019, 24 April 2019 through 26 April 2019
Note

Part of ISBN 9782875870650

QC 20210913

Available from: 2021-09-13 Created: 2021-09-13 Last updated: 2024-03-11. Bibliographically approved
Bhatti, M. K., Oz, I., Amin, S., Mushtaq, M., Farooq, U., Popov, K. & Brorsson, M. (2018). Locality-aware task scheduling for homogeneous parallel computing systems. Computing, 100(6), 557-595
Locality-aware task scheduling for homogeneous parallel computing systems
2018 (English) In: Computing, ISSN 0010-485X, E-ISSN 1436-5057, Vol. 100, no 6, p. 557-595. Article in journal (Refereed) Published
Abstract [en]

In systems with a complex many-core cache hierarchy, exploiting data locality can significantly reduce the execution time and energy consumption of parallel applications. Locality can be exploited at various hardware and software layers. For instance, by implementing private and shared caches in a multi-level fashion, recent hardware designs are already optimised for locality. However, these optimisations are of little use if software scheduling does not cast the execution in a manner that promotes the locality available in the programs themselves. Since programs for parallel systems consist of tasks executed simultaneously, task scheduling becomes crucial for performance in multi-level cache architectures. This paper presents a heuristic algorithm for homogeneous multi-core systems called locality-aware task scheduling (LeTS). The LeTS heuristic is a work-conserving algorithm that takes into account both locality and load balancing in order to reduce the execution time of target applications. The working principle of LeTS is based on two distinctive phases, namely the working task group formation phase (WTG-FP) and the working task group ordering phase (WTG-OP). The WTG-FP forms groups of tasks in order to capture data reuse across tasks, while the WTG-OP determines an optimal order of execution for task groups that minimizes the reuse distance of shared data between tasks. We have performed experiments using randomly generated task graphs, varying three major performance parameters: (1) communication-to-computation ratio (CCR) between 0.1 and 1.0, (2) application size, i.e., task graphs comprising 50, 100, and 300 tasks per graph, and (3) number of cores, with 2-, 4-, 8-, and 16-core execution scenarios. We have also performed experiments using selected real-world applications. The LeTS heuristic reduces the overall execution time of applications by exploiting inter-task data locality. Results show that LeTS outperforms state-of-the-art algorithms in amortizing inter-task communication cost.
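
The sketch below is a drastically simplified two-phase grouping-and-ordering heuristic in the spirit of the abstract, not the LeTS algorithm itself: tasks touching the same data are grouped, groups sharing data are run back-to-back, and groups are assigned to the least-loaded core. Task names, costs, and data blocks are made up for illustration.

```python
# Simplified illustration only -- not the LeTS algorithm from the paper.
# Phase 1 (grouping): cluster tasks that touch the same data blocks.
# Phase 2 (ordering): run groups that share data consecutively to shorten
# reuse distance, then assign each group greedily to the least-loaded core.
from collections import defaultdict

# Hypothetical task set: task -> (cost, data blocks it touches)
tasks = {
    "t1": (4, {"A"}), "t2": (3, {"A"}), "t3": (5, {"B"}),
    "t4": (2, {"B"}), "t5": (6, {"A", "B"}), "t6": (1, {"C"}),
}

# Phase 1: form working task groups keyed by the data they touch.
groups = defaultdict(list)
for name, (cost, data) in tasks.items():
    groups[frozenset(data)].append(name)

# Phase 2: order groups so consecutive groups overlap in data when possible.
ordered, remaining = [], list(groups)
current = remaining.pop(0)
ordered.append(current)
while remaining:
    nxt = max(remaining, key=lambda g: len(g & current))  # prefer shared data
    remaining.remove(nxt)
    ordered.append(nxt)
    current = nxt

# Greedy load-balanced assignment of each group to the least-loaded core.
cores = {0: 0.0, 1: 0.0}
for key in ordered:
    core = min(cores, key=cores.get)
    cores[core] += sum(tasks[t][0] for t in groups[key])
    print(f"core {core} <- group {sorted(groups[key])} (data {set(key)})")
print("per-core load:", cores)
```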

Place, publisher, year, edition, pages
Springer, 2018
Keywords
Runtime resource management, Parallel computing, Multicore scheduling, Homogeneous systems, Directed acyclic graph (DAG), Embedded systems
National Category
Other Physics Topics
Identifiers
urn:nbn:se:kth:diva-230491 (URN), 10.1007/s00607-017-0581-6 (DOI), 000432601500001 (ISI), 2-s2.0-85032798462 (Scopus ID)
Note

QC 20180614

Available from: 2018-06-14 Created: 2018-06-14 Last updated: 2022-06-26. Bibliographically approved
Javed Awan, A., Ohara, M., Ayguade, E., Ishizaki, K., Brorsson, M. & Vlassov, V. (2017). Identifying the potential of Near Data Processing for Apache Spark. In: Proceedings of the International Symposium on Memory Systems, MEMSYS 2017. Paper presented at the International Symposium on Memory Systems, MEMSYS 2017, Alexandria, VA, USA, October 02-05, 2017 (pp. 60-67). Association for Computing Machinery (ACM), Article ID F131197.
Identifying the potential of Near Data Processing for Apache Spark
2017 (English) In: Proceedings of the International Symposium on Memory Systems, MEMSYS 2017, Association for Computing Machinery (ACM), 2017, p. 60-67, article id F131197. Conference paper, Published paper (Refereed)
Abstract [en]

While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics by being a unified framework for both batch and stream data processing. There is also a renewed interest in Near Data Processing (NDP) due to technological advancement in the last decade. However, it is not known whether NDP architectures can improve the performance of big data processing frameworks such as Apache Spark. In this paper, we build the case for an NDP architecture comprising programmable-logic-based hybrid 2D integrated processing-in-memory and in-storage processing for Apache Spark, through extensive profiling of Apache Spark based workloads on an Ivy Bridge server.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2017
Keywords
Processing-in-memory, In-storage Processing, Apache Spark
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Information and Communication Technology
Identifiers
urn:nbn:se:kth:diva-211727 (URN), 10.1145/3132402.3132427 (DOI), 000557248700006 (ISI), 2-s2.0-85033586379 (Scopus ID)
Conference
International Symposium on Memory Systems, MEMSYS 2017, Alexandria, VA, USA, October 02-05, 2017
Note

ISBN for proceedings: 9781450353359

QC 20171124

QC 20210518

Available from: 2017-08-11 Created: 2017-08-11 Last updated: 2023-03-06. Bibliographically approved
Du, M., Sassioui, R., Varisteas, G., State, R., Brorsson, M. & Cherkaoui, O. (2017). Improving real-time bidding using a constrained markov decision process. In: 13th International Conference on Advanced Data Mining and Applications, ADMA 2017. Paper presented at 13th International Conference on Advanced Data Mining and Applications, ADMA 2017, Singapore, 5 November 2017 through 6 November 2017 (pp. 711-726). Springer, 10604
Improving real-time bidding using a constrained markov decision process
2017 (English) In: 13th International Conference on Advanced Data Mining and Applications, ADMA 2017, Springer, 2017, Vol. 10604, p. 711-726. Conference paper, Published paper (Refereed)
Abstract [en]

Online advertising is increasingly switching to real-time bidding on advertisement inventory, in which the ad slots are sold through real-time auctions upon users visiting websites or using mobile apps. To compete with unknown bidders in such a highly stochastic environment, each bidder is required to estimate the value of each impression and to set a competitive bid price. Previous bidding algorithms have done so without considering the constraint of budget limits, which we address in this paper. We model the bidding process in a reinforcement learning framework based on a Constrained Markov Decision Process (CMDP). Our model uses the predicted click-through rate as the state, the bid price as the action, and ad clicks as the reward. We propose a bidding function which outperforms state-of-the-art bidding functions in terms of the number of clicks when the budget limit is low. We further simulate different bidding functions competing in the same environment and report the performance of the bidding strategies when required to adapt to a dynamic environment.
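
The toy sketch below is not the paper's CMDP solution; it only captures the ingredients named in the abstract (state = predicted click-through rate, action = bid price, reward = clicks, subject to a budget) with a simple budget-paced bidding rule. All request data and parameters are simulated placeholders.

```python
# Toy sketch only -- not the paper's CMDP-based bidding function.
# Bid = base bid scaled by predicted CTR and throttled by the fraction of
# budget remaining (budget pacing), evaluated on a fake request stream.
import random

def bid(predicted_ctr, base_bid, budget_left, budget_total):
    pacing = budget_left / budget_total          # spend more cautiously as budget drains
    return base_bid * predicted_ctr * pacing

budget_total = budget_left = 1000.0
clicks = 0
random.seed(0)

for _ in range(10_000):                          # hypothetical stream of bid requests
    ctr = random.betavariate(2, 50)              # fake predicted click-through rate
    market_price = random.uniform(0.0, 1.0)      # fake highest competing bid
    my_bid = bid(ctr, base_bid=40.0, budget_left=budget_left, budget_total=budget_total)
    if my_bid >= market_price and budget_left >= market_price:
        budget_left -= market_price              # second-price style payment
        clicks += random.random() < ctr          # click realised with probability CTR

print(f"clicks: {clicks}, budget spent: {budget_total - budget_left:.2f}")
```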

Place, publisher, year, edition, pages
Springer, 2017
Series
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), ISSN 0302-9743 ; 10604
Keywords
Display Advertising, Markov Decision Process, Real-time bidding, Reinforcement Learning
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-218314 (URN), 10.1007/978-3-319-69179-4_50 (DOI), 000449973300050 (ISI), 2-s2.0-85033689734 (Scopus ID), 9783319691787 (ISBN)
Conference
13th International Conference on Advanced Data Mining and Applications, ADMA 2017, Singapore, 5 November 2017 through 6 November 2017
Note

QC 20171127

Available from: 2017-11-27 Created: 2017-11-27 Last updated: 2022-06-26. Bibliographically approved
Aldinucci, M., Brorsson, M., D'Agostino, D., Daneshtalab, M., Kilpatrick, P. & Leppanen, V. (2017). Preface. The International Journal of High Performance Computing Applications, 31(3), 179-180
Preface
2017 (English) In: The International Journal of High Performance Computing Applications, ISSN 1094-3420, E-ISSN 1741-2846, Vol. 31, no 3, p. 179-180. Article in journal, Editorial material (Refereed) Published
Place, publisher, year, edition, pages
Sage Publications, 2017
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-208829 (URN), 10.1177/1094342016680468 (DOI), 000401791100001 (ISI), 2-s2.0-85019927673 (Scopus ID)
Note

QC 20170614

Available from: 2017-06-14 Created: 2017-06-14 Last updated: 2024-03-15. Bibliographically approved
Podobas, A. & Brorsson, M. (2016). Empowering OpenMP with Automatically Generated Hardware. In: International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation. Paper presented at SAMOS XVI.
Empowering OpenMP with Automatically Generated Hardware
2016 (English) In: International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, 2016. Conference paper, Published paper (Refereed)
Abstract [en]

OpenMP enables productive software development targeting shared-memory general-purpose systems. However, OpenMP compilers today have little support for future heterogeneous systems, which will more than likely contain Field Programmable Gate Arrays (FPGAs) to compensate for the limited parallelism available in general-purpose systems. We have designed a high-level synthesis flow that automatically generates parallel hardware from unmodified OpenMP programs. The generated hardware is composed of accelerators tailored to act as hardware instances of the OpenMP task primitive. We drive decision making about complex details within the accelerators through a constraint-programming model, minimizing the expected input from the (often) hardware-oblivious software developer. We evaluate our system against two state-of-the-art architectures, the Xeon Phi and the AMD Opteron, and find our accelerators to perform on par with these ASIC processors.

Keywords
OpenMP, FPGA, High-Level Synthesis, Tasks, Reconfigurability
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-193705 (URN), 10.1109/SAMOS.2016.7818354 (DOI), 000399143000032 (ISI), 2-s2.0-85013823668 (Scopus ID)
Conference
SAMOS XVI
Note

QC 20161010

Available from: 2016-10-10 Created: 2016-10-10 Last updated: 2024-03-18. Bibliographically approved
Muddukrishna, A., Jonsson, P. A., Podobas, A. & Brorsson, M. (2016). Grain Graphs: OpenMP Performance Analysis Made Easy. Paper presented at 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP'16). Association for Computing Machinery (ACM)
Grain Graphs: OpenMP Performance Analysis Made Easy
2016 (English) Conference paper, Published paper (Refereed)
Abstract [en]

Average programmers struggle to solve performance problems in OpenMP programs with tasks and parallel for-loops. Existing performance analysis tools visualize OpenMP task performance from the runtime system's perspective, where task execution is interleaved with other tasks in an unpredictable order. Problems with OpenMP parallel for-loops are similarly difficult to resolve, since tools only visualize aggregate thread-level statistics such as load imbalance without zooming into a per-chunk granularity. The runtime system/threads oriented visualization provides poor support for understanding problems with task and chunk execution time, parallelism, and memory hierarchy utilization, forcing average programmers to rely on experts or use tedious trial-and-error tuning methods for performance. We present grain graphs, a new OpenMP performance analysis method that visualizes grains (the computation performed by a task or a parallel for-loop chunk instance) and highlights problems such as low parallelism, work inflation and poor parallelization benefit at the grain level. We demonstrate that grain graphs can quickly reveal performance problems that are difficult to detect and characterize in fine detail using existing visualizations in standard OpenMP programs, simplifying OpenMP performance analysis. This enables average programmers to make portable optimizations for poorly performing OpenMP programs, reducing pressure on experts and removing the need for tedious trial-and-error tuning.
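
The sketch below is not the grain graphs tool; it only illustrates the underlying idea of per-grain analysis: treat each task or chunk instance as a grain with timing and work attributes, then flag grains whose metrics suggest low parallelism or work inflation. The trace, thresholds, and field names are assumptions for illustration.

```python
# Illustrative sketch only -- not the grain graphs method from the paper.
# Each task/chunk instance is a "grain" with start time, end time, and useful
# work; compute region-level parallelism and flag suspicious grains.
from dataclasses import dataclass

@dataclass
class Grain:
    name: str
    start: float          # seconds
    end: float
    work: float           # useful computation time inside the grain

    @property
    def duration(self) -> float:
        return self.end - self.start

# Hypothetical trace of grains from one parallel region.
grains = [
    Grain("task_0", 0.0, 1.0, 0.95),
    Grain("chunk_3", 0.1, 2.5, 1.10),  # long duration, comparatively little useful work
    Grain("task_7", 1.0, 1.2, 0.19),
]

region_span = max(g.end for g in grains) - min(g.start for g in grains)
total_work = sum(g.work for g in grains)
print(f"average parallelism in region: {total_work / region_span:.2f}")

for g in grains:
    inflation = g.duration / g.work    # > 1 means time lost to overhead or stalls
    if inflation > 1.5:
        print(f"{g.name}: possible work inflation (x{inflation:.1f})")
```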

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2016
Keywords
OpenMP, Performance Analysis, Parallel Programming
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-179668 (URN), 10.1145/2851141.2851156 (DOI), 000393580200029 (ISI), 2-s2.0-84963732767 (Scopus ID)
Conference
21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP'16)
Note

QC 20170313

Available from: 2015-12-18 Created: 2015-12-18 Last updated: 2024-03-18. Bibliographically approved
Awan, A. J., Brorsson, M., Vlassov, V. & Ayguade, E. (2016). Micro-architectural Characterization of Apache Spark on Batch and Stream Processing Workloads. Paper presented at the 6th IEEE International Conference on Big Data and Cloud Computing (pp. 59-66). IEEE
Micro-architectural Characterization of Apache Spark on Batch and Stream Processing Workloads
2016 (English) Conference paper, Published paper (Refereed)
Abstract [en]

While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics by being a unified framework for both batch and stream data processing. However, recent studies on the micro-architectural characterization of in-memory data analytics are limited to batch processing workloads only. We compare the micro-architectural performance of batch processing and stream processing workloads in Apache Spark using hardware performance counters on a dual socket server. In our evaluation experiments, we have found that batch processing and stream processing have the same micro-architectural behavior in Spark if the two implementations differ only in micro-batching. If the input data rates are small, stream processing workloads are front-end bound. However, the front-end bound stalls are reduced at larger input data rates and instruction retirement is improved. Moreover, Spark workloads using DataFrames have improved instruction retirement over workloads using RDDs.

Place, publisher, year, edition, pages
IEEE, 2016
Keywords
Microarchitectural Performance, Spark Streaming, Workload Characterization
National Category
Computer Systems
Research subject
Information and Communication Technology
Identifiers
urn:nbn:se:kth:diva-196123 (URN), 10.1109/BDCloud-SocialCom-SustainCom.2016.20 (DOI), 000392516300009 (ISI), 2-s2.0-85000885440 (Scopus ID)
Conference
The 6th IEEE International Conference on Big Data and Cloud Computing
Note

QC 20161130

Available from: 2016-11-11 Created: 2016-11-11 Last updated: 2024-03-15. Bibliographically approved
Awan, A. J., Brorsson, M., Vlassov, V. & Ayguade, E. (2016). Node architecture implications for in-memory data analytics on scale-in clusters. Paper presented at 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (pp. 237-246). IEEE Press
Node architecture implications for in-memory data analytics on scale-in clusters
2016 (English) Conference paper, Published paper (Refereed)
Abstract [en]

While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics. Recent studies propose scale-in clusters with in-storage processing devices to process big data analytics with Spark. However, the proposal is based solely on the memory bandwidth characterization of in-memory data analytics and does not shed light on the specification of the host CPU and memory. Through empirical evaluation of in-memory data analytics with Apache Spark on an Ivy Bridge dual socket server, we have found that (i) simultaneous multi-threading is effective up to 6 cores, (ii) data locality on NUMA nodes can improve the performance by 10% on average, (iii) disabling next-line L1-D prefetchers can reduce the execution time by up to 14%, (iv) DDR3 operating at 1333 MT/s is sufficient, and (v) multiple small executors can provide up to 36% speedup over a single large executor.

Place, publisher, year, edition, pages
IEEE Press, 2016
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-198161 (URN), 10.1145/3006299.3006319 (DOI), 000408919800026 (ISI), 2-s2.0-85013223047 (Scopus ID)
Conference
3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies
Note

QC 20161219

Available from: 2016-12-13 Created: 2016-12-13 Last updated: 2024-03-15. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0002-9637-2065
