Publications (10 of 92)
Oz, I., Bhatti, M. K., Popov, K. & Brorsson, M. (2019). Regression-Based Prediction for Task-Based Program Performance. Journal of Circuits, Systems and Computers, 28(4), Article ID 1950060.
2019 (English). In: Journal of Circuits, Systems and Computers, ISSN 0218-1266, Vol. 28, No. 4, Article ID 1950060. Journal article (Refereed). Published.
Abstract [en]

As multicore systems evolve by increasing the number of parallel execution units, parallel programming models have been released to exploit parallelism in the applications. The task-based programming model uses task abstractions to specify parallel tasks and schedules tasks onto processors at runtime. To increase efficiency and obtain the highest performance, it is necessary to identify which runtime configuration is needed and how processor cores should be shared among tasks. Exploring the design space for all possible scheduling and runtime options, especially for large input data, becomes infeasible and calls for statistical modeling. Regression-based modeling determines the effects of multiple factors on a response variable and makes predictions based on statistical analysis. In this work, we propose a regression-based modeling approach to predict task-based program performance for different scheduling parameters with variable data size. We execute a set of task-based programs while varying the runtime parameters and conduct a systematic measurement of the factors influencing execution time. Our approach uses executions with different configurations for a set of input data and derives different regression models to predict execution time for larger input data. Our results show that the regression models provide accurate predictions for validation inputs, with a mean error rate as low as 6.3% and 14% on average across four task-based programs.
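As a rough illustration of the general approach (not the models used in the paper), one can fit a regression on execution times measured for small inputs and a few runtime configurations, then extrapolate to a larger input size. The feature set (core count, scheduler id, input size), the polynomial degree and the use of scikit-learn below are assumptions made for the sketch.

```python
# Hypothetical sketch of regression-based execution-time prediction.
# Features: (cores, scheduler_id, input_size); response: measured seconds.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X_train = np.array([
    [2, 0, 1e6], [4, 0, 1e6], [8, 0, 1e6],   # small-input training runs
    [2, 1, 2e6], [4, 1, 2e6], [8, 1, 2e6],
])
y_train = np.array([12.1, 6.4, 3.6, 25.0, 13.2, 7.5])   # made-up timings

# Degree-2 polynomial terms let the regression pick up simple interactions
# (e.g. between input size and core count) without hand-crafted features.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_train, y_train)

# Extrapolate to a larger, unseen input size for a given configuration.
print(model.predict(np.array([[8, 1, 8e6]])))
```

Validation against held-out runs, as in the error rates reported above, would replace the single prediction shown here.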

Place, publisher, year, edition, pages
World Scientific Publishing Co Pte Ltd, 2019
Keywords
Performance prediction, task-based programs, regression
HSV category
Identifiers
urn:nbn:se:kth:diva-249799 (URN), 10.1142/S0218126619500609 (DOI), 000462969800009 (ISI), 2-s2.0-85049081368 (Scopus ID)
Note

QC 20190424

Available from: 2019-04-24. Created: 2019-04-24. Last updated: 2022-06-26. Bibliographically approved.
Du, M., Hammerschmidt, C., Varisteas, G., State, R., Brorsson, M. & Zhang, Z. (2019). Time series modeling of market price in real-time bidding. In: ESANN 2019 - Proceedings, 27th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning. Paper presented at 27th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2019, 24 April 2019 through 26 April 2019 (pp. 643-648). ESANN
2019 (English). In: ESANN 2019 - Proceedings, 27th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN, 2019, pp. 643-648. Conference paper, published paper (Refereed).
Abstract [en]

Real-Time Bidding (RTB) is one of the most popular online advertisement selling mechanisms. Modeling the highly dynamic bidding environment is crucial for making good bids. Market prices of auctions fluctuate heavily within short time spans. State-of-the-art methods neglect the temporal dependencies of bidders’ behaviors. In this paper, the bid requests are aggregated by time and the mean market price per aggregated segment is modeled as a time series. We show that the Long Short-Term Memory (LSTM) neural network outperforms state-of-the-art univariate time series models by capturing the nonlinear temporal dependencies in the market price. We further improve the prediction performance by adding a summary of exogenous features from bid requests.
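A minimal sketch of the univariate setup described above, assuming a Keras LSTM, a window length of 16 segments and random placeholder data; the exogenous bid-request features mentioned in the abstract would enter as extra channels in the input window.

```python
# Hypothetical sketch: predict the next segment's mean market price from a
# sliding window of previous segments with an LSTM (Keras).
import numpy as np
import tensorflow as tf

def make_windows(series, window=16):
    # (samples, window, 1) inputs and next-step targets from a 1-D series
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X[..., None], y

prices = np.random.rand(1000).astype("float32")   # placeholder mean prices per segment
X, y = make_windows(prices)

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(X.shape[1], 1)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

next_price = model.predict(X[-1:], verbose=0)     # forecast for the next segment
```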

Place, publisher, year, edition, pages
ESANN, 2019
Keywords
Commerce, Machine learning, Time series, Dynamic biddings, Market price, Online advertisements, State of the art, State-of-the-art methods, Time series modeling, Time span, Univariate time series models, Long short-term memory
HSV category
Identifiers
urn:nbn:se:kth:diva-301579 (URN), 2-s2.0-85071306494 (Scopus ID)
Conference
27th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2019, 24 April 2019 through 26 April 2019
Note

Part of ISBN 9782875870650

QC 20210913

Available from: 2021-09-13. Created: 2021-09-13. Last updated: 2024-03-11. Bibliographically approved.
Bhatti, M. K., Oz, I., Amin, S., Mushtaq, M., Farooq, U., Popov, K. & Brorsson, M. (2018). Locality-aware task scheduling for homogeneous parallel computing systems. Computing, 100(6), 557-595
2018 (English). In: Computing, ISSN 0010-485X, E-ISSN 1436-5057, Vol. 100, No. 6, pp. 557-595. Journal article (Refereed). Published.
Abstract [en]

In systems with a complex many-core cache hierarchy, exploiting data locality can significantly reduce the execution time and energy consumption of parallel applications. Locality can be exploited at various hardware and software layers. For instance, by implementing private and shared caches in a multi-level fashion, recent hardware designs are already optimised for locality. However, this would all be useless if the software scheduling does not cast the execution in a manner that promotes the locality available in the programs themselves. Since programs for parallel systems consist of tasks executed simultaneously, task scheduling becomes crucial for performance in multi-level cache architectures. This paper presents a heuristic algorithm for homogeneous multi-core systems called locality-aware task scheduling (LeTS). The LeTS heuristic is a work-conserving algorithm that takes into account both locality and load balancing in order to reduce the execution time of target applications. The working principle of LeTS is based on two distinctive phases, namely the working task group formation phase (WTG-FP) and the working task group ordering phase (WTG-OP). The WTG-FP forms groups of tasks in order to capture data reuse across tasks, while the WTG-OP determines an optimal order of execution for task groups that minimizes the reuse distance of shared data between tasks. We have performed experiments using randomly generated task graphs by varying three major performance parameters, namely: (1) communication-to-computation ratio (CCR) between 0.1 and 1.0, (2) application size, i.e., task graphs comprising 50, 100, and 300 tasks per graph, and (3) number of cores, with 2-, 4-, 8-, and 16-core execution scenarios. We have also performed experiments using selected real-world applications. The LeTS heuristic reduces the overall execution time of applications by exploiting inter-task data locality. Results show that LeTS outperforms state-of-the-art algorithms in amortizing inter-task communication cost.
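The published LeTS algorithm operates on task graphs with communication costs; the snippet below is only a simplified Python illustration of the two phases named in the abstract, with made-up task-to-data-block mappings and a greedy ordering rule standing in for the actual heuristics.

```python
# Simplified illustration of the two phases: group tasks that touch the same
# data blocks (WTG formation), then order the groups so that consecutive
# groups share data (WTG ordering). Not the published algorithm.
tasks = {  # hypothetical task -> accessed data blocks
    "t0": {"A"}, "t1": {"A", "B"}, "t2": {"B"},
    "t3": {"C"}, "t4": {"C", "D"}, "t5": {"E"},
}

def form_groups(tasks):
    """Greedy WTG formation: attach a task to the first group it shares data with."""
    groups = []
    for name, blocks in tasks.items():
        for g in groups:
            if g["blocks"] & blocks:
                g["tasks"].append(name)
                g["blocks"] |= blocks
                break
        else:
            groups.append({"tasks": [name], "blocks": set(blocks)})
    return groups

def order_groups(groups):
    """Greedy WTG ordering: pick next the group that overlaps most with the last one."""
    ordered = [groups.pop(0)]
    while groups:
        nxt = max(groups, key=lambda g: len(g["blocks"] & ordered[-1]["blocks"]))
        groups.remove(nxt)
        ordered.append(nxt)
    return ordered

for g in order_groups(form_groups(tasks)):
    print(g["tasks"], sorted(g["blocks"]))
```

A full implementation would also merge groups transitively and interleave the ordering with load balancing across cores, as the work-conserving property requires.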

Place, publisher, year, edition, pages
Springer, 2018
Keywords
Runtime resource management, Parallel computing, Multicore scheduling, Homogeneous systems, Directed acyclic graph (DAG), Embedded systems
HSV category
Identifiers
urn:nbn:se:kth:diva-230491 (URN), 10.1007/s00607-017-0581-6 (DOI), 000432601500001 (ISI), 2-s2.0-85032798462 (Scopus ID)
Note

QC 20180614

Available from: 2018-06-14. Created: 2018-06-14. Last updated: 2022-06-26. Bibliographically approved.
Javed Awan, A., Ohara, M., Ayguade, E., Ishizaki, K., Brorsson, M. & Vlassov, V. (2017). Identifying the potential of Near Data Processing for Apache Spark. In: Proceedings of the International Symposium on Memory Systems, MEMSYS 2017. Paper presented at Proceedings of the International Symposium on Memory Systems, MEMSYS 2017, Alexandria, VA, USA, October 02-05, 2017 (pp. 60-67). Association for Computing Machinery (ACM), Article ID F131197.
2017 (English). In: Proceedings of the International Symposium on Memory Systems, MEMSYS 2017, Association for Computing Machinery (ACM), 2017, pp. 60-67, Article ID F131197. Conference paper, published paper (Refereed).
Abstract [en]

While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics by being a unified framework for both batch and stream data processing. There is also a renewed interest in Near Data Processing (NDP) due to technological advancement in the last decade. However, it is not known whether NDP architectures can improve the performance of big data processing frameworks such as Apache Spark. In this paper, we build the case for an NDP architecture comprising programmable-logic-based hybrid 2D integrated processing-in-memory and in-storage processing for Apache Spark, through extensive profiling of Apache Spark based workloads on an Ivy Bridge server.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2017
Keywords
Processing-in-memory, In-storage Processing, Apache Spark
HSV category
Research programme
Information and Communication Technology
Identifiers
urn:nbn:se:kth:diva-211727 (URN), 10.1145/3132402.3132427 (DOI), 000557248700006 (ISI), 2-s2.0-85033586379 (Scopus ID)
Conference
Proceedings of the International Symposium on Memory Systems, MEMSYS 2017, Alexandria, VA, USA, October 02 - 05, 2017
Note

ISBN for proceedings: 9781450353359

QC 20171124

QC 20210518

Available from: 2017-08-11. Created: 2017-08-11. Last updated: 2023-03-06. Bibliographically approved.
Du, M., Sassioui, R., Varisteas, G., State, R., Brorsson, M. & Cherkaoui, O. (2017). Improving real-time bidding using a constrained Markov decision process. In: 13th International Conference on Advanced Data Mining and Applications, ADMA 2017. Paper presented at 13th International Conference on Advanced Data Mining and Applications, ADMA 2017, Singapore, 5 November 2017 through 6 November 2017 (pp. 711-726). Springer, 10604
2017 (English). In: 13th International Conference on Advanced Data Mining and Applications, ADMA 2017, Springer, 2017, Vol. 10604, pp. 711-726. Conference paper, published paper (Refereed).
Abstract [en]

Online advertising is increasingly switching to real-time bidding on advertisement inventory, in which the ad slots are sold through real-time auctions upon users visiting websites or using mobile apps. To compete with unknown bidders in such a highly stochastic environment, each bidder is required to estimate the value of each impression and to set a competitive bid price. Previous bidding algorithms have done so without considering the constraint of budget limits, which we address in this paper. We model the bidding process as a reinforcement learning framework based on a Constrained Markov Decision Process. Our model uses the predicted click-through rate as the state, the bid price as the action, and ad clicks as the reward. We propose a bidding function which outperforms the state-of-the-art bidding functions in terms of the number of clicks when the budget limit is low. We further simulate different bidding functions competing in the same environment and report the performance of the bidding strategies when required to adapt to a dynamic environment.
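The paper solves a constrained MDP; the toy loop below only illustrates the setting it formalises (state = predicted click-through rate, action = bid price, reward = clicks, hard budget limit) using a linear bid with naive budget pacing. All prices, rates and distributions are made up.

```python
# Toy RTB simulation: linear bidding scaled by predicted CTR with simple
# budget pacing. Illustrates the state/action/reward setting only; it is
# not the constrained-MDP policy proposed in the paper.
import random

BUDGET, BASE_BID, AVG_CTR = 1000.0, 2.0, 0.001

def bid(pred_ctr, spent):
    pacing = max(0.0, (BUDGET - spent) / BUDGET)   # bid less as the budget drains
    return BASE_BID * (pred_ctr / AVG_CTR) * pacing

spent, clicks = 0.0, 0
for _ in range(10_000):
    pred_ctr = random.betavariate(1, 800)          # placeholder CTR prediction
    market_price = random.expovariate(1 / 1.5)     # placeholder winning price
    price = bid(pred_ctr, spent)
    if price >= market_price and spent + market_price <= BUDGET:
        spent += market_price                      # pay the market (second) price
        clicks += random.random() < pred_ctr       # click drawn from predicted CTR
print(f"spent={spent:.1f} clicks={clicks}")
```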

Place, publisher, year, edition, pages
Springer, 2017
Series
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), ISSN 0302-9743; 10604
Keywords
Display Advertising, Markov Decision Process, Real-time bidding, Reinforcement Learning
HSV category
Identifiers
urn:nbn:se:kth:diva-218314 (URN), 10.1007/978-3-319-69179-4_50 (DOI), 000449973300050 (ISI), 2-s2.0-85033689734 (Scopus ID), 9783319691787 (ISBN)
Conference
13th International Conference on Advanced Data Mining and Applications, ADMA 2017, Singapore, 5 November 2017 through 6 November 2017
Note

QC 20171127

Available from: 2017-11-27. Created: 2017-11-27. Last updated: 2022-06-26. Bibliographically approved.
Aldinucci, M., Brorsson, M., D'Agostino, D., Daneshtalab, M., Kilpatrick, P. & Leppanen, V. (2017). Preface. The international journal of high performance computing applications, 31(3), 179-180
2017 (English). In: The International Journal of High Performance Computing Applications, ISSN 1094-3420, E-ISSN 1741-2846, Vol. 31, No. 3, pp. 179-180. Journal article, editorial material (Refereed). Published.
Place, publisher, year, edition, pages
Sage Publications, 2017
HSV category
Identifiers
urn:nbn:se:kth:diva-208829 (URN), 10.1177/1094342016680468 (DOI), 000401791100001 (ISI), 2-s2.0-85019927673 (Scopus ID)
Note

QC 20170614

Available from: 2017-06-14. Created: 2017-06-14. Last updated: 2024-03-15. Bibliographically approved.
Podobas, A. & Brorsson, M. (2016). Empowering OpenMP with Automatically Generated Hardware. In: International Conference on Embedded Computer Systems: Architectures, MOdeling and Simulation. Paper presented at SAMOS XVI.
2016 (English). In: International Conference on Embedded Computer Systems: Architectures, MOdeling and Simulation, 2016. Conference paper, published paper (Refereed).
Abstract [en]

OpenMP enables productive software development that targets shared-memory general-purpose systems. However, OpenMP compilers today have little support for future heterogeneous systems, which will more than likely contain Field Programmable Gate Arrays (FPGAs) to compensate for the lack of parallelism available in general-purpose systems. We have designed a high-level synthesis flow that automatically generates parallel hardware from unmodified OpenMP programs. The generated hardware is composed of accelerators tailored to act as hardware instances of the OpenMP task primitive. We drive the decision making of complex details within accelerators through a constraint-programming model, minimizing the expected input from the (often) hardware-oblivious software developer. We evaluate our system against two state-of-the-art architectures, the Xeon Phi and the AMD Opteron, and find our accelerators to perform on par with these two ASIC processors.

Keywords
OpenMP, FPGA, High-Level Synthesis, Tasks, Reconfigurably
HSV category
Research programme
Computer Science
Identifiers
urn:nbn:se:kth:diva-193705 (URN), 10.1109/SAMOS.2016.7818354 (DOI), 000399143000032 (ISI), 2-s2.0-85013823668 (Scopus ID)
Conference
SAMOS XVI
Note

QC 20161010

Available from: 2016-10-10. Created: 2016-10-10. Last updated: 2024-03-18. Bibliographically approved.
Muddukrishna, A., Jonsson, P. A., Podobas, A. & Brorsson, M. (2016). Grain Graphs: OpenMP Performance Analysis Made Easy. Paper presented at 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP'16). Association for Computing Machinery (ACM)
2016 (English). Conference paper, published paper (Refereed).
Abstract [en]

Average programmers struggle to solve performance problems in OpenMP programs with tasks and parallel for-loops. Existing performance analysis tools visualize OpenMP task performance from the runtime system's perspective, where task execution is interleaved with other tasks in an unpredictable order. Problems with OpenMP parallel for-loops are similarly difficult to resolve since tools only visualize aggregate thread-level statistics such as load imbalance without zooming into a per-chunk granularity. This runtime-system- and thread-oriented visualization provides poor support for understanding problems with task and chunk execution time, parallelism, and memory hierarchy utilization, forcing average programmers to rely on experts or use tedious trial-and-error tuning methods for performance. We present grain graphs, a new OpenMP performance analysis method that visualizes grains - the computation performed by a task or a parallel for-loop chunk instance - and highlights problems such as low parallelism, work inflation and poor parallelization benefit at the grain level. We demonstrate that grain graphs can quickly reveal performance problems that are difficult to detect and characterize in fine detail using existing visualizations in standard OpenMP programs, simplifying OpenMP performance analysis. This enables average programmers to make portable optimizations for poorly performing OpenMP programs, reducing pressure on experts and removing the need for tedious trial-and-error tuning.
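As a rough sketch of the per-grain view (not the grain-graph tool itself), one can take a trace of grains, each being a task or for-loop chunk instance with start and end timestamps, compute total work and achieved parallelism, and flag grains whose duration is an outlier. The trace format and thresholds below are assumptions.

```python
# Hypothetical per-grain analysis: total work, achieved parallelism over the
# wall-clock span, and a crude flag for suspiciously long grains.
from dataclasses import dataclass
from statistics import median

@dataclass
class Grain:
    name: str
    start: float  # seconds
    end: float

    @property
    def duration(self):
        return self.end - self.start

grains = [Grain("task:0", 0.00, 0.40), Grain("chunk:0", 0.05, 0.25),
          Grain("chunk:1", 0.05, 0.95), Grain("task:1", 0.45, 0.60)]

work = sum(g.duration for g in grains)
span = max(g.end for g in grains) - min(g.start for g in grains)
print(f"achieved parallelism ~ {work / span:.2f}")

m = median(g.duration for g in grains)
for g in grains:
    if g.duration > 2 * m:   # crude indicator of imbalance / work inflation
        print(f"long grain {g.name}: {g.duration:.2f}s (median {m:.2f}s)")
```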

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2016
Keywords
OpenMP, Performance Analysis, Parallel Programming
HSV category
Identifiers
urn:nbn:se:kth:diva-179668 (URN), 10.1145/2851141.2851156 (DOI), 000393580200029 (ISI), 2-s2.0-84963732767 (Scopus ID)
Conference
21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP'16)
Note

QC 20170313

Available from: 2015-12-18. Created: 2015-12-18. Last updated: 2024-03-18. Bibliographically approved.
Awan, A. J., Brorsson, M., Vlassov, V. & Ayguade, E. (2016). Micro-architectural Characterization of Apache Spark on Batch and Stream Processing Workloads. Paper presented at The 6th IEEE International Conference on Big Data and Cloud Computing (pp. 59-66). IEEE
2016 (English). Conference paper, published paper (Refereed).
Abstract [en]

While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics by being a unified framework for both batch and stream data processing. However, recent studies on the micro-architectural characterization of in-memory data analytics are limited to batch processing workloads only. We compare the micro-architectural performance of batch processing and stream processing workloads in Apache Spark using hardware performance counters on a dual-socket server. In our evaluation experiments, we have found that batch processing and stream processing have the same micro-architectural behavior in Spark if the only difference between the two implementations is micro-batching. If the input data rates are small, stream processing workloads are front-end bound. However, the front-end bound stalls are reduced at larger input data rates and instruction retirement is improved. Moreover, Spark workloads using DataFrames have improved instruction retirement over workloads using RDDs.
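A minimal sketch of counter collection, assuming Linux `perf stat` and its CSV (-x) output layout; the events, parsing, and derived metric (IPC only) are placeholders for the much richer counter set a study like this would use.

```python
# Hypothetical helper: run a workload under `perf stat` and derive IPC.
# Assumes Linux perf with CSV output (-x ,); counters are written to stderr.
import subprocess

def ipc(cmd):
    res = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", "cycles,instructions"] + cmd,
        capture_output=True, text=True,
    )
    counts = {}
    for line in res.stderr.splitlines():
        fields = line.split(",")
        if len(fields) >= 3 and fields[0].strip().isdigit():
            # layout assumed: value, unit, event name, ...
            counts[fields[2].split(":")[0]] = int(fields[0])
    return counts["instructions"] / counts["cycles"]

# Example with a placeholder workload:
# print(ipc(["sleep", "1"]))
```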

Place, publisher, year, edition, pages
IEEE, 2016
Keywords
Microarchitectural Performance, Spark Streaming, Workload Characterization
HSV category
Research programme
Information and Communication Technology
Identifiers
urn:nbn:se:kth:diva-196123 (URN), 10.1109/BDCloud-SocialCom-SustainCom.2016.20 (DOI), 000392516300009 (ISI), 2-s2.0-85000885440 (Scopus ID)
Conference
The 6th IEEE International Conference on Big Data and Cloud Computing
Note

QC 20161130

Available from: 2016-11-11. Created: 2016-11-11. Last updated: 2024-03-15. Bibliographically approved.
Awan, A. J., Brorsson, M., Vlassov, V. & Ayguade, E. (2016). Node architecture implications for in-memory data analytics on scale-in clusters. Paper presented at 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (pp. 237-246). IEEE Press
2016 (English). Conference paper, published paper (Refereed).
Abstract [en]

While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has managed to be at the forefront of big data analytics. Recent studies propose scale-in clusters with in-storage processing devices to run big data analytics with Spark. However, the proposal is based solely on the memory bandwidth characterization of in-memory data analytics and does not shed light on the specification of the host CPU and memory. Through empirical evaluation of in-memory data analytics with Apache Spark on an Ivy Bridge dual-socket server, we have found that (i) simultaneous multi-threading is effective up to 6 cores, (ii) data locality on NUMA nodes can improve the performance by 10% on average, (iii) disabling next-line L1-D prefetchers can reduce the execution time by up to 14%, (iv) DDR3 operating at 1333 MT/s is sufficient, and (v) multiple small executors can provide up to 36% speedup over a single large executor.
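The last point, several small executors outperforming one large executor, maps onto ordinary Spark executor sizing. A hedged pyspark sketch of how such a layout might be expressed (the instance, core, and memory numbers are illustrative, not the paper's):

```python
# Hypothetical pyspark configuration: several smaller executors per node
# instead of one large executor occupying all cores.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .setAppName("executor-sizing-sketch")
    .set("spark.executor.instances", "4")   # several small executors ...
    .set("spark.executor.cores", "6")       # ... rather than one 24-core executor
    .set("spark.executor.memory", "16g")
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
print(spark.sparkContext.getConf().get("spark.executor.cores"))
```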

Place, publisher, year, edition, pages
IEEE Press, 2016
HSV category
Identifiers
urn:nbn:se:kth:diva-198161 (URN), 10.1145/3006299.3006319 (DOI), 000408919800026 (ISI), 2-s2.0-85013223047 (Scopus ID)
Conference
3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies
Note

QC 20161219

Available from: 2016-12-13. Created: 2016-12-13. Last updated: 2024-03-15. Bibliographically approved.
Organisations
Identifiers
ORCID iD: orcid.org/0000-0002-9637-2065