Locality-aware Scheduling and Characterization of Task-based Programs
KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS. ORCID iD: 0000-0003-3958-4659
2014 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

Modern computer architectures expose an increasing number of parallel features supported by complex memory access and communication structures. Currently used task scheduling techniques perform poorly since they focus solely on balancing computation load across parallel features and remain oblivious to locality properties of support structures. We contribute locality-aware task scheduling mechanisms that improve execution time by 44% and 11% on average, respectively, on two locality-sensitive architectures: the Tilera TILEPro64 manycore processor and a four-socket SMP machine based on the AMD Opteron 6172 processor.

Programmers need task performance metrics such as amount of task parallelism and task memory hierarchy utilization to analyze performance of task-based programs. However, existing tools indicate performance mainly using thread-centric metrics. Programmers therefore resort to low-level and tedious thread-centric analysis methods to infer task performance. We contribute tools and methods to characterize task-based OpenMP programs at the level of tasks. With them, programmers can quickly understand important properties of the task graph such as critical path and parallelism, as well as properties of individual tasks such as instruction count and memory behavior.
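The task graph properties named in the abstract can be illustrated with a minimal sketch. It is not the thesis's tooling: the function name `characterize`, the task names, and the per-task costs are hypothetical, and costs stand in for a per-task metric such as instruction count. The sketch computes total work, critical path length, and their ratio, which is one common measure of average parallelism.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def characterize(costs, deps):
    """Return (total_work, critical_path_length, average_parallelism).

    costs: task name -> cost of the task (hypothetical values).
    deps:  task name -> set of tasks that must finish before it starts.
    """
    graph = {task: deps.get(task, set()) for task in costs}
    finish = {}
    # static_order() yields each task only after all of its predecessors.
    for task in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in graph[task]), default=0)
        finish[task] = start + costs[task]
    work = sum(costs.values())
    span = max(finish.values())  # length of the longest dependence chain
    return work, span, work / span


# A small diamond-shaped graph: a -> {b, c} -> d.
costs = {"a": 2, "b": 3, "c": 4, "d": 1}
deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
work, span, parallelism = characterize(costs, deps)
print(work, span, round(parallelism, 2))
```

A ratio close to 1 suggests the program is dominated by its critical path, while a larger ratio suggests more tasks could run concurrently.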

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2014. vi, 29 p.
Series
TRITA-ICT-ECS AVH, ISSN 1653-6363 ; 14:01
Keyword [en]
Locality-aware, Task scheduling, OpenMP
National Category
Computer Systems
Research subject
Information and Communication Technology
Identifiers
URN: urn:nbn:se:kth:diva-141124
ISBN: 978-91-7501-994-9 (print)
OAI: oai:DiVA.org:kth-141124
DiVA: diva2:694751
Presentation
2014-03-05, Sal/Hall E, Forum, KTH-ICT, Isafjordsgatan 39, Kista, 10:00 (English)
Note

QC 20140212

Available from: 2014-02-12. Created: 2014-02-07. Last updated: 2014-02-12. Bibliographically approved.
List of papers
1. Task Scheduling on Manycore Processors with Home Caches
2013 (English). In: Euro-Par 2012 Workshops, 2013. Conference paper, Published paper (Refereed)
Abstract [en]

Modern manycore processors feature a highly scalable and software-configurable cache hierarchy. For performance, manycore programmers will not only have to efficiently utilize the large number of cores but also understand and configure the cache hierarchy to suit the application. Relief from this manycore programming nightmare can be provided by task-based programming models where programmers parallelize using tasks and an architecture-specific runtime system maps tasks to cores and in addition configures the cache hierarchy. In this paper, we focus on the cache hierarchy of the Tilera TILEPro64 processor which features a software-configurable coherence waypoint called the home cache. We first show the runtime system performance bottleneck of scheduling tasks oblivious to the nature of home caches. We then demonstrate a technique in which the runtime system controls the assignment of home caches to memory blocks and schedules tasks to minimize home cache access penalties. Test results of our technique have shown a significant execution time performance improvement on selected benchmarks leading to the conclusion that by taking processor architecture features into account, task-based programming models can indeed provide continued performance and allow programmers to smoothly transit from the multicore to manycore era.

Series
LNCS 7640
Keyword
Manycore processor, task scheduling, architecture-aware, runtime system
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-107418 (URN)
10.1007/978-3-642-36949-0_39 (DOI)
000341240400039 ()
2-s2.0-84874439716 (Scopus ID)
Conference
Parallel Processing Workshops, Euro-Par 2012: BDMC 2012, CGWS 2012, HeteroPar 2012, HiBB 2012, OMHI 2012, Paraphrase 2012, PROPER 2012, Resilience 2012, UCHPC 2012, VHPC 2012; Rhodes Island; Greece; 27 August 2012 through 31 August 2012
Projects
ENCORE EU project
Funder
Swedish e‐Science Research Center
ICT - The Next Generation
Note

QC 20130108

Available from: 2013-01-08. Created: 2012-12-11. Last updated: 2014-10-03. Bibliographically approved.
2. Locality-aware Task Scheduling and Data Distribution on NUMA Systems
2013 (English). In: OpenMP in the Era of Low Power Devices and Accelerators: 9th International Workshop on OpenMP, IWOMP 2013, Canberra, Australia, September 16-18, 2013 / [ed] Alistair P Rendell, Barbara M. Chapman, Matthias S. Müller, Springer Science+Business Media B.V., 2013. Conference paper, Published paper (Refereed)
Abstract [en]

Modern parallel computer systems exhibit Non-Uniform Memory Access (NUMA) behavior. For best performance, any parallel program therefore has to match data allocation and scheduling of computations to the memory architecture of the machine. When done manually, this becomes a tedious process and since each individual system has its own peculiarities this also leads to programs that are not performance-portable.

We propose the use of a data distribution scheme in which NUMA hardware peculiarities are abstracted away from the programmer and data distribution is delegated to a runtime system which is generated once for each machine. In addition we propose using task data dependence information now possible with the OpenMP 4.0RC2 proposal to guide the scheduling of OpenMP tasks to further reduce data stall times.

We demonstrate the viability and performance of our proposals on a four-socket AMD Opteron machine with eight NUMA nodes. We identify that both data distribution and locality-aware task scheduling improve performance compared to default policies while still providing an architecture-oblivious approach for the programmer.
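As a rough illustration of how task data dependence information could guide placement, the sketch below picks, for each task, the NUMA node that already holds most of the bytes the task touches. This is an assumed heuristic for illustration only and not the scheduler described in the paper; `pick_node`, `home_node`, and the block names and sizes are all hypothetical.

```python
from collections import Counter


def pick_node(task_deps, home_node, default_node=0):
    """Choose the NUMA node holding the most bytes of a task's data.

    task_deps: list of (block_id, size_in_bytes) the task reads or writes.
    home_node: block_id -> NUMA node on which the block was allocated.
    """
    bytes_per_node = Counter()
    for block, size in task_deps:
        bytes_per_node[home_node[block]] += size
    if not bytes_per_node:
        # A task without data dependences has no locality preference.
        return default_node
    # Most bytes wins; ties go to the lowest node number for determinism.
    return min(bytes_per_node, key=lambda node: (-bytes_per_node[node], node))


# Hypothetical placement of three blocks across an eight-node machine.
home_node = {"A": 0, "B": 3, "C": 3}
chosen = pick_node([("A", 4096), ("B", 2048), ("C", 4096)], home_node)
print(chosen)
```

A real runtime would also have to weigh load balance against locality, for example by letting idle nodes steal tasks; the sketch deliberately ignores that trade-off.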

Place, publisher, year, edition, pages
Springer Science+Business Media B.V., 2013
Series
Lecture Notes in Computer Science, 8122
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-124881 (URN)
10.1007/978-3-642-40698-0_12 (DOI)
2-s2.0-84883296523 (Scopus ID)
978-3-642-40697-3 (ISBN)
Conference
International Workshop on OpenMP (IWOMP), September 16-18, 2013, Canberra, Australia
Note

QC 20130924

Available from: 2013-08-01. Created: 2013-08-01. Last updated: 2014-02-12. Bibliographically approved.
3. Characterizing task-based OpenMP programs
2015 (English). In: PLoS ONE, ISSN 1932-6203, E-ISSN 1932-6203, Vol. 10, no 4, e0123545. Article in journal (Refereed), Published
Abstract [en]

Programmers struggle to understand performance of task-based OpenMP programs since profiling tools only report thread-based performance. Performance tuning also requires task-based performance in order to balance per-task memory hierarchy utilization against exposed task parallelism. We provide a cost-effective method to extract detailed task-based performance information from OpenMP programs. We demonstrate the utility of our method by quickly diagnosing performance problems and characterizing exposed task parallelism and per-task instruction profiles of benchmarks in the widely-used Barcelona OpenMP Tasks Suite. By using our method to characterize task-based performance, programmers can tune performance faster and understand performance tradeoffs more effectively than with existing tools.

Keyword
Scheduling Strategies, Performance Analysis, Benchmark
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-141201 (URN)
10.1371/journal.pone.0123545 (DOI)
000352590300104 ()
25860023 (PubMedID)
2-s2.0-84929498034 (Scopus ID)
Note

QC 20150623. Updated from "Manuscript" to "Article".

Available from: 2014-02-12. Created: 2014-02-12. Last updated: 2017-12-06. Bibliographically approved.

Open Access in DiVA

lic-thesis-AM (1363 kB), 1430 downloads
File information
File name: FULLTEXT01.pdf
File size: 1363 kB
Checksum (SHA-512): 8a32dce3222d743b29f83c16647ed506ee30fe14aa0729e0f0de6907b6208df101e29e8308209c46cad169167a0f66e334903f88a38196c6aaf8a168a6b4b800
Type: fulltext
Mimetype: application/pdf

Author
Muddukrishna, Ananya
