Power and Performance Optimization for Network-on-Chip based Many-Core Processors
KTH, School of Electrical Engineering and Computer Science (EECS).
ORCID iD: 0000-0001-9448-5595
2019 (English). Doctoral thesis, monograph (Other academic)
Abstract [en]

Network-on-Chip (NoC) is emerging as a critical shared architecture for CMPs (Chip Multi-/Many-Core Processors) running parallel and concurrent applications. As core counts scale up and transistor sizes shrink, optimizing NoC power and performance poses new research challenges.

Because the NoC can potentially consume 20-40% of total chip power, NoC power efficiency has emerged as one of the main design constraints in current and future high-performance CMPs. For NoC power management, we propose a novel on-chip DVFS (dynamic voltage and frequency scaling) technique that adjusts the per-region NoC voltage/frequency (V/F) according to the V/F levels voted for by communicating threads. Each thread periodically votes for the NoC V/F level that best suits its own performance interests. A region DVFS controller then sets the final V/F level of each region democratically, based on the majority of the votes it receives.
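
The per-region majority vote described above can be illustrated with a minimal sketch. It is not the thesis's controller: the function name `decide_region_vf`, the level names, the tie-break toward the higher level, and the default when no votes arrive are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical V/F levels, ordered from lowest to highest performance.
LEVELS = ["low", "mid", "high"]


def decide_region_vf(votes, levels=LEVELS):
    """Pick the V/F level that received the most votes in one region.

    Assumptions (not taken from the thesis):
    - ties are broken in favor of the higher-performance level;
    - a region that receives no votes stays at the highest level.
    """
    if not votes:
        return levels[-1]
    counts = Counter(votes)
    top = max(counts.values())
    tied = [lvl for lvl, n in counts.items() if n == top]
    # levels.index raises ValueError for an unknown level, which is intended.
    return max(tied, key=levels.index)


# One voting period: three threads in the region cast their votes.
region_level = decide_region_vf(["low", "high", "high"])
```

A real controller would also need a voting period and a mapping from threads to regions, neither of which is specified in the abstract.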

Mutually exclusive locks are pervasive shared-memory synchronization primitives. For advanced locks such as the Linux queue spinlock, which comprises a low-overhead spinning phase and a high-overhead sleeping phase, we show that the lock primitive may create very high competition overhead (COH), which is the time threads spend competing with each other for the next critical section grant. To enhance performance, we propose a software-hardware cooperative mechanism that opportunistically maximizes the chance of a thread winning the critical section in the low-overhead spinning phase and minimizes the chance of winning it in the high-overhead sleeping phase, so that COH is significantly reduced. We further observe that the cache invalidation-acknowledgement round-trip delay between the home node storing the critical section lock and the cores running the threads competing for that lock can heavily degrade application performance. To reduce this high lock coherence overhead (LCO), we propose in-network packet generation (iNPG), which turns passive "normal" NoC routers into active "big" ones that can not only transmit but also generate packets to perform early invalidation and collect invalidation acknowledgements (inv-acks). iNPG effectively shortens the protocol round-trip delay and thus largely reduces LCO in various locking primitives.

To enhance performance fairness when running multiple multi-threaded programs on a single CMP, we develop the concept of an aggregate flow, which refers to a sequence of associated data and cache coherence flows issued from the same thread. Based on the aggregate flow concept, we propose three coherent mechanisms to efficiently achieve performance isolation: rate profiling, rate inheritance and flow arbitration. Rate profiling dynamically characterizes thread performance and communication needs. Rate inheritance allows a data or coherence reply flow to inherit the characteristics of its associated data or coherence request flow, so that a consistent bandwidth allocation policy is applied to all sub-flows of the same aggregate flow. Flow arbitration uses a proven scheduling policy, self-clocked fair queueing (SCFQ), to achieve rate-proportional arbitration among different aggregate flows. Our approach successfully achieves balanced performance isolation with different mixtures of applications.
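
As a rough illustration of rate-proportional arbitration with SCFQ, the sketch below tags each packet with a finish time of max(previous finish tag of the flow, current virtual time) + length / rate, and serves packets in increasing tag order. The class name `SCFQArbiter`, the flow names and the rates are hypothetical, and this software model only shows the scheduling rule, not the router-level design in the thesis.

```python
import heapq


class SCFQArbiter:
    """Simplified self-clocked fair queueing over named flows."""

    def __init__(self, rates):
        # rates: flow name -> allocated rate (hypothetical, e.g. from rate profiling)
        self.rates = dict(rates)
        self.last_finish = {flow: 0.0 for flow in self.rates}
        self.virtual_time = 0.0  # finish tag of the packet most recently put in service
        self._heap = []
        self._seq = 0  # FIFO tie-break for equal finish tags

    def enqueue(self, flow, length=1):
        start = max(self.last_finish[flow], self.virtual_time)
        finish = start + length / self.rates[flow]
        self.last_finish[flow] = finish
        heapq.heappush(self._heap, (finish, self._seq, flow))
        self._seq += 1

    def dequeue(self):
        """Return the flow whose packet is served next, or None if idle."""
        if not self._heap:
            return None
        finish, _, flow = heapq.heappop(self._heap)
        self.virtual_time = finish  # the "self-clocked" step
        return flow


# Flow A is allocated twice the rate of flow B.
arbiter = SCFQArbiter({"A": 2.0, "B": 1.0})
for _ in range(4):
    arbiter.enqueue("A")
for _ in range(4):
    arbiter.enqueue("B")
served = [arbiter.dequeue() for _ in range(6)]
```

With both flows backlogged, flow A is granted about twice as many slots as flow B, which is the rate-proportional behavior the paragraph above refers to.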

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2019. p. 151
Series
TRITA-EECS-AVL ; 44
Keywords [en]
Many-Core Processor; Network-on-Chip; Performance; Power Management; DVFS; Shared Memory Synchronization; Hardware/Software Co-Design; Cache Coherency; Performance Isolation
National Category
Computer Systems
Research subject
Information and Communication Technology
Identifiers
URN: urn:nbn:se:kth:diva-252326
ISBN: 978-91-7873-182-4 (print)
OAI: oai:DiVA.org:kth-252326
DiVA, id: diva2:1318308
Public defence
2019-08-23, Sal B, Electrum 229, Kista, 16440, Sweden, Kista, 09:00 (English)
Opponent
Supervisors
Note

QC 20190528

Available from: 2019-05-28. Created: 2019-05-27. Last updated: 2019-05-28. Bibliographically approved

Open Access in DiVA

fulltext (8397 kB), 59 downloads
File information
File name: FULLTEXT01.pdf
File size: 8397 kB
Checksum SHA-512: 9aa93cac0735c633ec809b53a893ac2a64e7060eca633424c349e752700c1a74b20c4f439a2cdc087900a16d52553ace790c9a7677df4570f551270a02ea389b
Type: fulltext
Mimetype: application/pdf

Authority records BETA

Yao, Yuan
