Ultra-low-power Design and Implementation of Application-specific Instruction-set Processors for Ubiquitous Sensing and Computing
KTH, School of Information and Communication Technology (ICT), Industrial and Medical Electronics. KTH, School of Information and Communication Technology (ICT), Centres, VinnExcellence Center for Intelligence in Paper and Packaging, iPACK. ORCID iD: 0000-0002-7589-9749
2015 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Transistor feature sizes keep shrinking as technology develops, which enables ubiquitous sensing and computing. However, with the breakdown of Dennard scaling, caused by the difficulty of further lowering the supply voltage, power density increases significantly. Consequently, for a given power budget, the energy efficiency of hardware resources must be improved to maximize performance. Application-specific integrated circuits (ASICs) obtain high energy efficiency at the cost of low flexibility across applications, while general-purpose processors (GPPs) gain generality at the expense of efficiency.

To provide both high energy efficiency and flexibility, this dissertation explores the ultra-low-power design of application-specific instruction-set processors (ASIPs) for ubiquitous sensing and computing. Two application scenarios are implemented in the proposed ASIPs: high-throughput, compute-intensive processing for multimedia, and low-throughput, low-cost processing for the Internet of Things (IoT).

Multimedia stream processing for human-computer interaction typically involves high data throughput. To design processors for networked multimedia streams, application-specific accelerators controlled by an embedded processor are customized. By abstracting the common features of multiple coding algorithms, video decoding accelerators are implemented for networked multi-standard multimedia stream processing. Fabricated in 0.13 $\mu$m CMOS technology, the processor running at 216 MHz is capable of decoding real-time high-definition video streams with a power consumption of 414 mW.

When even higher throughput is required, such as in multi-view video coding applications, multiple customized processors are connected by an on-chip network. The design problems studied further include selecting the capability of a single processor, the number of processors, the capacity of the communication network, and the task assignment scheme.

In the IoT scenario, low processing throughput but high energy efficiency and adaptability are demanded for a wide spectrum of devices. In this case, a tile processor including a multi-mode router and dual cores is proposed and implemented. The multi-mode router supports both circuit and wormhole switching to facilitate inter-silicon extension for providing on-demand performance. The control-centric dual-core architecture uses control words to directly manipulate all hardware resources. Such a mechanism avoids introducing complex control logic, and the hardware utilization is increased. Programmable control words enable reconfigurability of the processor for supporting general-purpose ISAs, application-specific instructions and dedicated implementations. Reducing global data transfer also increases the energy efficiency. Finally, a single tile processor, together with a network of bare dies and a network of packaged chips, has been demonstrated. The processor is implemented in 65 nm low-leakage CMOS technology and achieves an energy efficiency of 101.4 GOPS/W for each core.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2015, xvi, 74 p.
Series
TRITA-ICT, ISSN 1653-6363 ; 15:11
National Category
Computer Systems
Identifiers
URN: urn:nbn:se:kth:diva-174896; ISBN: 978-91-7595-692-3 (print); OAI: oai:DiVA.org:kth-174896; DiVA: diva2:859802
Public defence
2015-11-04, Sal B, Electrum 229, Kista, 10:00 (English)
Opponent
Supervisors
Funder
VINNOVA
Note

QC 20151009

Available from: 2015-10-09 Created: 2015-10-08 Last updated: 2015-10-09 Bibliographically approved
List of papers
1. System design of full HD MVC decoding on mesh-based multicore NoCs
2011 (English) In: Microprocessors and microsystems, ISSN 0141-9331, E-ISSN 1872-9436, Vol. 35, no 2, 217-229 p. Article in journal (Refereed) Published
Abstract [en]

Future multimedia applications such as full HD (1920 x 1080) multiview video coding (MVC) present great challenges to computing architectures. Even with state-of-the-art ASIC technology, which can process single-view HD decoding, dealing with multiple views would require computation capacity that grows in proportion to the number of views, which is difficult to achieve. In this paper, we explore the system-level design space for full HD MVC applications mapped onto mesh-based multicore Network-on-Chip (NoC) architectures. To this end, we establish a simulation framework capable of simulating the combination of communication networks with computing cores. We investigate two task assignment schemes: picture-level assignment and view-level assignment. With eight-view MVC decoding, we explore the design options with respect to network size, single-core performance and link bandwidth under both task assignment schemes. Our studies show that, to achieve a certain decoding performance, the computation capability and communication capacity should be balanced in the system. Also, to realize eight-view HD decoding, the system requires only twice, or less than twice, the single-core processing capacity required by single-view decoding, thanks to the parallel computation and communication enabled by the multicore NoC architectures. Our results show the feasibility and potential of efficiently implementing full HD MVC decoding on multicore NoC architectures.

Keyword
Application-specific, Homogeneous NoC, Exploration framework, Full HD MVC decoding, Multicore architecture, Communication and computation
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-32609 (URN); 10.1016/j.micpro.2010.10.003 (DOI); 000288729800012 (); 2-s2.0-79951873738 (Scopus ID)
Note
QC 20110419
Available from: 2011-04-19 Created: 2011-04-18 Last updated: 2017-12-11 Bibliographically approved
2. Design and Implementation of Multi-mode Routers for Large-scale Inter-core Networks
2016 (English) In: Integration, ISSN 0167-9260, E-ISSN 1872-7522, Vol. 53, 1-13 p. Article in journal (Other academic) Published
Abstract [en]

Constructing on-chip or inter-silicon (inter-die/inter-chip) networks to connect multiple processors extends the system capability and scalability. A key issue is to implement a flexible router that can fit into various application scenarios. This paper proposes a multi-mode adaptable router that supports both circuit and wormhole switching and supplies flexible working strategies for specific traffic patterns in diverse applications. The limitation of mono-mode switched routers is shown first, followed by algorithm exploration in the proposed router for choosing the proper working strategy in a specific network. We then present the performance improvement when applying the mixed circuit/wormhole switching mode to different applications, and analyze image decoding as a case study. The multi-mode router has been implemented with different configurations in a 65 nm CMOS technology. The one with 8-bit flit width is demonstrated together with a multi-core processor to show the feasibility. Working at 350 MHz, the average power consumption of the whole system is 22 mW.

Place, publisher, year, edition, pages
Elsevier, 2016
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-169545 (URN); 10.1016/j.vlsi.2015.10.002 (DOI); 000373551600001 (); 2-s2.0-84960113542 (Scopus ID)
Note

QC 20160413

Available from: 2015-06-16 Created: 2015-06-16 Last updated: 2017-12-04 Bibliographically approved
3. A 5Mgate/414mW Networked Media SoC in 0.13um CMOS with 720p Multi-Standard Video Decoding
2009 (English) In: 2009 IEEE ASIAN SOLID-STATE CIRCUITS CONFERENCE (A-SSCC), IEEE Solid-State Circuits Society, 2009, 385-388 p. Conference paper, Published paper (Refereed)
Abstract [en]

A flexible and high-performance SoC is developed for networked media applications by integrating two RISC cores, an Ethernet network interface and a coarse-grained configurable video decoding unit. Real-time 1280x720@25fps MPEG-2/MPEG-4/RealVideo decoding is achieved for on-line video streams. The SoC is fabricated in 0.13 um single-poly eight-metal CMOS technology with a core size of 6.4 mm x 6.4 mm. To achieve a low-power design, a flexible power management strategy is implemented for dynamic control of computational capabilities under various workloads. The maximum power consumption is 414 mW at 1.2 V supply voltage with the corresponding system frequency of 216 MHz, when real-time HD (1280x720@25fps) video streams are decoded. When the SoC decodes real-time CIF (352x288@25fps) video streams, it requires a 27 MHz system frequency and consumes 95 mW.
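
The two operating points quoted in this abstract can be turned into per-frame figures with simple arithmetic. The sketch below is illustrative only and not taken from the paper: the function names are hypothetical, and it assumes the quoted power is drawn continuously while decoding at 25 fps (the 414 mW figure is stated as a maximum, so the HD result is an upper bound).

```python
# Back-of-envelope figures derived from the numbers quoted in the abstract.
# Assumption (not from the paper): the quoted power is sustained for the
# whole frame period at 25 fps.

def energy_per_frame_mj(power_mw: float, fps: float) -> float:
    """Energy per decoded frame in millijoules (mW divided by frames/s)."""
    return power_mw / fps

def cycles_per_pixel(freq_mhz: float, width: int, height: int, fps: float) -> float:
    """Clock cycles available per decoded pixel at the given frame rate."""
    return freq_mhz * 1e6 / (width * height * fps)

hd_mj = energy_per_frame_mj(414, 25)              # 1280x720 at 216 MHz
cif_mj = energy_per_frame_mj(95, 25)              # 352x288 at 27 MHz
hd_cpp = cycles_per_pixel(216, 1280, 720, 25)
cif_cpp = cycles_per_pixel(27, 352, 288, 25)

print(f"HD : {hd_mj:.2f} mJ/frame, {hd_cpp:.2f} cycles/pixel")
print(f"CIF: {cif_mj:.2f} mJ/frame, {cif_cpp:.2f} cycles/pixel")
```

Under these assumptions both operating points leave roughly ten clock cycles per pixel, which suggests the system frequency is scaled approximately with the pixel rate.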

Place, publisher, year, edition, pages
IEEE Solid-State Circuits Society, 2009
Series
IEEE Asian Solid-State Circuits Conference Proceedings of Technical Papers
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-62206 (URN); 10.1109/ASSCC.2009.5357177 (DOI); 000298194200097 (); 2-s2.0-76249116270 (Scopus ID); 978-1-4244-4434-2 (ISBN)
Conference
IEEE Asian Solid-State Circuits Conference (ASSCC), Taipei, TAIWAN, NOV 16-18, 2009
Note
QC 20120224
Available from: 2012-01-18 Created: 2012-01-18 Last updated: 2015-10-09 Bibliographically approved
4. An ASIC-Design-Based Configurable SOC Architecture for Networked Media
2008 (English) In: 2008 INTERNATIONAL SYMPOSIUM ON SYSTEM-ON-CHIP, PROCEEDINGS / [ed] Nurmi J, Takala J, Vainio O, NEW YORK: IEEE, 2008, 41-44 p. Conference paper, Published paper (Refereed)
Abstract [en]

An ASIC-design-based configurable SOC architecture that is high-performance, flexible, programmable, and compiler-independent is designed for networked media applications. A coarse-grained parallel computing mechanism is employed in this architecture. Mapping this architecture to a specific application is demonstrated through a multimedia application example. The design, consisting of two CPUs working at 81 MHz and five function units working at 40.5 MHz, is validated on a powerful FPGA.

Place, publisher, year, edition, pages
NEW YORK: IEEE, 2008
National Category
Electrical Engineering, Electronic Engineering, Information Engineering; Computer and Information Science
Identifiers
urn:nbn:se:kth:diva-31540 (URN); 10.1109/ISSOC.2008.4694877 (DOI); 000262647700010 (); 2-s2.0-67249088894 (Scopus ID); 978-1-4244-2541-9 (ISBN)
Conference
10th Annual International Symposium on System-on-Chip Tampere, FINLAND, 2008
Note
QC 20110405
Available from: 2011-04-05 Created: 2011-03-18 Last updated: 2015-10-09 Bibliographically approved
5. System-level exploration of mesh-based NoC architectures for multimedia applications
2010 (English) In: Proceedings - IEEE International SOC Conference, SOCC 2010, IEEE, 2010, 99-104 p. Conference paper, Published paper (Refereed)
Abstract [en]

This paper explores the design space of mesh-based NoC architectures for multimedia applications. Via system-level exploration, we intend to address one crucial question: given a multimedia application and node processing throughput, how to determine the optimal size of the mesh and link bandwidth? Equivalently, we shall answer another question: given a multimedia application and its mapping to a given-sized mesh, how to optimally dimension the node processing throughput and link bandwidth? Based on the process models of an application, we analyze the computation and communication demand when mapping the application onto a circuit-switched mesh NoC. We also propose a simulation approach to verify our analysis and further explore the system-level design decisions. Experiments and a case study on a multiview video coding application validate our approach.

Place, publisher, year, edition, pages
IEEE, 2010
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-62212 (URN); 10.1109/SOCC.2010.5784728 (DOI); 2-s2.0-79960726345 (Scopus ID); 978-142446683-2 (ISBN)
Conference
23rd IEEE International SOC Conference, SOCC 2010; Las Vegas, NV; 27 September 2010 through 29 September 2010
Note
QC 20120223
Available from: 2012-01-18 Created: 2012-01-18 Last updated: 2015-10-09 Bibliographically approved
6. Implementing MVC Decoding on Homogeneous NoCs: Circuit Switching or Wormhole Switching
2015 (English) Conference paper, Published paper (Refereed)
Abstract [en]

To implement multiview video decoding on network-on-chip (NoC) based homogeneous multicore architectures, the selection of switching techniques for routers is one of the most important aspects of design space exploration. Circuit switching and wormhole switching are the two most feasible switching techniques for on-chip networks. To choose the suitable switching technique, we compare circuit switching and wormhole switching in terms of whole-system decoding speed, link utilization and delay when implementing eight-view QVGA video decoding on 4 × 4 NoCs at 30 fps. The required link bandwidths are both around 800 Mbps, with similar network utilization and delay. We conclude that, to implement multiview video decoding on homogeneous NoCs, circuit switching is more suitable, considering its similar performance and lower cost compared with wormhole switching.
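
For a sense of scale, the decoded pixel rate of the workload named in this abstract can be estimated directly. The sketch below is an illustration, not the paper's traffic model: it assumes QVGA means 320 × 240 pixels and that frames are stored as 8-bit YUV 4:2:0 (12 bits per pixel), and the function name is hypothetical. The result is the aggregate output rate of all eight views. It is not the per-link requirement reported in the abstract, which also depends on reference-picture traffic and on how tasks are mapped to the mesh.

```python
# Rough estimate of the aggregate decoded video rate for the workload
# described in the abstract (eight QVGA views at 30 fps).
# Assumptions (not from the paper): 320x240 frames, 8-bit YUV 4:2:0,
# i.e. 12 bits per pixel on average.

def raw_video_rate_mbps(width: int, height: int, views: int, fps: float,
                        bits_per_pixel: int = 12) -> float:
    """Uncompressed output rate in Mbit/s summed over all views."""
    return width * height * bits_per_pixel * views * fps / 1e6

rate = raw_video_rate_mbps(320, 240, views=8, fps=30)
print(f"Aggregate decoded output: {rate:.3f} Mbit/s")
```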

Place, publisher, year, edition, pages
IEEE, 2015
Series
23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), ISSN 1066-6192
Keyword
multiprocessing systems;network-on-chip;video coding;MVC decoding;QVGA video decoding;circuit switching;design space exploration;homogeneous NoCs;multicore architectures;multiview video decoding;network on-chip;network utilization;onchip networks;switching techniques;wormhole switching;Decoding;Delays;Microprocessors;Switches;Switching circuits;Very large scale integration
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-169538 (URN); 10.1109/PDP.2015.48 (DOI); 000380471500058 (); 2-s2.0-84962826184 (Scopus ID)
External cooperation:
Conference
23rd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)
Note

QC 20150623

Available from: 2015-06-16 Created: 2015-06-16 Last updated: 2016-09-05 Bibliographically approved
7. A Hierarchical Reconfigurable Micro-coded Multi-core Processor for IoT Applications
2014 (English) In: 2014 9TH INTERNATIONAL SYMPOSIUM ON RECONFIGURABLE AND COMMUNICATION-CENTRIC SYSTEMS-ON-CHIP (RECOSOC), 2014. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents a micro-coded multi-core processor featuring reconfigurability and scalability with high energy efficiency for IoT domain-specific applications. By simplifying the control logic and removing the pipelines, the gate count of one core is minimized to 14 K. Meanwhile, all the hardware units are directly controlled and can be reorganized by the long microinstructions. High utilization of the hardware is thus achieved when the micro programs are designed properly. Furthermore, the ISAs for both C and Java have been implemented by micro programs to supply general-purpose programmability. In addition, application-specific instructions can be developed when higher performance is demanded in specific scenarios. Depending on the performance requirement, the activity and working strategies of the cores are adjustable. Moreover, several processors can be connected through the integrated router to construct a network for even higher performance. As a case study, AES encryption is implemented using both C and micro programs. A performance improvement of more than 10 times is achieved when using micro programs on a single core, and of 20 times on two cores.

National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-158848 (URN); 10.1109/ReCoSoC.2014.6861360 (DOI); 000345225900030 (); 2-s2.0-84905650394 (Scopus ID); 978-1-4799-5810-8 (ISBN)
Conference
9th International Symposium on Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC), MAY 26-28, 2014, Montpellier, FRANCE
Note

QC 20150121

Available from: 2015-01-21 Created: 2015-01-12 Last updated: 2015-10-09 Bibliographically approved
8. A 2-mW Multi-mode Router Design with Dual-core Processor in 65 nm LL CMOS for Inter-silicon Communication
2015 (English) Manuscript (preprint) (Other academic)
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-169546 (URN)
Note

Manuscript submitted for publication

QS 201506

Available from: 2015-06-16 Created: 2015-06-16 Last updated: 2015-10-09 Bibliographically approved
9. A 101.4 GOPS/W Reconfigurable and Scalable Control-centric Embedded Processor for Domain-specific Applications
2016 (English) In: Proceedings - IEEE International Symposium on Circuits and Systems, IEEE, 2016, 1746-1749 p. Conference paper, Published paper (Refereed)
Abstract [en]

Increasing energy efficiency and performance while providing customizability and scalability is vital for embedded processors adapting to domain-specific applications such as the Internet of Things. In this paper, we propose a reconfigurable and scalable control-centric architecture, and implement a design consisting of two cores and an on-chip multi-mode router in 65 nm technology. The reconfigurability is enabled by the restructurable sequence mapping table (SMT) and thus by the reorganizable functional units. Owing to the integration of the multi-mode router, an on-chip or inter-chip network for multi-/many-core computing can be composed for performance extension on demand, even in the post-fabrication stage. The control-centric design simplifies the control logic, shrinks the non-functional units and orchestrates the operations to increase the hardware utilization and reduce excessive data movement for high energy efficiency. As a result, the processor can conduct both general-purpose processing with 29% smaller code size and application-specific processing with over 10 times performance improvement when implementing AES by SMT. The dual-core processor consumes 19.7 μW/MHz with a die size of 3.5 mm². The achieved energy efficiency is 101.4 GOPS/W.
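
The two efficiency figures in this abstract use different units, and a short conversion makes them easier to compare. The sketch below is illustrative arithmetic only, with hypothetical function names. It relies on two unit identities: 1 GOPS/W corresponds to 10^9 operations per joule, and 1 μW/MHz equals 1 pJ per clock cycle. The final ratio assumes both figures describe the same operating point, which the abstract does not state.

```python
# Unit conversions for the efficiency figures quoted in the abstract.
# Assumption (not from the paper): the GOPS/W and uW/MHz figures refer to
# the same operating point, so their product is meaningful.

def picojoules_per_op(gops_per_watt: float) -> float:
    """Energy per operation in pJ: 1 GOPS/W is 1e9 ops per joule."""
    return 1e3 / gops_per_watt

def picojoules_per_cycle(microwatt_per_mhz: float) -> float:
    """Energy per clock cycle in pJ: 1 uW/MHz = 1e-6 W / 1e6 Hz = 1 pJ."""
    return microwatt_per_mhz

pj_op = picojoules_per_op(101.4)
pj_cycle = picojoules_per_cycle(19.7)
ops_per_cycle = pj_cycle / pj_op   # implied operations per clock cycle

print(f"{pj_op:.2f} pJ/op, {pj_cycle:.1f} pJ/cycle, "
      f"about {ops_per_cycle:.2f} ops per cycle")
```

Under that assumption the figures imply close to two operations per clock cycle for the dual-core chip.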

Place, publisher, year, edition, pages
IEEE, 2016
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-169547 (URN); 10.1109/ISCAS.2016.7538905 (DOI); 2-s2.0-84983396457 (Scopus ID); 978-147995340-0 (ISBN)
Conference
IEEE International Symposium on Circuits and Systems (ISCAS)
Note

QC 20160613

Available from: 2015-06-16 Created: 2015-06-16 Last updated: 2016-12-15 Bibliographically approved

Open Access in DiVA

Thesis (2661 kB), 924 downloads
File information
File name: FULLTEXT01.pdf
File size: 2661 kB
Checksum (SHA-512):
9f13daadbec829663b4a3e56a4c2e103c2a8cae5d7495f229c39aa4b9c4d83fbd45222b6719239aacbde5b56990400f0414e16505d2068220d12ea3da467f7df
Type: fulltext
Mimetype: application/pdf

Authority records BETA

Ma, Ning

Total: 924 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 1562 hits