Huan, Yuxiang
Publications (10 of 16)
Xu, J., Huan, Y., Huang, B., Chu, H., Jin, Y., Zheng, L. & Zou, Z. (2021). A Memory-Efficient CNN Accelerator Using Segmented Logarithmic Quantization and Multi-Cluster Architecture. IEEE Transactions on Circuits and Systems - II - Express Briefs, 68(6), 2142-2146
2021 (English). In: IEEE Transactions on Circuits and Systems - II - Express Briefs, ISSN 1549-7747, E-ISSN 1558-3791, Vol. 68, no 6, p. 2142-2146. Article in journal (Refereed), Published
Abstract [en]

This paper presents a memory-efficient CNN accelerator design for resource-constrained devices in Internet of Things (IoT) and autonomous systems. A segmented logarithmic (SegLog) quantization method is exploited to reduce on-chip memory and bandwidth requirements, thus accommodating more processing elements (PEs) in a given chip area to organize a reconfigurable multi-cluster architecture. Such algorithm-architecture joint optimization improves the utilization and efficiency of memory resources. SegLog quantization adopting mixed bases optimizes fixed-points placement in different segmentations and improves network accuracy at low-precision representation, while the multi-cluster architecture can reorganize PEs to adapt to various CNN models for efficient dataflow and multi-level data reuse. The evaluation results show that SegLog quantization can achieve 6.4× model compression with 1.73%, 0.74%, 2.11%, and 1.76% accuracy penalty on AlexNet, VGG16, ResNet34, and DenseNet161, respectively. An ASIC implementation with a 168-PE configuration is validated in a 40-nm CMOS process, with 2.54 TOPs/W energy efficiency and 0.8 mm² chip area reported. Thanks to the extensibility of the architecture, the accelerator has also been implemented on FPGA with 1512 PEs and 468 kB of on-chip memory. It delivers up to 604.8 GOPs performance at 200 MHz, corresponding to a 1.29 GOPs/kB memory efficiency. Compared with state-of-the-art accelerators, our ASIC implementation enhances area efficiency and arithmetic intensity by 1.94× and 5.62×, while the FPGA implementation improves memory efficiency by a factor of 2.34.
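The FPGA figures quoted in this abstract can be cross-checked with a short calculation. The sketch below is only a sanity check, not the authors' evaluation method: it assumes that every PE completes one multiply-accumulate (counted as 2 operations) per clock cycle, which the abstract does not state. The function name `peak_gops` is introduced here for illustration.

```python
# Sanity check of the FPGA throughput and memory-efficiency figures.
# ASSUMPTION (not stated in the abstract): each PE performs one
# multiply-accumulate per cycle, counted as 2 operations.

def peak_gops(num_pes, clock_mhz, ops_per_pe_per_cycle=2):
    """Peak throughput in GOPS for num_pes PEs clocked at clock_mhz."""
    return num_pes * ops_per_pe_per_cycle * clock_mhz / 1000.0

fpga_gops = peak_gops(num_pes=1512, clock_mhz=200)   # PEs and clock from the abstract
memory_efficiency = fpga_gops / 468                  # 468 kB on-chip memory

print(f"peak throughput:   {fpga_gops:.1f} GOPS")
print(f"memory efficiency: {memory_efficiency:.2f} GOPS/kB")
```

Under that assumption the calculation reproduces the reported 604.8 GOPs and 1.29 GOPs/kB.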

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2021
Keywords
Convolutional neural network (CNN), Data models, dataflow, Field programmable gate arrays, Internet of Things, Memory management, memory-efficient accelerator, Optimization, quantization, Quantization (signal), System-on-chip
National Category
Embedded Systems
Identifiers
urn:nbn:se:kth:diva-290396 (URN); 10.1109/TCSII.2020.3038897 (DOI); 000655844400081 (ISI); 2-s2.0-85097142993 (Scopus ID)
Note

QC 20210219

Available from: 2021-02-19. Created: 2021-02-19. Last updated: 2023-09-26. Bibliographically approved
Ding, C., Huan, Y., Jia, H., Yan, Y., Yang, F., Zou, Z. & Zheng, L.-R. (2021). An Ultra-Low Latency Multicast Router for Large-Scale Multi-Chip Neuromorphic Processing. In: 2021 IEEE 3rd international conference on artificial intelligence circuits and systems (AICASs). Paper presented at IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS), JUN 06-09, 2021, ELECTR NETWORK. Institute of Electrical and Electronics Engineers (IEEE)
2021 (English). In: 2021 IEEE 3rd international conference on artificial intelligence circuits and systems (AICASs), Institute of Electrical and Electronics Engineers (IEEE), 2021. Conference paper, Published paper (Refereed)
Abstract [en]

Neuromorphic simulation is fundamental to the study of the information processing mechanisms of the human brain and can further inspire application development of event-driven spiking neural networks. However, large-scale neuromorphic simulation requires massive parallelism on multi-chip processing and imposes great challenges in dealing with data transmission latency and congestion between chips, especially when the number of simulated neurons reaches billions or even trillions. In this paper, we propose an ultra-low-latency on-chip router together with a multicast routing algorithm that focuses on reducing global loads and balancing loads between links. Additionally, we build a large-scale neuromorphic simulation platform consisting of 64 FPGA chips and evaluate the proposed design on it. The experimental results suggest that this design benefits from the proposed multicast routing algorithm in global communication loads and simulation capacity. This work reduces global loads by 4.1% to 5.2% compared with previous works and achieves a latency as low as 25 ns and a maximum data throughput of 6.25 Gbps/chip.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2021
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-306449 (URN); 10.1109/AICAS51828.2021.9458445 (DOI); 000722241000021 (ISI); 2-s2.0-85113322064 (Scopus ID)
Conference
IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS), JUN 06-09, 2021, ELECTR NETWORK
Note

QC 20211217

conference ISBN 978-1-6654-1913-0

Available from: 2021-12-17. Created: 2021-12-17. Last updated: 2022-10-24. Bibliographically approved
Huang, B., Huan, Y., Chu, H., Xu, J., Liu, L., Zheng, L. & Zou, Z. (2021). IECA: An In-Execution Configuration CNN Accelerator With 30.55 GOPS/mm² Area Efficiency. IEEE Transactions on Circuits and Systems Part 1: Regular Papers, 68(11), 4672-4685
2021 (English). In: IEEE Transactions on Circuits and Systems Part 1: Regular Papers, ISSN 1549-8328, E-ISSN 1558-0806, Vol. 68, no 11, p. 4672-4685. Article in journal (Refereed), Published
Abstract [en]

It remains challenging for a Convolutional Neural Network (CNN) accelerator to maintain high hardware utilization and low processing latency with restricted on-chip memory. This paper presents an In-Execution Configuration Accelerator (IECA) that realizes an efficient control scheme, exploring architectural data reuse, unified in-execution controlling, and pipelined latency hiding to minimize configuration overhead out of the computation scope. The proposed IECA achieves row-wise convolution with tiny distributed buffers and reduces the size of total on-chip memory by removing 40% of redundant memory storage with shared delay chains. By exploiting a reconfigurable Sequence Mapping Table (SMT) and Finite State Machine (FSM) control, the chip realizes cycle-accurate Processing Element (PE) control, automatic loop tiling, and latency hiding without extra time slots for pre-configuration. Evaluated on AlexNet and VGG-16, the IECA retains over 97.3% PE utilization and over 95.6% memory access time hiding on average. The chip is designed and fabricated in a UMC 55-nm process, runs at a frequency of 250 MHz, and achieves an area efficiency of 30.55 GOPS/mm² and 0.244 GOPS/KGE (kilo-gate-equivalent), improvements of over 2.0× and 2.1×, respectively, compared with previous related works. The implementation of the IEC control scheme occupies only 0.55% of the area of the 2.75 mm² core.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2021
Keywords
Convolutional neural network (CNN), area-efficient, accelerator, in-execution configuration
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-305370 (URN); 10.1109/TCSI.2021.3108762 (DOI); 000716698600026 (ISI); 2-s2.0-85119498618 (Scopus ID)
Note

QC 20211203

Available from: 2021-12-03. Created: 2021-12-03. Last updated: 2022-10-24. Bibliographically approved
Jin, Y., Cai, J., Xu, J., Huan, Y., Yan, Y., Huang, B., . . . Zou, Z. (2021). Self-aware distributed deep learning framework for heterogeneous IoT edge devices. Future Generation Computer Systems, 125, 908-920
2021 (English). In: Future Generation Computer Systems, ISSN 0167-739X, E-ISSN 1872-7115, Vol. 125, p. 908-920. Article in journal (Refereed), Published
Abstract [en]

Implementing artificial intelligence (AI) in the Internet of Things (IoT) involves a move from the cloud to the heterogeneous and low-power edge, following an urgent demand for deploying complex training tasks in a distributed and reliable manner. This work proposes a self-aware distributed deep learning (DDL) framework for IoT applications, which is applicable to heterogeneous edge devices and aims to improve adaptivity and amortize the training cost. The self-aware design, including the dynamic self-organizing approach and the self-healing method, enhances the system reliability and resilience. Three typical edge devices are adopted with cross-platform Docker deployment: Personal Computers (PC) for general computing devices, Raspberry Pi 4Bs (Rpi) for resource-constrained edge devices, and Jetson Nanos (Jts) for AI-enabled edge devices. Benchmarked with ResNet-32 on CIFAR-10, the training efficiency of tested distributed clusters is increased by 8.44x compared to the standalone Rpi. The cluster with 11 heterogeneous edge devices achieves a training efficiency of 200.4 images/s and an accuracy of 92.45%. Results show that the self-organizing approach functions well with dynamic changes such as devices being removed or added. The self-healing method is evaluated with various stabilities, cluster scales, and breakdown cases, demonstrating that reliability can be substantially improved for extensively distributed deployments. The proposed DDL framework shows excellent performance for training implementation with heterogeneous edge devices in IoT applications with high-degree scalability and reliability.

Place, publisher, year, edition, pages
Elsevier BV, 2021
Keywords
Internet of Things (IoT), Edge computing, Distributed deep learning, Deep neural networks, Self-awareness
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-300942 (URN); 10.1016/j.future.2021.07.010 (DOI); 000687315100015 (ISI); 2-s2.0-85111584461 (Scopus ID)
Note

QC 20210903

Available from: 2021-09-03. Created: 2021-09-03. Last updated: 2024-09-04. Bibliographically approved
Xu, J., Huan, Y., Zheng, L.-R. & Zou, Z. (2019). A Low-Power Arithmetic Element for Multi-Base Logarithmic Computation on Deep Neural Networks. In: International System on Chip Conference. Paper presented at 31st IEEE International System on Chip Conference, SOCC 2018, 4 September 2018 through 7 September 2018 (pp. 260-265). IEEE Computer Society
2019 (English). In: International System on Chip Conference, IEEE Computer Society, 2019, p. 260-265. Conference paper, Published paper (Refereed)
Abstract [en]

Computational complexity and memory intensity are crucial in deep convolutional neural network algorithms for deployment to embedded systems. Recent advances in logarithmic quantization have shown great potential in reducing the inference cost of neural network models. However, current base-2 logarithmic quantization suffers from a performance ceiling, and few works study the hardware implementation of other bases. This paper presents a multi-base logarithmic scheme for Deep Neural Networks (DNNs). The performance of AlexNet is studied with respect to different quantization resolutions. Base-√2 logarithmic quantization is able to raise the ceiling of top-5 classification accuracy from 69.3% to 75.5% at 5-bit resolution. A segmented logarithmic quantization method that combines both base-2 and base-√2 is then proposed to improve the network top-5 accuracy to 72.3% at 4-bit resolution. The corresponding arithmetic element hardware has been designed, which supports base-√2 logarithmic quantization and segmented logarithmic quantization, respectively. Evaluated in a UMC 65 nm process, the proposed arithmetic element operating at 500 MHz and 1.2 V consumes as little as 120 μW. Compared with a 16-bit fixed-point multiplier, our design is 58.03% smaller in area and reduces energy by 73.74%.
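The idea of logarithmic quantization with a configurable base can be illustrated with a minimal sketch. This is not the quantizer from the paper: the function name `log_quantize`, the bit layout (one sign bit, remaining bits indexing exponents 0, -1, -2, ...), and the assumption that weights are normalized to |w| <= 1 are all hypothetical choices made here for illustration. The segmented scheme that mixes the two bases is not reproduced, because the abstract does not specify how the segments are chosen.

```python
import math

def log_quantize(w, base, bits):
    """Round |w| to the nearest power of `base` (in the log domain), keeping the sign.

    Hypothetical layout: 1 sign bit, and (bits - 1) bits that index the
    exponents 0, -1, ..., -(2**(bits - 1) - 1). Assumes |w| <= 1.
    """
    if w == 0.0:
        return 0.0
    levels = 2 ** (bits - 1)
    exponent = round(math.log(abs(w), base))
    exponent = max(-(levels - 1), min(0, exponent))  # clip to the representable range
    return math.copysign(base ** exponent, w)

w = 0.7
q_base2 = log_quantize(w, 2.0, bits=5)              # nearest power of 2
q_sqrt2 = log_quantize(w, math.sqrt(2.0), bits=5)   # nearest power of sqrt(2)
print(q_base2, abs(w - q_base2))
print(q_sqrt2, abs(w - q_sqrt2))
```

For this example value, the base-√2 grid has a level closer to the weight than the base-2 grid does, which is the finer resolution the abstract refers to. The trade-off in this sketch is dynamic range: with the same number of exponent codes, base √2 spans only half as many octaves as base 2.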

Place, publisher, year, edition, pages
IEEE Computer Society, 2019
Keywords
Embedded systems, Low power electronics, Neural networks, Programmable logic controllers, Convolutional neural network, Energy reduction, Fixed points, Hardware implementations, Logarithmic computation, Low power arithmetic, Neural network model, Quantization resolution, Deep neural networks
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-248272 (URN); 10.1109/SOCC.2018.8618560 (DOI); 000462047000009 (ISI); 2-s2.0-85062221602 (Scopus ID)
Conference
31st IEEE International System on Chip Conference, SOCC 2018, 4 September 2018 through 7 September 2018
Note

QC 20190411

Part of ISBN 9781538614907

Available from: 2019-04-11. Created: 2019-04-11. Last updated: 2024-10-15. Bibliographically approved
Chu, H., Huan, Y., Bao, D., Kallback, B., Qin, Y., Zou, Z. & Zheng, L. (2019). An ASIC Design of Multi-Electrode Digital Basket Catheter Systems with Reconfigurable Compressed Sampling. In: International System on Chip Conference. Paper presented at 31st IEEE International System on Chip Conference, SOCC 2018, 4 September 2018 through 7 September 2018 (pp. 308-313). IEEE Computer Society
2019 (English). In: International System on Chip Conference, IEEE Computer Society, 2019, p. 308-313. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents an Application Specific Integrated Circuit (ASIC) design with reconfigurable compressed sampling (CS) for multi-electrode basket catheter systems that acquire intracardiac electrograms (IEGMs). This work adopts a reconfigurable CS (ReCS) encoder for near-electrode processing to enable a sub-Nyquist sampling rate and thus improve the system capacity. The ReCS encoder is designed to work with a reconfigurable compression cycle as well as a reconfigurable compression ratio, which makes it suitable for a wide range of signals. This digital ASIC chip is placed at the distal end of the catheter, close to the electrodes, so that all signals are digitized and encoded before transmission to an external receiver. Such an architecture ensures serial data transmission, reducing the number of traces and the size of the catheter, as well as the fabrication complexity. The evaluated area cost of the total digital circuits is 0.046 mm² and the power consumption is 49.1 μW at a 4 MHz clock frequency in a 65 nm process.

Place, publisher, year, edition, pages
IEEE Computer Society, 2019
Keywords
basket catheter, compressed sampling, digital catheter, Catheters, Electrodes, Integrated circuit design, Programmable logic controllers, Rhenium compounds, Signal encoding, Clock frequency, Compressed samplings, Electrode processing, Intracardiac electrograms, Multi-electrode, Serial data transmissions, Sub-Nyquist sampling, System Capacity, Application specific integrated circuits
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-248273 (URN); 10.1109/SOCC.2018.8618535 (DOI); 000462047000023 (ISI); 2-s2.0-85062213797 (Scopus ID)
Conference
31st IEEE International System on Chip Conference, SOCC 2018, 4 September 2018 through 7 September 2018
Note

QC 20190408

Part of ISBN 9781538614907

Available from: 2019-04-08. Created: 2019-04-08. Last updated: 2024-10-15. Bibliographically approved
Huang, B., Huan, Y., Xu, L. D., Zheng, L. & Zou, Z. (2019). Automated trading systems statistical and machine learning methods and hardware implementation: a survey. Enterprise Information Systems, 13(1), 132-144
2019 (English). In: Enterprise Information Systems, ISSN 1751-7575, E-ISSN 1751-7583, Vol. 13, no 1, p. 132-144. Article in journal (Refereed), Published
Abstract [en]

Automated trading, which is also known as algorithmic trading, is a method of using a predesigned computer program to submit a large number of trading orders to an exchange. It is substantially a real-time decision-making system which is under the scope of Enterprise Information System (EIS). With the rapid development of telecommunication and computer technology, the mechanisms underlying automated trading systems have become increasingly diversified. Considerable effort has been exerted by both academia and trading firms towards mining potential factors that may generate significantly higher profits. In this paper, we review studies on trading systems built using various methods and empirically evaluate the methods by grouping them into three types: technical analyses, textual analyses and high-frequency trading. Then, we evaluate the advantages and disadvantages of each method and assess their future prospects.

Place, publisher, year, edition, pages
Taylor & Francis, 2019
Keywords
Survey, algorithmic trading, statistics, machine learning, high frequency trading, hardware implementation
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-240695 (URN); 10.1080/17517575.2018.1493145 (DOI); 000452787000006 (ISI); 2-s2.0-85058227759 (Scopus ID)
Note

QC 20190109

Available from: 2019-01-09. Created: 2019-01-09. Last updated: 2022-10-24. Bibliographically approved
Huan, Y., Xu, J., Zheng, L.-R., Tenhunen, H. & Zou, Z. (2018). A 3D Tiled Low Power Accelerator for Convolutional Neural Network. In: 2018 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS). Paper presented at IEEE International Symposium on Circuits and Systems (ISCAS), MAY 27-30, 2018, Florence, ITALY. IEEE
2018 (English). In: 2018 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS), IEEE, 2018. Conference paper, Published paper (Refereed)
Abstract [en]

It remains a challenge to run Deep Learning on Internet-of-Things devices with stringent power budgets. This paper presents a low-power accelerator for processing Convolutional Neural Networks on embedded devices. The power reduction is realized by exploring data reuse in three different aspects, with regard to convolution, filters, and input features. A systolic-like data flow is proposed and applied to rows of Processing Elements (PEs), which facilitates reusing the data during convolution. Reuse of input features and filters is achieved by arranging the PE array in a 3D tiled architecture whose dimensions are 3 × 14 × 4. Local storage within PEs is therefore reduced and costs only 17.75 kB, which is 20% of the state of the art. With dedicated delay chains in each PE, this accelerator is reconfigurable to suit various parameter settings of convolutional layers. Evaluated in a UMC 65 nm low-leakage process, the accelerator can reach a peak performance of 84 GOPS and consumes only 136 mW at 250 MHz.
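The array dimensions, clock frequency, and peak performance reported in this abstract are mutually consistent under one common convention, as the short check below suggests. It is an illustrative calculation only: it assumes that each PE performs one multiply-accumulate (2 operations) per cycle, which the abstract does not state, and the derived energy-efficiency figure is computed here from the quoted numbers rather than taken from the paper.

```python
import math

# Figures quoted in the abstract.
array_dims = (3, 14, 4)   # 3D tiled PE array
clock_mhz = 250
power_w = 0.136           # 136 mW

# ASSUMPTION (not stated in the abstract): one MAC = 2 operations per PE per cycle.
OPS_PER_PE_PER_CYCLE = 2

num_pes = math.prod(array_dims)
peak_gops = num_pes * OPS_PER_PE_PER_CYCLE * clock_mhz / 1000.0
gops_per_watt = peak_gops / power_w   # derived here, not a figure from the paper

print(f"PEs: {num_pes}, peak: {peak_gops:.0f} GOPS, efficiency: {gops_per_watt:.0f} GOPS/W")
```

With that assumption, the 3 × 14 × 4 array gives 168 PEs and the reported 84 GOPS peak at 250 MHz.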

Place, publisher, year, edition, pages
IEEE, 2018
Series
IEEE International Symposium on Circuits and Systems, ISSN 0271-4302
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-240033 (URN); 10.1109/ISCAS.2018.8351301 (DOI); 000451218701203 (ISI); 2-s2.0-85057087284 (Scopus ID); 978-1-5386-4881-0 (ISBN)
Conference
IEEE International Symposium on Circuits and Systems (ISCAS), MAY 27-30, 2018, Florence, ITALY
Note

QC 20181210

Available from: 2018-12-10. Created: 2018-12-10. Last updated: 2024-01-08. Bibliographically approved
Liu, L., Jin, Y., Liu, Y., Ma, N., Huan, Y., Zou, Z. & Zheng, L. (2018). A Design of Autonomous Error-Tolerant Architectures for Massively Parallel Computing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 26(10), 2143-2154
2018 (English). In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, ISSN 1063-8210, E-ISSN 1557-9999, Vol. 26, no 10, p. 2143-2154. Article in journal (Refereed), Published
Abstract [en]

Massively parallel computing systems, in which many processors are connected on chip, are becoming increasingly complex and unreliable. This paper presents an error-tolerant design based on the autonomous error-tolerant (AET) architecture that aims to have a self-repairing capability. A nearby error sensing mechanism is designed to discover faults, and an active evolution scheme is studied to handle unrecoverable errors. A circuit backup switching mechanism is proposed to bypass the failed nodes. The board-level prototype is implemented based on dual-core embedded processors. The analysis shows that the error-tolerant capability of the proposed architecture is better than that of the conventional multimodular redundant system when the failure rate of a single core is less than 0.7. In the AET test system consisting of 16 processors, the error-tolerant capability is verified. The results show that the relative variation of the overall performance of the AET system will not be changed due to the high reliability requirements of the system. Experimental comparison shows that, when the AET architecture and the triple modular redundancy method are roughly equal in reliability, the former has lower power consumption for both logical-level and physical-level error tolerance.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2018
Keywords
Error tolerant, nanosystem, self-reparation, sensing
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-237109 (URN); 10.1109/TVLSI.2018.2846298 (DOI); 000446332500029 (ISI); 2-s2.0-85049490607 (Scopus ID)
Note

QC 20181030

Available from: 2018-10-30. Created: 2018-10-30. Last updated: 2024-03-18. Bibliographically approved
Xu, J., Huan, Y., Yang, K., Zhan, Y., Zou, Z. & Zheng, L.-R. (2018). Optimized Near-Zero Quantization Method for Flexible Memristor Based Neural Network. IEEE Access, 6, 29320-29331
2018 (English). In: IEEE Access, E-ISSN 2169-3536, Vol. 6, p. 29320-29331. Article in journal (Refereed), Published
Abstract [en]

Due to controllable conductance and non-volatility, flexible memristors are regarded as a key enabler for building artificial neural network (ANN)-based learning algorithms in flexible and wearable systems. However, existing flexible memristors suffer from a limited number of conductance values, issues that limit large-scale integration, and insufficient accuracy to support accurate ANN computation. In this paper, solutions are proposed for these three major challenges of the flexible memristor; the feasibility of a three-layer fully connected neural network on MNIST and a 13-layer convolutional neural network (CNN) on CIFAR-10 using the flexible memristor based on a single-walled carbon nanotube network/polymer composite and hydrophilic Al2O3 dielectric is studied. The evaluation results show that the fully connected neural network system is able to recognize MNIST with an accuracy above 90% after 4-bit quantization, a 52.05% decrease in the number of interconnections in the circuit, and up to 40% random error introduced, and that the CNN on CIFAR-10 can retain an accuracy above 86%, with less than 4% accuracy loss, after 5-bit quantization, a 59.34% decrease in the number of interconnections in the circuit, and up to 40% random error injected.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2018
Keywords
Artificial neural network, flexible memristor, near-zero optimizing, system resilience, weight quantization
National Category
Communication Systems
Identifiers
urn:nbn:se:kth:diva-231644 (URN); 10.1109/ACCESS.2018.2839106 (DOI); 000435521500013 (ISI); 2-s2.0-85047177189 (Scopus ID)
Note

QC 20180904

Available from: 2018-09-04. Created: 2018-09-04. Last updated: 2022-10-24. Bibliographically approved