A low-overhead soft–hard fault-tolerant architecture, design and management scheme for reliable high-performance many-core 3D-NoC systems

Khanh N. Dang, Michael Conrad Meyer, Yuichi Okuyama, Abderazek Ben Abdallah

Research output: Contribution to journal › Article

4 Citations (Scopus)

Abstract

The Network-on-Chip (NoC) paradigm has been proposed as a favorable solution to handle the strict communication requirements between the increasingly large number of cores on a single chip. However, NoC systems are exposed to the aggressive scaling down of transistors, low operating voltages, and high integration and power densities, making them vulnerable to permanent (hard) faults and transient (soft) errors. A hard fault in a NoC can lead to external blocking, causing congestion across the whole network. A soft error is more challenging because of its silent data corruption, which leads to a large area of erroneous data due to error propagation, packet re-transmission, and deadlock. In this paper, we present the architecture and design of a comprehensive soft error and hard fault-tolerant 3D-NoC system, named 3D-Hard-Fault-Soft-Error-Tolerant-OASIS-NoC (3D-FETO). With the aid of efficient mechanisms and algorithms, 3D-FETO is capable of detecting and recovering from soft errors which occur in the routing pipeline stages and leverages reconfigurable components to handle permanent faults in links, input buffers, and crossbars. In-depth evaluation results show that the 3D-FETO system is able to work around different kinds of hard faults and soft errors, ensuring graceful performance degradation, while minimizing additional hardware complexity and remaining power efficient.

Original language: English
Pages (from-to): 2705-2729
Number of pages: 25
Journal: Journal of Supercomputing
Volume: 73
Issue number: 6
DOI: 10.1007/s11227-016-1951-0
Publication status: Published - 2017 Jun 1
Externally published: Yes

Fingerprint

Soft Error
Many-core
Fault-tolerant
High Performance
Fault
Error Propagation
Architecture
Network on chip
Design
Network-on-chip
Deadlock
Congestion
Leverage
Buffer
Degradation
Routing
Chip
Voltage
Paradigm
Transistors

Keywords

  • 3D NoCs
  • Architecture
  • Design
  • Fault-tolerance
  • Reliability
  • Soft–hard faults

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Information Systems
  • Hardware and Architecture

Cite this

A low-overhead soft–hard fault-tolerant architecture, design and management scheme for reliable high-performance many-core 3D-NoC systems. / Dang, Khanh N.; Meyer, Michael Conrad; Okuyama, Yuichi; Abdallah, Abderazek Ben.

In: Journal of Supercomputing, Vol. 73, No. 6, 01.06.2017, p. 2705-2729.

Dang, Khanh N. ; Meyer, Michael Conrad ; Okuyama, Yuichi ; Abdallah, Abderazek Ben. / A low-overhead soft–hard fault-tolerant architecture, design and management scheme for reliable high-performance many-core 3D-NoC systems. In: Journal of Supercomputing. 2017 ; Vol. 73, No. 6. pp. 2705-2729.
@article{0aee6719f9c544fbb36a4eb851731ec0,
title = "A low-overhead soft–hard fault-tolerant architecture, design and management scheme for reliable high-performance many-core 3D-NoC systems",
keywords = "3D NoCs, Architecture, Design, Fault-tolerance, Reliability, Soft–hard faults",
author = "Dang, {Khanh N.} and Meyer, {Michael Conrad} and Okuyama, Yuichi and Abdallah, {Abderazek Ben}",
year = "2017",
month = "6",
day = "1",
doi = "10.1007/s11227-016-1951-0",
language = "English",
volume = "73",
pages = "2705--2729",
journal = "Journal of Supercomputing",
issn = "0920-8542",
publisher = "Springer Netherlands",
number = "6",

}

TY - JOUR

T1 - A low-overhead soft–hard fault-tolerant architecture, design and management scheme for reliable high-performance many-core 3D-NoC systems

AU - Dang, Khanh N.

AU - Meyer, Michael Conrad

AU - Okuyama, Yuichi

AU - Abdallah, Abderazek Ben

PY - 2017/6/1

Y1 - 2017/6/1

KW - 3D NoCs

KW - Architecture

KW - Design

KW - Fault-tolerance

KW - Reliability

KW - Soft–hard faults

UR - http://www.scopus.com/inward/record.url?scp=85010767436&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85010767436&partnerID=8YFLogxK

U2 - 10.1007/s11227-016-1951-0

DO - 10.1007/s11227-016-1951-0

M3 - Article

AN - SCOPUS:85010767436

VL - 73

SP - 2705

EP - 2729

JO - Journal of Supercomputing

JF - Journal of Supercomputing

SN - 0920-8542

IS - 6

ER -