Failure detection in P2P-grid system

    Research output: Contribution to journalArticle

    Abstract

    Peer-to-peer (P2P)-Grid systems are being investigated as a platform for converging the Grid and P2P network in the construction of large-scale distributed applications. The highly dynamic nature of P2P-Grid systems greatly affects the execution of the distributed program. Uncertainty caused by arbitrary node failure and departure significantly affects the availability of computing resources and system performance. Checkpoint-and-restart is the most common scheme for fault tolerance because it periodically saves the execution progress onto stable storage. In this paper, we suggest a checkpoint-and-restart mechanism as a fault-tolerant method for applications on P2P-Grid systems. Failure detection mechanism is a necessary prerequisite to fault tolerance and fault recovery in general. Given the highly dynamic nature of nodes within P2P-Grid systems, any failure should be detected to ensure effective task execution. Therefore, failure detection mechanism as an integral part of P2P-Grid systems was studied. We discussed how the design of various failure detection algorithms affects their performance in average failure detection time of nodes. Numerical analysis results and implementation evaluation are also provided to show different average failure detection times in real systems for various failure detection algorithms. The comparison shows the shortest average failure detection time by 8.8s on basis of the WP failure detector. Our lowest mean time to recovery (MTTR) is also proven to have a distinct advantage with a time consumption reduction of about 5.5s over its counterparts.

    Original languageEnglish
    Pages (from-to)2123-2131
    Number of pages9
    JournalIEICE Transactions on Information and Systems
    VolumeE98D
    Issue number12
    DOIs
    Publication statusPublished - 2015 Dec 1

    Fingerprint

    Fault tolerance
    Recovery
    Numerical analysis
    Availability
    Detectors
    Uncertainty

    Keywords

    • Failure detection
    • Failure recovery
    • Fault tolerance
    • P2P-grid systems

    ASJC Scopus subject areas

    • Electrical and Electronic Engineering
    • Software
    • Artificial Intelligence
    • Hardware and Architecture
    • Computer Vision and Pattern Recognition

    Cite this

    Failure detection in P2P-grid system. / Wang, Huan; Nakazato, Hidenori.

    In: IEICE Transactions on Information and Systems, Vol. E98D, No. 12, 01.12.2015, p. 2123-2131.

    Research output: Contribution to journalArticle

    @article{e8dd6cd3034a4ec69e2be601f9845e6e,
    title = "Failure detection in P2P-grid system",
    abstract = "Peer-to-peer (P2P)-Grid systems are being investigated as a platform for converging the Grid and P2P network in the construction of large-scale distributed applications. The highly dynamic nature of P2P-Grid systems greatly affects the execution of the distributed program. Uncertainty caused by arbitrary node failure and departure significantly affects the availability of computing resources and system performance. Checkpoint-and-restart is the most common scheme for fault tolerance because it periodically saves the execution progress onto stable storage. In this paper, we suggest a checkpoint-and-restart mechanism as a fault-tolerant method for applications on P2P-Grid systems. Failure detection mechanism is a necessary prerequisite to fault tolerance and fault recovery in general. Given the highly dynamic nature of nodes within P2P-Grid systems, any failure should be detected to ensure effective task execution. Therefore, failure detection mechanism as an integral part of P2P-Grid systems was studied. We discussed how the design of various failure detection algorithms affects their performance in average failure detection time of nodes. Numerical analysis results and implementation evaluation are also provided to show different average failure detection times in real systems for various failure detection algorithms. The comparison shows the shortest average failure detection time by 8.8s on basis of the WP failure detector. Our lowest mean time to recovery (MTTR) is also proven to have a distinct advantage with a time consumption reduction of about 5.5s over its counterparts.",
    keywords = "Failure detection, Failure recovery, Fault tolerance, P2P-grid systems",
    author = "Huan Wang and Hidenori Nakazato",
    year = "2015",
    month = "12",
    day = "1",
    doi = "10.1587/transinf.2015PAP0014",
    language = "English",
    volume = "E98D",
    pages = "2123--2131",
    journal = "IEICE Transactions on Information and Systems",
    issn = "0916-8532",
    publisher = "Maruzen Co., Ltd/Maruzen Kabushikikaisha",
    number = "12",

    }

    TY - JOUR

    T1 - Failure detection in P2P-grid system

    AU - Wang, Huan

    AU - Nakazato, Hidenori

    PY - 2015/12/1

    Y1 - 2015/12/1

    N2 - Peer-to-peer (P2P)-Grid systems are being investigated as a platform for converging the Grid and P2P network in the construction of large-scale distributed applications. The highly dynamic nature of P2P-Grid systems greatly affects the execution of the distributed program. Uncertainty caused by arbitrary node failure and departure significantly affects the availability of computing resources and system performance. Checkpoint-and-restart is the most common scheme for fault tolerance because it periodically saves the execution progress onto stable storage. In this paper, we suggest a checkpoint-and-restart mechanism as a fault-tolerant method for applications on P2P-Grid systems. Failure detection mechanism is a necessary prerequisite to fault tolerance and fault recovery in general. Given the highly dynamic nature of nodes within P2P-Grid systems, any failure should be detected to ensure effective task execution. Therefore, failure detection mechanism as an integral part of P2P-Grid systems was studied. We discussed how the design of various failure detection algorithms affects their performance in average failure detection time of nodes. Numerical analysis results and implementation evaluation are also provided to show different average failure detection times in real systems for various failure detection algorithms. The comparison shows the shortest average failure detection time by 8.8s on basis of the WP failure detector. Our lowest mean time to recovery (MTTR) is also proven to have a distinct advantage with a time consumption reduction of about 5.5s over its counterparts.

    AB - Peer-to-peer (P2P)-Grid systems are being investigated as a platform for converging the Grid and P2P network in the construction of large-scale distributed applications. The highly dynamic nature of P2P-Grid systems greatly affects the execution of the distributed program. Uncertainty caused by arbitrary node failure and departure significantly affects the availability of computing resources and system performance. Checkpoint-and-restart is the most common scheme for fault tolerance because it periodically saves the execution progress onto stable storage. In this paper, we suggest a checkpoint-and-restart mechanism as a fault-tolerant method for applications on P2P-Grid systems. Failure detection mechanism is a necessary prerequisite to fault tolerance and fault recovery in general. Given the highly dynamic nature of nodes within P2P-Grid systems, any failure should be detected to ensure effective task execution. Therefore, failure detection mechanism as an integral part of P2P-Grid systems was studied. We discussed how the design of various failure detection algorithms affects their performance in average failure detection time of nodes. Numerical analysis results and implementation evaluation are also provided to show different average failure detection times in real systems for various failure detection algorithms. The comparison shows the shortest average failure detection time by 8.8s on basis of the WP failure detector. Our lowest mean time to recovery (MTTR) is also proven to have a distinct advantage with a time consumption reduction of about 5.5s over its counterparts.

    KW - Failure detection

    KW - Failure recovery

    KW - Fault tolerance

    KW - P2P-grid systems

    UR - http://www.scopus.com/inward/record.url?scp=84948742287&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84948742287&partnerID=8YFLogxK

    U2 - 10.1587/transinf.2015PAP0014

    DO - 10.1587/transinf.2015PAP0014

    M3 - Article

    VL - E98D

    SP - 2123

    EP - 2131

    JO - IEICE Transactions on Information and Systems

    JF - IEICE Transactions on Information and Systems

    SN - 0916-8532

    IS - 12

    ER -