Failure detection in P2P-grid system

Research output: Article, peer-reviewed

Abstract

Peer-to-peer (P2P)-Grid systems are being investigated as a platform that converges Grid and P2P networks for the construction of large-scale distributed applications. The highly dynamic nature of P2P-Grid systems greatly affects the execution of distributed programs. Uncertainty caused by arbitrary node failure and departure significantly affects the availability of computing resources and system performance. Checkpoint-and-restart is the most common scheme for fault tolerance because it periodically saves the execution progress onto stable storage. In this paper, we suggest a checkpoint-and-restart mechanism as a fault-tolerant method for applications on P2P-Grid systems. A failure detection mechanism is a necessary prerequisite for fault tolerance and fault recovery in general. Given the highly dynamic nature of nodes within P2P-Grid systems, any failure should be detected to ensure effective task execution. Therefore, we study the failure detection mechanism as an integral part of P2P-Grid systems. We discuss how the design of various failure detection algorithms affects their performance in terms of the average failure detection time of nodes. Numerical analysis results and an implementation evaluation are also provided to show the different average failure detection times of various failure detection algorithms in real systems. The comparison shows the shortest average failure detection time, by 8.8 s, on the basis of the WP failure detector. Our lowest mean time to recovery (MTTR) is also shown to have a distinct advantage, with a time reduction of about 5.5 s over its counterparts.
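
To make the notion of average failure detection time concrete, the following is a minimal sketch of a generic heartbeat-with-timeout detector. It is not the WP failure detector or any algorithm from the paper; the function names, the zero-network-delay assumption, and the parameter values are all illustrative assumptions.

```python
import math


def detection_time(crash_time: float, interval: float, timeout: float) -> float:
    """Delay between a node's crash and the moment a monitor suspects it.

    Assumed model (not from the paper): the node sends a heartbeat every
    `interval` seconds starting at t = 0, messages arrive instantly, and the
    monitor suspects the node once `timeout` seconds pass with no heartbeat.
    A heartbeat scheduled exactly at the crash instant is counted as sent.
    """
    if timeout < interval:
        raise ValueError("timeout must be at least one heartbeat interval")
    # Time of the last heartbeat the node managed to send.
    last_heartbeat = math.floor(crash_time / interval) * interval
    # The monitor's timer expires `timeout` seconds after that heartbeat.
    return last_heartbeat + timeout - crash_time


def average_detection_time(interval: float, timeout: float, samples: int = 1000) -> float:
    """Average detection time over crash instants spread evenly across one interval."""
    crash_times = [(k + 0.5) * interval / samples for k in range(samples)]
    total = sum(detection_time(t, interval, timeout) for t in crash_times)
    return total / samples
```

Under these assumptions the average works out to `timeout - interval / 2`, which shows the usual trade-off: a shorter timeout detects failures faster but, in a real network with delays, risks more false suspicions.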

Original language: English
Pages (from-to): 2123-2131
Number of pages: 9
Journal: IEICE Transactions on Information and Systems
Volume: E98D
Issue number: 12
Publication status: Published - Dec 2015

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Vision and Pattern Recognition
  • Electrical and Electronic Engineering
  • Artificial Intelligence
