Reliability and failure impact analysis of distributed storage systems with dynamic refuging

Hiroaki Akutsu, Kazunori Ueda, Takeru Chiba, Tomohiro Kawaguchi, Norio Shimozono

    Research output: Contribution to journal, Article

    Abstract

    In recent data centers, large-scale storage systems for big data comprise thousands of large-capacity drives. Our goal is to establish a method for building highly reliable storage systems from more than a thousand low-cost, large-capacity drives. Some large-scale storage systems protect data with erasure coding to prevent data loss. Raising the redundancy level of erasure coding lowers the probability of data loss, but it also increases the write overhead in normal operation and the additional storage needed for coding. We therefore need to achieve high reliability at the lowest possible redundancy level. There are two reliability concerns in large-scale storage systems: (i) as the number of drives increases, systems become more susceptible to multiple drive failures, and (ii) distributing stripes among many drives shortens the rebuild time but increases the risk of data loss from multiple drive failures. When multiple drive failures cause data loss, many users of the storage system are affected. Prior quantitative reliability studies based on realistic settings did not address these concerns. In this work, we analyze the reliability of large-scale storage systems with distributed stripes, focusing on an effective rebuild method that we call Dynamic Refuging. Dynamic Refuging rebuilds failed blocks starting from those with the lowest redundancy and strategically selects the blocks to read when repairing lost data. We modeled the dynamic change in the amount of storage at each redundancy level caused by multiple drive failures, and performed a reliability analysis by Monte Carlo simulation using realistic drive failure characteristics. We also present a failure impact model and a method for localizing failures.
    When stripes with redundancy level 3 were sufficiently distributed and rebuilt by Dynamic Refuging, the proposed technique scaled well, and the probability of data loss for systems with a thousand drives decreased by two orders of magnitude compared with normal RAID. An appropriate stripe distribution level could localize the failure.
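
    The rebuild ordering described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the stripe layout, the drive numbers, and the function names (remaining_redundancy, rebuild_order, storage_by_redundancy) are hypothetical, and the strategic selection of blocks to read is not modeled. It only shows how stripes might be grouped by remaining redundancy after several drive failures and queued so that the most at-risk stripes are rebuilt first.

```python
from collections import Counter


def remaining_redundancy(stripe_drives, failed_drives, redundancy_level):
    """Redundancy left in one stripe: the redundancy level minus blocks lost."""
    lost = sum(1 for drive in stripe_drives if drive in failed_drives)
    return redundancy_level - lost


def rebuild_order(stripes, failed_drives, redundancy_level):
    """Return (stripe_id, remaining) for damaged but recoverable stripes.

    Stripes with the lowest remaining redundancy come first. Stripes whose
    remaining redundancy is negative are already lost and are excluded.
    """
    damaged = []
    for stripe_id, drives in stripes.items():
        rem = remaining_redundancy(drives, failed_drives, redundancy_level)
        if 0 <= rem < redundancy_level:
            damaged.append((stripe_id, rem))
    return sorted(damaged, key=lambda item: (item[1], item[0]))


def storage_by_redundancy(stripes, failed_drives, redundancy_level):
    """Count stripes at each remaining redundancy level (negative = data loss)."""
    return Counter(
        remaining_redundancy(drives, failed_drives, redundancy_level)
        for drives in stripes.values()
    )


# Hypothetical layout: 4 stripes of 6 blocks (3 data + 3 redundant) on 8 drives.
stripes = {
    "s0": [0, 1, 2, 3, 4, 5],
    "s1": [2, 3, 4, 5, 6, 7],
    "s2": [0, 1, 4, 5, 6, 7],
    "s3": [0, 1, 2, 3, 6, 7],
}
failed = {0, 2, 3}  # assumed set of simultaneously failed drives

order = rebuild_order(stripes, failed, redundancy_level=3)
levels = storage_by_redundancy(stripes, failed, redundancy_level=3)
print(order)
print(dict(levels))
```

    In this example, stripes s0 and s3 have no redundancy left and are queued first, followed by s1 and then s2. A Monte Carlo study in the spirit of the abstract would repeat such a step over randomly sampled failure times and track how the per-level counts evolve during rebuild.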

    Original language: English
    Pages (from-to): 2259-2268
    Number of pages: 10
    Journal: IEICE Transactions on Information and Systems
    Volume: E99D
    Issue number: 9
    DOI: 10.1587/transinf.2016EDP7139
    Publication status: Published - 2016 Sep 1

    Fingerprint

    Redundancy
    Reliability analysis
    Costs

    Keywords

    • Erasure coding
    • Highly redundant storage systems
    • Monte Carlo simulation
    • Rebuild
    • Reliability

    ASJC Scopus subject areas

    • Software
    • Hardware and Architecture
    • Computer Vision and Pattern Recognition
    • Artificial Intelligence
    • Electrical and Electronic Engineering

    Cite this

    Reliability and failure impact analysis of distributed storage systems with dynamic refuging. / Akutsu, Hiroaki; Ueda, Kazunori; Chiba, Takeru; Kawaguchi, Tomohiro; Shimozono, Norio.

    In: IEICE Transactions on Information and Systems, Vol. E99D, No. 9, 01.09.2016, p. 2259-2268.

    Akutsu, Hiroaki ; Ueda, Kazunori ; Chiba, Takeru ; Kawaguchi, Tomohiro ; Shimozono, Norio. / Reliability and failure impact analysis of distributed storage systems with dynamic refuging. In: IEICE Transactions on Information and Systems. 2016 ; Vol. E99D, No. 9. pp. 2259-2268.
    @article{4a9eccc625a84d73a1af0aae4593b01a,
    title = "Reliability and failure impact analysis of distributed storage systems with dynamic refuging",
    keywords = "Erasure coding, Highly redundant storage systems, Monte Carlo simulation, Rebuild, Reliability",
    author = "Hiroaki Akutsu and Kazunori Ueda and Takeru Chiba and Tomohiro Kawaguchi and Norio Shimozono",
    year = "2016",
    month = "9",
    day = "1",
    doi = "10.1587/transinf.2016EDP7139",
    language = "English",
    volume = "E99D",
    pages = "2259--2268",
    journal = "IEICE Transactions on Information and Systems",
    issn = "0916-8532",
    publisher = "Maruzen Co., Ltd/Maruzen Kabushikikaisha",
    number = "9",

    }

    TY - JOUR

    T1 - Reliability and failure impact analysis of distributed storage systems with dynamic refuging

    AU - Akutsu, Hiroaki

    AU - Ueda, Kazunori

    AU - Chiba, Takeru

    AU - Kawaguchi, Tomohiro

    AU - Shimozono, Norio

    PY - 2016/9/1

    Y1 - 2016/9/1

    KW - Erasure coding

    KW - Highly redundant storage systems

    KW - Monte Carlo simulation

    KW - Rebuild

    KW - Reliability

    UR - http://www.scopus.com/inward/record.url?scp=84984908596&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84984908596&partnerID=8YFLogxK

    U2 - 10.1587/transinf.2016EDP7139

    DO - 10.1587/transinf.2016EDP7139

    M3 - Article

    VL - E99D

    SP - 2259

    EP - 2268

    JO - IEICE Transactions on Information and Systems

    JF - IEICE Transactions on Information and Systems

    SN - 0916-8532

    IS - 9

    ER -