Reliability analysis of highly redundant distributed storage systems with Dynamic Refuging

Hiroaki Akutsu, Kazunori Ueda, Takeru Chiba, Tomohiro Kawaguchi, Norio Shimozono

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    5 Citations (Scopus)

    Abstract

    In recent data centers, large-scale storage systems storing big data comprise thousands of large-capacity drives. Our goal is to establish a method for building highly reliable storage systems using more than a thousand low-cost large-capacity drives. Some large-scale storage systems protect data by erasure coding to prevent data loss. As the redundancy level of erasure coding is increased, the probability of data loss decreases, but the overhead of normal data write operations and the additional storage needed for coding increase. We therefore need to achieve high reliability at the lowest possible redundancy level. There are two concerns regarding reliability in large-scale storage systems: (i) as the number of drives increases, systems are more subject to multiple drive failures, and (ii) distributing stripes among many drives can shorten the rebuild time but increases the risk of data loss due to multiple drive failures. These concerns were not addressed in prior quantitative reliability studies based on realistic settings. In this work, we analyze the reliability of large-scale storage systems with distributed stripes, focusing on an effective rebuild method which we call Dynamic Refuging. Dynamic Refuging rebuilds failed storage areas starting with those with the lowest redundancy and strategically selects blocks to read for repairing lost data. We modeled the dynamically changing amount of storage at each redundancy level due to multiple drive failures, and performed reliability analysis with Monte Carlo simulation using realistic drive failure characteristics. When stripes with redundancy level 3 were sufficiently distributed and rebuilt by Dynamic Refuging, we found that the probability of data loss decreased by two orders of magnitude for systems with 384 or more drives compared to normal RAID. This technique turned out to scale well, and a system with 1536 inexpensive drives attained lower data loss probability than RAID 6 with 16 enterprise-class drives.
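
    The Monte Carlo approach mentioned in the abstract can be illustrated with a deliberately simplified sketch. This is not the authors' simulator: it does not model Dynamic Refuging, per-redundancy-level storage amounts, or rebuild speed-up from stripe distribution. It assumes exponentially distributed drive failures, a fixed rebuild time, and that data is lost as soon as more than `redundancy` drives are down at once. The function name and all parameters are hypothetical.

```python
import random


def simulate_data_loss(n_drives, redundancy, mttf_hours, rebuild_hours,
                       mission_hours, trials, seed=0):
    """Estimate the probability of data loss within mission_hours.

    Assumptions (illustrative only, not taken from the paper):
      * each drive fails independently with exponential lifetime (mean mttf_hours);
      * every failed drive is rebuilt in a fixed rebuild_hours;
      * data is lost when more than `redundancy` drives are failed at once,
        which is the worst case for fully distributed stripes.
    """
    rng = random.Random(seed)
    # Aggregate failure rate of the whole system. Using n_drives (not the
    # number of currently healthy drives) slightly overestimates the rate.
    rate = n_drives / mttf_hours
    losses = 0
    for _ in range(trials):
        now = 0.0
        rebuild_ends = []  # completion times of rebuilds still in progress
        while True:
            now += rng.expovariate(rate)  # time of the next drive failure
            if now >= mission_hours:
                break  # survived the mission without data loss
            # Drop rebuilds that finished before this failure occurred.
            rebuild_ends = [end for end in rebuild_ends if end > now]
            rebuild_ends.append(now + rebuild_hours)
            if len(rebuild_ends) > redundancy:
                losses += 1  # too many concurrent failures
                break
    return losses / trials


if __name__ == "__main__":
    # Hypothetical parameters: 384 drives, one year mission, 24 h rebuild.
    for r in (1, 2, 3):
        p = simulate_data_loss(384, r, 100_000.0, 24.0, 8760.0, 5000)
        print(f"redundancy {r}: estimated P(data loss) = {p:.4f}")
```

    Shortening `rebuild_hours` in this sketch stands in, very roughly, for a faster rebuild; a model closer to the paper would track how much storage sits at each redundancy level and rebuild the least redundant areas first.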

    Original language: English
    Title of host publication: Proceedings - 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2015
    Publisher: Institute of Electrical and Electronics Engineers Inc.
    Pages: 261-268
    Number of pages: 8
    ISBN (Print): 9781479984909
    DOIs: https://doi.org/10.1109/PDP.2015.32
    Publication status: Published - 2015
    Event: 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2015 - Turku, Finland
    Duration: 2015 Mar 4 to 2015 Mar 6

    Other

    Other: 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2015
    Country: Finland
    City: Turku
    Period: 15/3/4 to 15/3/6

    Fingerprint

    Reliability analysis
    Redundancy
    Costs
    Industry

    Keywords

    • Erasure coding
    • Highly redundant storage systems
    • Monte Carlo simulation
    • Rebuild
    • Reliability

    ASJC Scopus subject areas

    • Hardware and Architecture
    • Computer Networks and Communications

    Cite this

    Akutsu, H., Ueda, K., Chiba, T., Kawaguchi, T., & Shimozono, N. (2015). Reliability analysis of highly redundant distributed storage systems with Dynamic Refuging. In Proceedings - 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2015 (pp. 261-268). [7092730] Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/PDP.2015.32

    @inproceedings{15cf208f84cb4abb807542d5b9c94e20,
    title = "Reliability analysis of highly redundant distributed storage systems with Dynamic Refuging",
    abstract = "In recent data centers, large-scale storage systems storing big data comprise thousands of large-capacity drives. Our goal is to establish a method for building highly reliable storage systems using more than a thousand low-cost large-capacity drives. Some large-scale storage systems protect data by erasure coding to prevent data loss. As the redundancy level of erasure coding is increased, the probability of data loss decreases, but the overhead of normal data write operations and the additional storage needed for coding increase. We therefore need to achieve high reliability at the lowest possible redundancy level. There are two concerns regarding reliability in large-scale storage systems: (i) as the number of drives increases, systems are more subject to multiple drive failures, and (ii) distributing stripes among many drives can shorten the rebuild time but increases the risk of data loss due to multiple drive failures. These concerns were not addressed in prior quantitative reliability studies based on realistic settings. In this work, we analyze the reliability of large-scale storage systems with distributed stripes, focusing on an effective rebuild method which we call Dynamic Refuging. Dynamic Refuging rebuilds failed storage areas starting with those with the lowest redundancy and strategically selects blocks to read for repairing lost data. We modeled the dynamically changing amount of storage at each redundancy level due to multiple drive failures, and performed reliability analysis with Monte Carlo simulation using realistic drive failure characteristics. When stripes with redundancy level 3 were sufficiently distributed and rebuilt by Dynamic Refuging, we found that the probability of data loss decreased by two orders of magnitude for systems with 384 or more drives compared to normal RAID. This technique turned out to scale well, and a system with 1536 inexpensive drives attained lower data loss probability than RAID 6 with 16 enterprise-class drives.",
    keywords = "Erasure coding, Highly redundant storage systems, Monte Carlo simulation, Rebuild, Reliability",
    author = "Hiroaki Akutsu and Kazunori Ueda and Takeru Chiba and Tomohiro Kawaguchi and Norio Shimozono",
    year = "2015",
    doi = "10.1109/PDP.2015.32",
    language = "English",
    isbn = "9781479984909",
    pages = "261--268",
    booktitle = "Proceedings - 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2015",
    publisher = "Institute of Electrical and Electronics Engineers Inc.",

    }

    TY - GEN

    T1 - Reliability analysis of highly redundant distributed storage systems with Dynamic Refuging

    AU - Akutsu, Hiroaki

    AU - Ueda, Kazunori

    AU - Chiba, Takeru

    AU - Kawaguchi, Tomohiro

    AU - Shimozono, Norio

    PY - 2015

    Y1 - 2015

    AB - In recent data centers, large-scale storage systems storing big data comprise thousands of large-capacity drives. Our goal is to establish a method for building highly reliable storage systems using more than a thousand low-cost large-capacity drives. Some large-scale storage systems protect data by erasure coding to prevent data loss. As the redundancy level of erasure coding is increased, the probability of data loss decreases, but the overhead of normal data write operations and the additional storage needed for coding increase. We therefore need to achieve high reliability at the lowest possible redundancy level. There are two concerns regarding reliability in large-scale storage systems: (i) as the number of drives increases, systems are more subject to multiple drive failures, and (ii) distributing stripes among many drives can shorten the rebuild time but increases the risk of data loss due to multiple drive failures. These concerns were not addressed in prior quantitative reliability studies based on realistic settings. In this work, we analyze the reliability of large-scale storage systems with distributed stripes, focusing on an effective rebuild method which we call Dynamic Refuging. Dynamic Refuging rebuilds failed storage areas starting with those with the lowest redundancy and strategically selects blocks to read for repairing lost data. We modeled the dynamically changing amount of storage at each redundancy level due to multiple drive failures, and performed reliability analysis with Monte Carlo simulation using realistic drive failure characteristics. When stripes with redundancy level 3 were sufficiently distributed and rebuilt by Dynamic Refuging, we found that the probability of data loss decreased by two orders of magnitude for systems with 384 or more drives compared to normal RAID. This technique turned out to scale well, and a system with 1536 inexpensive drives attained lower data loss probability than RAID 6 with 16 enterprise-class drives.

    KW - Erasure coding

    KW - Highly redundant storage systems

    KW - Monte Carlo simulation

    KW - Rebuild

    KW - Reliability

    UR - http://www.scopus.com/inward/record.url?scp=84957672812&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84957672812&partnerID=8YFLogxK

    U2 - 10.1109/PDP.2015.32

    DO - 10.1109/PDP.2015.32

    M3 - Conference contribution

    SN - 9781479984909

    SP - 261

    EP - 268

    BT - Proceedings - 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2015

    PB - Institute of Electrical and Electronics Engineers Inc.

    ER -