TY - JOUR
T1 - Efficient checkpointing with recompute scheme for non-volatile main memory
AU - Alshboul, Mohammad
AU - Elnawawy, Hussein
AU - Elkhouly, Reem
AU - Kimura, Keiji
AU - Tuck, James
AU - Solihin, Yan
N1 - Funding Information:
Extension of Conference Paper: This article is an extension of work originally presented in PACT 2017 [23]. This work is supported in part by the National Science Foundation through awards CNS-171748, 1829142, and 1914717. Opinions expressed in this article are solely the authors’ and not necessarily those of NSF. Alshboul is a Ph.D. student of Solihin. Elnawawy was a Ph.D. student of Solihin at the time of his contribution to this work. Authors’ addresses: M. Alshboul and H. Elnawawy, North Carolina State University, USA; emails: {maalshbo, hmelnawa}@ncsu.edu; R. Elkhouly, Tanta University, Egypt and Waseda University, Japan; email: reem_elkhouly@f-eng.tanta.edu.eg; K. Kimura, Waseda University, Japan; J. Tuck, North Carolina State University, USA; email: jtuck@ncsu.edu; Y. Solihin, University of Central Florida, USA; email: yan.solihin@ucf.edu. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM. 1544-3566/2019/05-ART18 https://doi.org/10.1145/3323091
Publisher Copyright:
© 2019 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2019/5
Y1 - 2019/5
N2 - Future main memory will likely include Non-Volatile Memory. Non-Volatile Main Memory (NVMM) provides an opportunity to rethink checkpointing strategies for providing failure safety to applications. While there are many checkpointing and logging schemes in the literature, their use must be revisited as they incur high execution time overheads as well as a large number of additional writes to NVMM, which may significantly impact write endurance. In this article, we propose a novel recompute-based failure safety approach and demonstrate its applicability to loop-based code. Rather than keeping a fully consistent logging state, we only log enough state to enable recomputation. Upon a failure, our approach recovers to a consistent state by determining which parts of the computation were not completed and recomputing them. Effectively, our approach removes the need to keep checkpoints or logs, thus reducing execution time overheads and improving NVMM write endurance, at the expense of more complex recovery. We compare our new approach against logging and checkpointing on five scientific workloads, including tiled matrix multiplication, on a computer system model that was built on gem5 and supports Intel PMEM instruction extensions. For tiled matrix multiplication, our recompute approach incurs an execution time overhead of only 5%, in contrast to 8% overhead with logging and 207% overhead with checkpointing. Furthermore, recompute only adds 7% additional NVMM writes, compared to 111% with logging and 330% with checkpointing. We also conduct experiments on real hardware, allowing us to run our workloads to completion while varying the number of threads used for computation. These experiments substantiate our simulation-based observations and provide a sensitivity study and performance comparison between the Recompute Scheme and Naive Checkpointing.
AB - Future main memory will likely include Non-Volatile Memory. Non-Volatile Main Memory (NVMM) provides an opportunity to rethink checkpointing strategies for providing failure safety to applications. While there are many checkpointing and logging schemes in the literature, their use must be revisited as they incur high execution time overheads as well as a large number of additional writes to NVMM, which may significantly impact write endurance. In this article, we propose a novel recompute-based failure safety approach and demonstrate its applicability to loop-based code. Rather than keeping a fully consistent logging state, we only log enough state to enable recomputation. Upon a failure, our approach recovers to a consistent state by determining which parts of the computation were not completed and recomputing them. Effectively, our approach removes the need to keep checkpoints or logs, thus reducing execution time overheads and improving NVMM write endurance, at the expense of more complex recovery. We compare our new approach against logging and checkpointing on five scientific workloads, including tiled matrix multiplication, on a computer system model that was built on gem5 and supports Intel PMEM instruction extensions. For tiled matrix multiplication, our recompute approach incurs an execution time overhead of only 5%, in contrast to 8% overhead with logging and 207% overhead with checkpointing. Furthermore, recompute only adds 7% additional NVMM writes, compared to 111% with logging and 330% with checkpointing. We also conduct experiments on real hardware, allowing us to run our workloads to completion while varying the number of threads used for computation. These experiments substantiate our simulation-based observations and provide a sensitivity study and performance comparison between the Recompute Scheme and Naive Checkpointing.
KW - Computer architecture
KW - Emerging memory technologies
KW - Memory systems
UR - http://www.scopus.com/inward/record.url?scp=85069190431&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85069190431&partnerID=8YFLogxK
U2 - 10.1145/3323091
DO - 10.1145/3323091
M3 - Article
AN - SCOPUS:85069190431
VL - 16
JO - Transactions on Architecture and Code Optimization
JF - Transactions on Architecture and Code Optimization
SN - 1544-3566
IS - 2
M1 - 18
ER -