Efficient checkpointing with recompute scheme for non-volatile main memory

Mohammad Alshboul, Hussein Elnawawy, Reem Elkhouly, Keiji Kimura, James Tuck, Yan Solihin

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Future main memory will likely include Non-Volatile Memory. Non-Volatile Main Memory (NVMM) provides an opportunity to rethink checkpointing strategies for providing failure safety to applications. While there are many checkpointing and logging schemes in the literature, their use must be revisited as they incur high execution time overheads as well as a large number of additional writes to NVMM, which may significantly impact write endurance. In this article, we propose a novel recompute-based failure safety approach and demonstrate its applicability to loop-based code. Rather than keeping a fully consistent logging state, we only log enough state to enable recomputation. Upon a failure, our approach recovers to a consistent state by determining which parts of the computation were not completed and recomputing them. Effectively, our approach removes the need to keep checkpoints or logs, thus reducing execution time overheads and improving NVMM write endurance, at the expense of more complex recovery. We compare our new approach against logging and checkpointing on five scientific workloads, including tiled matrix multiplication, on a computer system model that was built on gem5 and supports Intel PMEM instruction extensions. For tiled matrix multiplication, our recompute approach incurs an execution time overhead of only 5%, in contrast to 8% overhead with logging and 207% overhead with checkpointing. Furthermore, recompute only adds 7% additional NVMM writes, compared to 111% with logging and 330% with checkpointing. We also conduct experiments on real hardware, allowing us to run our workloads to completion while varying the number of threads used for computation. These experiments substantiate our simulation-based observations and provide a sensitivity study and performance comparison between the Recompute Scheme and Naive Checkpointing.
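
The recover-by-recomputing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a per-tile completion marker as the only extra persistent state, uses a Python set in place of NVMM, and models a failure with a hypothetical `SimulatedCrash` exception. All names (`run`, `recover`, `tile_origins`, `TILE`) are made up for illustration.

```python
TILE = 2  # tile edge length, chosen only for illustration


class SimulatedCrash(Exception):
    """Stands in for a power failure in this sketch."""


def tile_origins(n, tile=TILE):
    """Return the (row, col) origin of every output tile of an n x n matrix."""
    return [(i, j) for i in range(0, n, tile) for j in range(0, n, tile)]


def compute_tile(a, b, c, i0, j0, tile=TILE):
    """Recompute one output tile from the read-only inputs.

    The tile is overwritten rather than accumulated into, so running this
    again after a torn write yields the same result (it is idempotent).
    """
    n = len(a)
    for i in range(i0, min(i0 + tile, n)):
        for j in range(j0, min(j0 + tile, n)):
            c[i][j] = sum(a[i][k] * b[k][j] for k in range(n))


def run(a, b, c, done, crash_after=None):
    """Compute every tile not yet marked done; optionally fail midway."""
    finished = 0
    for origin in tile_origins(len(a)):
        if origin in done:
            continue
        if crash_after is not None and finished == crash_after:
            # Leave a partially written tile behind, then "lose power".
            c[origin[0]][origin[1]] = -1
            raise SimulatedCrash
        compute_tile(a, b, c, *origin)
        # Assumption: on real NVMM the tile data would have to be made
        # durable before this marker is; here both are ordinary memory.
        done.add(origin)
        finished += 1


def recover(a, b, c, done):
    """Find tiles without a completion marker and recompute only those."""
    pending = [t for t in tile_origins(len(a)) if t not in done]
    run(a, b, c, done)
    return pending
```

Calling `run(a, b, c, done, crash_after=3)` on 4 x 4 inputs fails while working on the fourth tile. A following `recover(a, b, c, done)` recomputes only that tile, and `c` then holds the full product. No copy of `c` is ever saved, which is the trade the abstract describes: fewer extra writes in exchange for a recovery step that has to work out what was left unfinished.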

Original language: English
Article number: 18
Journal: ACM Transactions on Architecture and Code Optimization
Volume: 16
Issue number: 2
DOI: 10.1145/3323091
Publication status: Published - May 2019

Fingerprint

Data storage equipment
Durability
Computer systems
Experiments
Hardware
Recovery

Keywords

  • Computer architecture
  • Emerging memory technologies
  • Memory systems

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Hardware and Architecture

Cite this

Efficient checkpointing with recompute scheme for non-volatile main memory. / Alshboul, Mohammad; Elnawawy, Hussein; Elkhouly, Reem; Kimura, Keiji; Tuck, James; Solihin, Yan.

In: ACM Transactions on Architecture and Code Optimization, Vol. 16, No. 2, 18, 05.2019.

Alshboul, Mohammad ; Elnawawy, Hussein ; Elkhouly, Reem ; Kimura, Keiji ; Tuck, James ; Solihin, Yan. / Efficient checkpointing with recompute scheme for non-volatile main memory. In: ACM Transactions on Architecture and Code Optimization. 2019 ; Vol. 16, No. 2.
@article{61842fde4aae40d9a9163fdbd7202ad3,
title = "Efficient checkpointing with recompute scheme for non-volatile main memory",
abstract = "Future main memory will likely include Non-Volatile Memory. Non-Volatile Main Memory (NVMM) provides an opportunity to rethink checkpointing strategies for providing failure safety to applications. While there are many checkpointing and logging schemes in the literature, their use must be revisited as they incur high execution time overheads as well as a large number of additional writes to NVMM, which may significantly impact write endurance. In this article, we propose a novel recompute-based failure safety approach and demonstrate its applicability to loop-based code. Rather than keeping a fully consistent logging state, we only log enough state to enable recomputation. Upon a failure, our approach recovers to a consistent state by determining which parts of the computation were not completed and recomputing them. Effectively, our approach removes the need to keep checkpoints or logs, thus reducing execution time overheads and improving NVMM write endurance, at the expense of more complex recovery. We compare our new approach against logging and checkpointing on five scientific workloads, including tiled matrix multiplication, on a computer system model that was built on gem5 and supports Intel PMEM instruction extensions. For tiled matrix multiplication, our recompute approach incurs an execution time overhead of only 5{\%}, in contrast to 8{\%} overhead with logging and 207{\%} overhead with checkpointing. Furthermore, recompute only adds 7{\%} additional NVMM writes, compared to 111{\%} with logging and 330{\%} with checkpointing. We also conduct experiments on real hardware, allowing us to run our workloads to completion while varying the number of threads used for computation. These experiments substantiate our simulation-based observations and provide a sensitivity study and performance comparison between the Recompute Scheme and Naive Checkpointing.",
keywords = "Computer architecture, Emerging memory technologies, Memory systems",
author = "Mohammad Alshboul and Hussein Elnawawy and Reem Elkhouly and Keiji Kimura and James Tuck and Yan Solihin",
year = "2019",
month = "5",
doi = "10.1145/3323091",
language = "English",
volume = "16",
journal = "ACM Transactions on Architecture and Code Optimization",
issn = "1544-3566",
publisher = "Association for Computing Machinery (ACM)",
number = "2",

}

TY - JOUR

T1 - Efficient checkpointing with recompute scheme for non-volatile main memory

AU - Alshboul, Mohammad

AU - Elnawawy, Hussein

AU - Elkhouly, Reem

AU - Kimura, Keiji

AU - Tuck, James

AU - Solihin, Yan

PY - 2019/5

Y1 - 2019/5

N2 - Future main memory will likely include Non-Volatile Memory. Non-Volatile Main Memory (NVMM) provides an opportunity to rethink checkpointing strategies for providing failure safety to applications. While there are many checkpointing and logging schemes in the literature, their use must be revisited as they incur high execution time overheads as well as a large number of additional writes to NVMM, which may significantly impact write endurance. In this article, we propose a novel recompute-based failure safety approach and demonstrate its applicability to loop-based code. Rather than keeping a fully consistent logging state, we only log enough state to enable recomputation. Upon a failure, our approach recovers to a consistent state by determining which parts of the computation were not completed and recomputing them. Effectively, our approach removes the need to keep checkpoints or logs, thus reducing execution time overheads and improving NVMM write endurance, at the expense of more complex recovery. We compare our new approach against logging and checkpointing on five scientific workloads, including tiled matrix multiplication, on a computer system model that was built on gem5 and supports Intel PMEM instruction extensions. For tiled matrix multiplication, our recompute approach incurs an execution time overhead of only 5%, in contrast to 8% overhead with logging and 207% overhead with checkpointing. Furthermore, recompute only adds 7% additional NVMM writes, compared to 111% with logging and 330% with checkpointing. We also conduct experiments on real hardware, allowing us to run our workloads to completion while varying the number of threads used for computation. These experiments substantiate our simulation-based observations and provide a sensitivity study and performance comparison between the Recompute Scheme and Naive Checkpointing.

AB - Future main memory will likely include Non-Volatile Memory. Non-Volatile Main Memory (NVMM) provides an opportunity to rethink checkpointing strategies for providing failure safety to applications. While there are many checkpointing and logging schemes in the literature, their use must be revisited as they incur high execution time overheads as well as a large number of additional writes to NVMM, which may significantly impact write endurance. In this article, we propose a novel recompute-based failure safety approach and demonstrate its applicability to loop-based code. Rather than keeping a fully consistent logging state, we only log enough state to enable recomputation. Upon a failure, our approach recovers to a consistent state by determining which parts of the computation were not completed and recomputing them. Effectively, our approach removes the need to keep checkpoints or logs, thus reducing execution time overheads and improving NVMM write endurance, at the expense of more complex recovery. We compare our new approach against logging and checkpointing on five scientific workloads, including tiled matrix multiplication, on a computer system model that was built on gem5 and supports Intel PMEM instruction extensions. For tiled matrix multiplication, our recompute approach incurs an execution time overhead of only 5%, in contrast to 8% overhead with logging and 207% overhead with checkpointing. Furthermore, recompute only adds 7% additional NVMM writes, compared to 111% with logging and 330% with checkpointing. We also conduct experiments on real hardware, allowing us to run our workloads to completion while varying the number of threads used for computation. These experiments substantiate our simulation-based observations and provide a sensitivity study and performance comparison between the Recompute Scheme and Naive Checkpointing.

KW - Computer architecture

KW - Emerging memory technologies

KW - Memory systems

UR - http://www.scopus.com/inward/record.url?scp=85069190431&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85069190431&partnerID=8YFLogxK

U2 - 10.1145/3323091

DO - 10.1145/3323091

M3 - Article

AN - SCOPUS:85069190431

VL - 16

JO - ACM Transactions on Architecture and Code Optimization

JF - ACM Transactions on Architecture and Code Optimization

SN - 1544-3566

IS - 2

M1 - 18

ER -