Design and performance evaluation for Hadoop clusters on virtualized environment

Masakuni Ishii, Jungkyu Han, Hiroyuki Makino

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

18 Citations (Scopus)

Abstract

Hadoop, an implementation of Google's MapReduce, is widely used these days for big data analysis. Yahoo Inc. operated 25 PB of data with 25,000 nodes in 2010. Resource management for such a large number of nodes is quite difficult in terms of configuration, deployment, and efficient resource utilization. By deploying virtual machines (VMs), Hadoop management becomes much easier. Amazon has already released Hadoop on a Xen-virtualized environment as Elastic MapReduce. However, Hadoop on VM clusters suffers performance degradation due to virtualization overhead, so it is important to minimize that overhead. We build a Hadoop performance model and examine how performance is affected by changing the VM configuration, the allocation of VMs over physical machines, and the multiplicity of jobs. We find that the performance of I/O-intensive jobs is more sensitive to virtualization overhead than that of CPU-intensive jobs. For I/O-intensive jobs, the performance degradation caused by VM configuration changes is at most 55%, and that caused by allocation changes is at most 18%. For I/O-intensive jobs, the best practice is to increase the number of VMs rather than the number of VCPUs per VM, to allocate VMs widely over physical servers, and to decrease the number of simultaneously executed jobs. The main factor in virtualization overhead is disk I/O shared by multiple VMs on a physical server.
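
As a rough illustration of the placement guideline in the abstract (many small VMs with few VCPUs each, spread widely over physical servers), the sketch below assigns Hadoop worker VMs to hosts round-robin. The host names, VM naming scheme, and the one-worker-per-VM assumption are illustrative only and are not taken from the paper.

# Illustrative sketch only (not from the paper): place Hadoop worker VMs
# round-robin over physical hosts, keeping each VM small (1 VCPU) per the
# best practice reported for I/O-intensive jobs.
from itertools import cycle


def place_vms(num_worker_vms, physical_hosts, vcpus_per_vm=1):
    """Return a dict mapping each physical host to its list of VM specs.

    Assumptions (hypothetical, for illustration): one Hadoop worker
    (DataNode/TaskTracker) per VM, identical physical hosts, and a
    simple round-robin spread so VMs are allocated widely over servers.
    """
    placement = {host: [] for host in physical_hosts}
    for i, host in zip(range(num_worker_vms), cycle(physical_hosts)):
        placement[host].append({"name": "hadoop-worker-%02d" % i,
                                "vcpus": vcpus_per_vm})
    return placement


if __name__ == "__main__":
    hosts = ["pm01", "pm02", "pm03", "pm04"]  # hypothetical host names
    for host, vms in sorted(place_vms(8, hosts).items()):
        print(host, [vm["name"] for vm in vms])

The point of the sketch is only the shape of the placement: more, smaller VMs spread across servers rather than fewer large VMs packed onto one machine, which the abstract identifies as the better choice when shared disk I/O is the bottleneck.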

Original language: English
Title of host publication: International Conference on Information Networking 2013, ICOIN 2013
Pages: 244-249
Number of pages: 6
DOI: 10.1109/ICOIN.2013.6496384
ISBN: 9781467357401
Publication status: Published - 2013
Externally published: Yes
Event: 27th International Conference on Information Networking, ICOIN 2013 - Bangkok
Duration: 2013 Jan 27 - 2013 Jan 30

Keywords

  • cluster
  • Hadoop performance evaluation
  • KVM
  • virtual machine
  • virtualized environment

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems

Cite this

Ishii, M., Han, J., & Makino, H. (2013). Design and performance evaluation for Hadoop clusters on virtualized environment. In International Conference on Information Networking 2013, ICOIN 2013 (pp. 244-249). [6496384] https://doi.org/10.1109/ICOIN.2013.6496384
