Bootstrapping K-means for big data analysis

Jungkyu Han, Min Luo

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

In recent years, 'big data' has become a popular term in industry. Distributed data processing middleware such as Hadoop enables companies to extract useful information from their big data. However, retrieving information from newly available big data remains difficult even with the aid of distributed data processing, because the task requires many cycles of hypothesis formulation and testing owing to the lack of prior knowledge about the data. The k-means algorithm is one of the popular algorithms used in the early stages of data mining because of its speed and unsupervised nature. With big data, however, even k-means is not fast enough to produce the desired result within the expected time. In this paper, we propose a fast k-means method based on the statistical bootstrapping technique. The proposed method achieves a roughly 100-fold speedup with similar accuracy compared to Lloyd's algorithm, the most popular k-means algorithm in industry.
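The abstract does not spell out how bootstrapping is combined with k-means, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes that Lloyd's algorithm is run on several small samples drawn with replacement, and that the resulting centroids are pooled and clustered once more to obtain the final centers. The function names and the parameters `n_boot` and `sample_size` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))


def lloyd(points, k, rng, iters=20):
    """Plain Lloyd iterations, started from k randomly chosen points."""
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        new_centers = []
        for j, g in enumerate(groups):
            if g:
                # Coordinate-wise mean of the points assigned to cluster j.
                new_centers.append([sum(c) / len(g) for c in zip(*g)])
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def bootstrap_kmeans(points, k, n_boot=10, sample_size=200, seed=0):
    """Assumed scheme: cluster small bootstrap samples, then cluster the pooled centroids."""
    rng = random.Random(seed)
    pooled = []
    for _ in range(n_boot):
        # Bootstrap sample: draw sample_size points with replacement.
        sample = [rng.choice(points) for _ in range(sample_size)]
        pooled.extend(lloyd(sample, k, rng))
    # Final pass over n_boot * k pooled centroids instead of the full data set.
    return lloyd(pooled, k, rng)
```

In this sketch each Lloyd run touches only `sample_size` points, so the cost depends on `n_boot * sample_size` rather than on the size of the full data set. That is the intuition behind a sampling-based speedup, though the 100-fold figure in the abstract refers to the authors' own method and experiments. A call such as `bootstrap_kmeans(data, k=3)` returns a list of `k` centers, each a list of coordinates.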

Original language: English
Title of host publication: Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 591-596
Number of pages: 6
ISBN (Electronic): 9781479956654
DOI: https://doi.org/10.1109/BigData.2014.7004279
Publication status: Published - 2015 Jan 7
Externally published: Yes
Event: 2nd IEEE International Conference on Big Data, IEEE Big Data 2014 - Washington
Duration: 2014 Oct 27 to 2014 Oct 30

Fingerprint

Middleware
Information retrieval
Data mining
Big data
Industry

Keywords

  • Big data
  • Bootstrapping
  • Bootstrap
  • Clustering
  • k-means

ASJC Scopus subject areas

  • Artificial Intelligence
  • Information Systems

Cite this

Han, J., & Luo, M. (2015). Bootstrapping K-means for big data analysis. In Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014 (pp. 591-596). [7004279] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BigData.2014.7004279

@inproceedings{51f6e0c635b445f1b5fd25f5f66f51af,
title = "Bootstrapping K-means for big data analysis",
keywords = "Big data, Bootstrapping, Bootstrap, Clustering, k-means",
author = "Jungkyu Han and Min Luo",
year = "2015",
month = "1",
day = "7",
doi = "10.1109/BigData.2014.7004279",
language = "English",
pages = "591--596",
booktitle = "Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
address = "United States",

}

TY - GEN
T1 - Bootstrapping K-means for big data analysis
AU - Han, Jungkyu
AU - Luo, Min
PY - 2015/1/7
Y1 - 2015/1/7
KW - Big data
KW - Bootstrapping
KW - Bootstrap
KW - Clustering
KW - k-means
UR - http://www.scopus.com/inward/record.url?scp=84921725140&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84921725140&partnerID=8YFLogxK
U2 - 10.1109/BigData.2014.7004279
DO - 10.1109/BigData.2014.7004279
M3 - Conference contribution
SP - 591
EP - 596
BT - Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014
PB - Institute of Electrical and Electronics Engineers Inc.
ER -