Generating similarity cluster of Indonesian languages with semi-supervised clustering

Arbi Haza Nasution, Yohei Murakami, Toru Ishida

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Lexicostatistic and language similarity clusters are useful for computational linguistics research that depends on language similarity or cognate recognition. Nevertheless, no lexicostatistic or language similarity clusters of Indonesian ethnic languages have been published. We formulate an approach to creating language similarity clusters: we use the ASJP database to generate a language similarity matrix, generate hierarchical clusters with complete linkage and mean linkage clustering, and then extract two stable clusters with high language similarity. We introduce an extended k-means clustering method based on semi-supervised learning to evaluate the stability level of the hierarchical stable clusters, that is, whether they remain grouped together as the number of clusters changes. The higher the number of trials, the more likely we are to find the two hierarchical stable clusters distinctly in the generated k clusters. However, in all five experiments, the stability level of the two hierarchical stable clusters is highest at 5 clusters. Therefore, we take the 5 clusters as the best clustering of Indonesian ethnic languages. Finally, we plot the generated 5 clusters on a geographical map.
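The hierarchical step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the five placeholder languages, their similarity scores, the conversion of similarity to distance as 1 - similarity, and the reading of "stable" as "groups on which complete and mean linkage agree" are all assumptions made for illustration, and no ASJP data is used.

```python
from itertools import combinations

# Hypothetical similarity scores in [0, 1] for five placeholder languages.
# These numbers are invented for illustration; they are not ASJP data.
LANGS = ["A", "B", "C", "D", "E"]
SIM = {
    ("A", "B"): 0.80, ("A", "C"): 0.30, ("A", "D"): 0.20, ("A", "E"): 0.10,
    ("B", "C"): 0.35, ("B", "D"): 0.25, ("B", "E"): 0.15,
    ("C", "D"): 0.70, ("C", "E"): 0.40,
    ("D", "E"): 0.45,
}


def dist(a, b):
    """Distance between two languages, assumed here to be 1 - similarity."""
    return 1.0 - SIM[(a, b) if (a, b) in SIM else (b, a)]


def linkage_distance(c1, c2, method):
    """Distance between two clusters under the chosen linkage."""
    pair_dists = [dist(a, b) for a in c1 for b in c2]
    if method == "complete":   # farthest pair of members
        return max(pair_dists)
    if method == "mean":       # average over all cross-cluster pairs
        return sum(pair_dists) / len(pair_dists)
    raise ValueError(f"unknown linkage: {method}")


def agglomerate(items, k, method):
    """Repeatedly merge the two closest clusters until k clusters remain."""
    clusters = [frozenset([x]) for x in items]
    while len(clusters) > k:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: linkage_distance(pair[0], pair[1], method),
        )
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return set(clusters)


complete = agglomerate(LANGS, 3, "complete")
mean = agglomerate(LANGS, 3, "mean")

# Assumed notion of "stable": multi-member groups found by both linkages.
agreed = {c for c in complete & mean if len(c) > 1}
print(sorted(sorted(c) for c in agreed))
```

With the toy scores above, both linkages produce the same pairs of closely related languages, which is the kind of agreement the abstract relies on before evaluating stability with k-means.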

Original language: English
Pages (from-to): 531-538
Number of pages: 8
Journal: International Journal of Electrical and Computer Engineering
Volume: 9
Issue number: 1
DOI: 10.11591/ijece.v9i1.pp531-538
Publication status: Published - 2019 Feb 1
Externally published: Yes

Fingerprint

Computational linguistics
Supervised learning
Experiments

Keywords

  • Hierarchical clustering
  • K-means clustering
  • Language similarity
  • Lexicostatistic
  • Semi-supervised clustering

ASJC Scopus subject areas

  • Computer Science(all)
  • Electrical and Electronic Engineering

Cite this

Generating similarity cluster of Indonesian languages with semi-supervised clustering. / Nasution, Arbi Haza; Murakami, Yohei; Ishida, Toru.

In: International Journal of Electrical and Computer Engineering, Vol. 9, No. 1, 01.02.2019, p. 531-538.

Nasution, Arbi Haza ; Murakami, Yohei ; Ishida, Toru. / Generating similarity cluster of Indonesian languages with semi-supervised clustering. In: International Journal of Electrical and Computer Engineering. 2019 ; Vol. 9, No. 1. pp. 531-538.
@article{365ea589a09646d1b25e67bef6b631a8,
title = "Generating similarity cluster of Indonesian languages with semi-supervised clustering",
abstract = "Lexicostatistic and language similarity clusters are useful for computational linguistics research that depends on language similarity or cognate recognition. Nevertheless, no lexicostatistic or language similarity clusters of Indonesian ethnic languages have been published. We formulate an approach to creating language similarity clusters: we use the ASJP database to generate a language similarity matrix, generate hierarchical clusters with complete linkage and mean linkage clustering, and then extract two stable clusters with high language similarity. We introduce an extended k-means clustering method based on semi-supervised learning to evaluate the stability level of the hierarchical stable clusters, that is, whether they remain grouped together as the number of clusters changes. The higher the number of trials, the more likely we are to find the two hierarchical stable clusters distinctly in the generated k clusters. However, in all five experiments, the stability level of the two hierarchical stable clusters is highest at 5 clusters. Therefore, we take the 5 clusters as the best clustering of Indonesian ethnic languages. Finally, we plot the generated 5 clusters on a geographical map.",
keywords = "Hierarchical clustering, K-means clustering, Language similarity, Lexicostatistic, Semi-supervised clustering",
author = "Nasution, {Arbi Haza} and Yohei Murakami and Toru Ishida",
year = "2019",
month = "2",
day = "1",
doi = "10.11591/ijece.v9i1.pp531-538",
language = "English",
volume = "9",
pages = "531--538",
journal = "International Journal of Electrical and Computer Engineering",
issn = "2088-8708",
publisher = "Institute of Advanced Engineering and Science (IAES)",
number = "1",

}

TY - JOUR

T1 - Generating similarity cluster of Indonesian languages with semi-supervised clustering

AU - Nasution, Arbi Haza

AU - Murakami, Yohei

AU - Ishida, Toru

PY - 2019/2/1

Y1 - 2019/2/1

AB - Lexicostatistic and language similarity clusters are useful for computational linguistics research that depends on language similarity or cognate recognition. Nevertheless, no lexicostatistic or language similarity clusters of Indonesian ethnic languages have been published. We formulate an approach to creating language similarity clusters: we use the ASJP database to generate a language similarity matrix, generate hierarchical clusters with complete linkage and mean linkage clustering, and then extract two stable clusters with high language similarity. We introduce an extended k-means clustering method based on semi-supervised learning to evaluate the stability level of the hierarchical stable clusters, that is, whether they remain grouped together as the number of clusters changes. The higher the number of trials, the more likely we are to find the two hierarchical stable clusters distinctly in the generated k clusters. However, in all five experiments, the stability level of the two hierarchical stable clusters is highest at 5 clusters. Therefore, we take the 5 clusters as the best clustering of Indonesian ethnic languages. Finally, we plot the generated 5 clusters on a geographical map.

KW - Hierarchical clustering

KW - K-means clustering

KW - Language similarity

KW - Lexicostatistic

KW - Semi-supervised clustering

UR - http://www.scopus.com/inward/record.url?scp=85066303482&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85066303482&partnerID=8YFLogxK

U2 - 10.11591/ijece.v9i1.pp531-538

DO - 10.11591/ijece.v9i1.pp531-538

M3 - Article

AN - SCOPUS:85066303482

VL - 9

SP - 531

EP - 538

JO - International Journal of Electrical and Computer Engineering

JF - International Journal of Electrical and Computer Engineering

SN - 2088-8708

IS - 1

ER -