Graph-based word clustering using a web search engine

Yutaka Matsuo, Takeshi Sakaki, Kôki Uchiyama, Mitsuru Ishizuka

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

63 Citations (Scopus)

Abstract

Word clustering is important for automatic thesaurus construction, text classification, and word sense disambiguation. Recently, several studies have reported using the web as a corpus. This paper proposes an unsupervised algorithm for word clustering based on a word similarity measure computed from web counts. Each pair of words is submitted as a query to a search engine, and the resulting hit counts form a co-occurrence matrix. By calculating the similarity of words, a word co-occurrence graph is obtained. A new kind of graph clustering algorithm called Newman clustering is applied to identify word clusters efficiently. Evaluations are made on two sets of word groups derived from a web directory and WordNet.
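
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption, not the paper's implementation: the hit counts are invented toy numbers standing in for search-engine results, pointwise mutual information (PMI) is used as one possible similarity measure, the threshold of 1.0 is arbitrary, and "Newman clustering" is read as greedy agglomerative modularity maximization, simplified to stop as soon as no merge raises modularity. The function names are hypothetical.

```python
import math
from itertools import combinations

def pmi_similarity(hits, pair_hits, total_pages):
    """PMI from page counts; 0.0 when a pair never co-occurs (assumed measure)."""
    sim = {}
    for (a, b), n_ab in pair_hits.items():
        if n_ab == 0:
            sim[(a, b)] = 0.0
        else:
            sim[(a, b)] = math.log(n_ab * total_pages / (hits[a] * hits[b]))
    return sim

def build_graph(words, sim, threshold):
    """Unweighted graph: link two words when their similarity exceeds the threshold."""
    adj = {w: set() for w in words}
    for (a, b), s in sim.items():
        if s > threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def modularity(adj, communities):
    """Q = sum over communities of (inside edges / m) - (degree sum / 2m)^2."""
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    if m == 0:
        return 0.0
    q = 0.0
    for c in communities:
        inside = sum(1 for u in c for v in adj[u] if v in c) / 2
        degree = sum(len(adj[u]) for u in c)
        q += inside / m - (degree / (2 * m)) ** 2
    return q

def greedy_modularity_clusters(adj):
    """Repeatedly merge the connected pair of communities that raises Q the most."""
    communities = [frozenset([w]) for w in adj]
    best_q = modularity(adj, communities)
    while len(communities) > 1:
        best_merge = None
        for c1, c2 in combinations(communities, 2):
            if not any(v in c2 for u in c1 for v in adj[u]):
                continue  # merging unconnected communities cannot raise Q
            merged = [c for c in communities if c not in (c1, c2)] + [c1 | c2]
            q = modularity(adj, merged)
            if best_merge is None or q > best_merge[0]:
                best_merge = (q, merged)
        if best_merge is None or best_merge[0] <= best_q:
            break  # simplification: stop at the first non-improving step
        best_q, communities = best_merge
    return sorted(sorted(c) for c in communities)

# Toy data (invented): two word groups plus one cross-group "bridge" pair.
groups = [["apple", "banana", "cherry"], ["python", "java", "rust"]]
words = [w for g in groups for w in g]
group_of = {w: i for i, g in enumerate(groups) for w in g}
hits = {w: 1000 for w in words}
pair_hits = {(a, b): (100 if group_of[a] == group_of[b] else 1)
             for a, b in combinations(words, 2)}
pair_hits[("cherry", "python")] = 50
total_pages = 1_000_000

sim = pmi_similarity(hits, pair_hits, total_pages)
graph = build_graph(words, sim, threshold=1.0)
clusters = greedy_modularity_clusters(graph)
print(clusters)  # [['apple', 'banana', 'cherry'], ['java', 'python', 'rust']]
```

In this toy graph the bridge between "cherry" and "python" links the two triangles, yet merging them would lower modularity, so the two groups stay separate. A real run would replace the invented counts with hit counts returned by a search engine for each word and each word pair.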

Original language: English
Title of host publication: COLING/ACL 2006 - EMNLP 2006: 2006 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Pages: 542-550
Number of pages: 9
Publication status: Published - 2006
Externally published: Yes
Event: 11th Conference on Empirical Methods in Natural Language Processing, EMNLP 2006, Held in Conjunction with COLING/ACL 2006 - Sydney, NSW
Duration: 2006 Jul 22 to 2006 Jul 23

Other

Other: 11th Conference on Empirical Methods in Natural Language Processing, EMNLP 2006, Held in Conjunction with COLING/ACL 2006
City: Sydney, NSW
Period: 2006 Jul 22 to 2006 Jul 23

Fingerprint

Thesauri
Search engines
Clustering algorithms

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems

Cite this

Matsuo, Y., Sakaki, T., Uchiyama, K., & Ishizuka, M. (2006). Graph-based word clustering using a web search engine. In COLING/ACL 2006 - EMNLP 2006: 2006 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (pp. 542-550)

Graph-based word clustering using a web search engine. / Matsuo, Yutaka; Sakaki, Takeshi; Uchiyama, Kôki; Ishizuka, Mitsuru.

COLING/ACL 2006 - EMNLP 2006: 2006 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference. 2006. p. 542-550.

Matsuo, Y, Sakaki, T, Uchiyama, K & Ishizuka, M 2006, Graph-based word clustering using a web search engine. in COLING/ACL 2006 - EMNLP 2006: 2006 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference. pp. 542-550, 11th Conference on Empirical Methods in Natural Language Processing, EMNLP 2006, Held in Conjunction with COLING/ACL 2006, Sydney, NSW, 06/7/22.
Matsuo Y, Sakaki T, Uchiyama K, Ishizuka M. Graph-based word clustering using a web search engine. In COLING/ACL 2006 - EMNLP 2006: 2006 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference. 2006. p. 542-550
Matsuo, Yutaka ; Sakaki, Takeshi ; Uchiyama, Kôki ; Ishizuka, Mitsuru. / Graph-based word clustering using a web search engine. COLING/ACL 2006 - EMNLP 2006: 2006 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference. 2006. pp. 542-550
@inproceedings{6224b64bbe76470d8e7480dbd8131633,
title = "Graph-based word clustering using a web search engine",
abstract = "Word clustering is important for automatic thesaurus construction, text classification, and word sense disambiguation. Recently, several studies have reported using the web as a corpus. This paper proposes an unsupervised algorithm for word clustering based on a word similarity measure by web counts. Each pair of words is queried to a search engine, which produces a co-occurrence matrix. By calculating the similarity of words, a word cooccurrence graph is obtained. A new kind of graph clustering algorithm called Newman clustering is applied for efficiently identifying word clusters. Evaluations are made on two sets of word groups derived from a web directory and WordNet.",
author = "Yutaka Matsuo and Takeshi Sakaki and K{\^o}ki Uchiyama and Mitsuru Ishizuka",
year = "2006",
language = "English",
isbn = "1932432736",
pages = "542--550",
booktitle = "COLING/ACL 2006 - EMNLP 2006: 2006 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference",

}

TY - GEN

T1 - Graph-based word clustering using a web search engine

AU - Matsuo, Yutaka

AU - Sakaki, Takeshi

AU - Uchiyama, Kôki

AU - Ishizuka, Mitsuru

PY - 2006

Y1 - 2006

N2 - Word clustering is important for automatic thesaurus construction, text classification, and word sense disambiguation. Recently, several studies have reported using the web as a corpus. This paper proposes an unsupervised algorithm for word clustering based on a word similarity measure by web counts. Each pair of words is queried to a search engine, which produces a co-occurrence matrix. By calculating the similarity of words, a word cooccurrence graph is obtained. A new kind of graph clustering algorithm called Newman clustering is applied for efficiently identifying word clusters. Evaluations are made on two sets of word groups derived from a web directory and WordNet.

UR - http://www.scopus.com/inward/record.url?scp=80053343581&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80053343581&partnerID=8YFLogxK

M3 - Conference contribution

SN - 1932432736

SN - 9781932432732

SP - 542

EP - 550

BT - COLING/ACL 2006 - EMNLP 2006: 2006 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference

ER -