Abstract
Word clustering is important for automatic thesaurus construction, text classification, and word sense disambiguation. Recently, several studies have reported using the web as a corpus. This paper proposes an unsupervised algorithm for word clustering based on a word similarity measure computed from web counts. Each pair of words is submitted as a query to a search engine, and the results form a co-occurrence matrix. By calculating the similarity of words from this matrix, a word co-occurrence graph is obtained. A recent graph clustering algorithm called Newman clustering is then applied to identify word clusters efficiently. The method is evaluated on two sets of word groups, one derived from a web directory and one from WordNet.
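The pipeline in the abstract (pairwise counts, similarity, graph, clustering) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: pointwise mutual information is assumed as the similarity measure, the hit counts are hard-coded hypothetical values rather than live search engine results, and a simplified greedy modularity merge stands in for Newman clustering. All function names and numbers are made up for the example.

```python
import math
from itertools import combinations


def pmi(pair_count, count_a, count_b, total):
    """Pointwise mutual information from hit counts (assumed similarity measure)."""
    if pair_count == 0 or count_a == 0 or count_b == 0:
        return 0.0
    return math.log((pair_count * total) / (count_a * count_b))


def build_graph(words, single, pair, total, threshold):
    """Link two words with an unweighted edge when their PMI exceeds the threshold."""
    edges = set()
    for a, b in combinations(words, 2):
        count = pair.get((a, b), pair.get((b, a), 0))
        if pmi(count, single[a], single[b], total) > threshold:
            edges.add(frozenset((a, b)))
    return edges


def modularity(communities, edges):
    """Modularity Q of a partition of an unweighted, undirected graph."""
    m = len(edges)
    if m == 0:
        return 0.0
    q = 0.0
    for comm in communities:
        inside = sum(1 for e in edges if e <= comm)
        degree = sum(1 for e in edges for v in e if v in comm)
        q += inside / m - (degree / (2 * m)) ** 2
    return q


def greedy_modularity(words, edges):
    """Start from singletons and repeatedly merge the connected pair of
    communities that raises Q the most; stop when no merge improves Q."""
    communities = [frozenset([w]) for w in words]
    best_q = modularity(communities, edges)
    while len(communities) > 1:
        best = None
        for x, y in combinations(communities, 2):
            # Only communities joined by at least one edge are merge candidates.
            if not any(e & x and e & y for e in edges):
                continue
            candidate = [c for c in communities if c not in (x, y)] + [x | y]
            q = modularity(candidate, edges)
            if q > best_q + 1e-12 and (best is None or q > best[0]):
                best = (q, candidate)
        if best is None:
            break
        best_q, communities = best
    return communities, best_q


# Hypothetical hit counts; a real run would obtain these from a search engine.
words = ["printer", "ink", "toner", "guitar", "piano", "violin"]
total_pages = 1_000_000
single_counts = {w: 1000 for w in words}
pair_counts = {
    ("printer", "ink"): 100, ("printer", "toner"): 100, ("ink", "toner"): 100,
    ("guitar", "piano"): 100, ("guitar", "violin"): 100, ("piano", "violin"): 100,
    ("ink", "piano"): 1,  # weak cross-topic co-occurrence, filtered by the threshold
}

edges = build_graph(words, single_counts, pair_counts, total_pages, threshold=1.0)
clusters, q = greedy_modularity(words, edges)
for cluster in clusters:
    print(sorted(cluster))
```

The quadratic merge search here is chosen for readability only; it recomputes modularity from scratch for every candidate merge, so it would not scale to the graph sizes that motivate an efficient clustering method.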
Original language | English |
---|---|
Title of host publication | COLING/ACL 2006 - EMNLP 2006: 2006 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference |
Pages | 542-550 |
Number of pages | 9 |
Publication status | Published - 2006 |
Externally published | Yes |
Event | 11th Conference on Empirical Methods in Natural Language Processing, EMNLP 2006, Held in Conjunction with COLING/ACL 2006 - Sydney, NSW; Duration: 2006 Jul 22 → 2006 Jul 23 |
Other
Other | 11th Conference on Empirical Methods in Natural Language Processing, EMNLP 2006, Held in Conjunction with COLING/ACL 2006 |
---|---|
City | Sydney, NSW |
Period | 2006 Jul 22 → 2006 Jul 23 |
ASJC Scopus subject areas
- Computational Theory and Mathematics
- Computer Science Applications
- Information Systems