A proposal of extended cosine measure for distance metric learning in text classification

Kenta Mikawa*, Takashi Ishida, Masayuki Goto

*Corresponding author for this work

Research output: Conference contribution

15 Citations (Scopus)

Abstract

This paper discusses a new similarity measure between documents in a vector space model from the viewpoint of distance metric learning. Documents are represented as points in the vector space using the frequencies of the words that appear in each document. A similarity measure between two documents is useful for recognizing their relationship and can be applied to classification or clustering of the data. Cosine similarity and Euclidean distance are usually used to measure the similarity between points in Euclidean space. However, when the vector space model is applied to document analysis, these measures do not take the correlation among words appearing in the documents into account. In general, many words that appear in documents are correlated with one another, depending on sentence structure, topic, and subject. It is therefore effective to build a suitable metric on the vector space that takes word correlation into account in order to improve the performance of document classification and clustering. This paper presents a new method for acquiring a distance measure on the document vector space based on an extended cosine measure. In addition, a distance metric learning procedure is proposed to acquire an appropriate metric from the viewpoint of supervised learning. The effectiveness of the proposal is demonstrated by simulation experiments on text classification problems using customer reviews posted on a web site and newspaper articles.
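
The abstract does not state the formula for the extended cosine measure. As a rough illustration only, the sketch below assumes one common way of extending cosine similarity with a metric matrix: sim(x, y) = xᵀMy / (√(xᵀMx) · √(yᵀMy)), where M is a symmetric positive semidefinite matrix. The function names, the toy vectors, and the hand-picked matrix `correlated` are all hypothetical; in the paper the metric is learned from labeled data, which is not shown here.

```python
import math

def matvec(M, v):
    """Multiply a square matrix (given as a list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def bilinear(x, M, y):
    """Compute the bilinear form x^T M y."""
    return sum(x_i * z_i for x_i, z_i in zip(x, matvec(M, y)))

def extended_cosine(x, y, M):
    """Cosine similarity under a metric matrix M.

    Assumption: M is symmetric positive semidefinite, so the square
    roots below are well defined. With M equal to the identity this
    reduces to ordinary cosine similarity.
    """
    denom = math.sqrt(bilinear(x, M, x)) * math.sqrt(bilinear(y, M, y))
    if denom == 0.0:
        raise ValueError("a vector has zero norm under M")
    return bilinear(x, M, y) / denom

# Toy term-frequency vectors over a hypothetical 3-word vocabulary.
doc_a = [2.0, 0.0, 1.0]
doc_b = [0.0, 3.0, 1.0]

identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

# Hypothetical metric treating words 0 and 1 as positively correlated.
correlated = [[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]]

plain = extended_cosine(doc_a, doc_b, identity)
weighted = extended_cosine(doc_a, doc_b, correlated)
print(plain, weighted)
```

In this toy case the two documents share only one word, so the ordinary cosine is low. Once the off-diagonal entries of M encode a correlation between the first two words, the similarity rises, which is the kind of effect the abstract attributes to taking word correlation into account.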

Original language: English
Title of host publication: 2011 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2011 - Conference Digest
Pages: 1741-1746
Number of pages: 6
DOI
Publication status: Published - 23 Dec 2011
Event: 2011 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2011 - Anchorage, AK, United States
Duration: 9 Oct 2011 to 12 Oct 2011

Publication series

Name: Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics
ISSN (Print): 1062-922X

Other

Other: 2011 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2011
Country/Territory: United States
City: Anchorage, AK
Period: 9 Oct 2011 to 12 Oct 2011

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Control and Systems Engineering
  • Human-Computer Interaction
