An integrated approach to measuring semantic similarity between words using information available on the Web

Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

19 Citations (Scopus)

Abstract

Measuring semantic similarity between words is vital for various applications in natural language processing, such as language modeling, information retrieval, and document clustering. We propose a method that utilizes the information available on the Web to measure semantic similarity between a pair of words or entities. Using support vector machines, we integrate page counts for each word in the pair with lexico-syntactic patterns that occur among the top-ranking snippets for the AND query. Experimental results on the Miller-Charles benchmark data set show that the proposed measure outperforms all the existing web-based semantic similarity measures by a wide margin, achieving a correlation coefficient of 0.834. Moreover, the proposed semantic similarity measure significantly improves the accuracy (F-measure of 0.78) in a named entity clustering task, proving the capability of the proposed measure to capture semantic similarity using web content.
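The abstract does not state which formulas turn page counts into features. As a rough illustration of how a word pair might be scored from page counts alone, the sketch below computes a Jaccard coefficient over the counts for each word and for their AND query. The function name, its parameters, the cut-off argument, and the example counts are all assumptions made for illustration; this is not presented as the authors' exact measure.

```python
def web_jaccard(count_p: int, count_q: int, count_pq: int,
                min_cooccurrence: int = 0) -> float:
    """Jaccard coefficient over search-engine page counts (illustrative sketch).

    count_p  -- page count for the query "P"
    count_q  -- page count for the query "Q"
    count_pq -- page count for the conjunctive query "P AND Q"
    min_cooccurrence -- hypothetical cut-off: pairs that co-occur on this
                        many pages or fewer are scored 0 to damp noise
    """
    if count_pq <= min_cooccurrence:
        return 0.0
    # Pages containing P or Q, counted once each.
    denominator = count_p + count_q - count_pq
    if denominator <= 0:
        return 0.0
    return count_pq / denominator


# Hypothetical page counts, invented purely for this example.
score = web_jaccard(count_p=1000, count_q=500, count_pq=250)
print(score)
```

In a setup like the one the abstract describes, a score of this kind would be only one feature among several, combined with pattern-based features by a support vector machine rather than used on its own.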

Original language: English
Title of host publication: NAACL HLT 2007 - Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference
Pages: 340-347
Number of pages: 8
Publication status: Published - 2007
Externally published: Yes
Event: Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, NAACL HLT 2007 - Rochester, NY
Duration: 2007 Apr 22 to 2007 Apr 27

Other

Other: Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, NAACL HLT 2007
City: Rochester, NY
Period: 07/4/22 to 07/4/27

Fingerprint

available information
semantics
language
information retrieval
ranking
World Wide Web
Semantic Similarity
Entity

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language

Cite this

Bollegala, D., Matsuo, Y., & Ishizuka, M. (2007). An integrated approach to measuring semantic similarity between words using information available on the Web. In NAACL HLT 2007 - Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference (pp. 340-347)

@inproceedings{164a3e8a36d3473a9072740f3f602c29,
title = "An integrated approach to measuring semantic similarity between words using information available on the Web",
abstract = "Measuring semantic similarity between words is vital for various applications in natural language processing, such as language modeling, information retrieval, and document clustering. We propose a method that utilizes the information available on the Web to measure semantic similarity between a pair of words or entities. Using support vector machines, we integrate page counts for each word in the pair with lexico-syntactic patterns that occur among the top-ranking snippets for the AND query. Experimental results on the Miller-Charles benchmark data set show that the proposed measure outperforms all the existing web-based semantic similarity measures by a wide margin, achieving a correlation coefficient of 0.834. Moreover, the proposed semantic similarity measure significantly improves the accuracy (F-measure of 0.78) in a named entity clustering task, proving the capability of the proposed measure to capture semantic similarity using web content.",
author = "Danushka Bollegala and Yutaka Matsuo and Mitsuru Ishizuka",
year = "2007",
language = "English",
pages = "340--347",
booktitle = "NAACL HLT 2007 - Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference",

}

TY - GEN

T1 - An integrated approach to measuring semantic similarity between words using information available on the Web

AU - Bollegala, Danushka

AU - Matsuo, Yutaka

AU - Ishizuka, Mitsuru

PY - 2007

Y1 - 2007

AB - Measuring semantic similarity between words is vital for various applications in natural language processing, such as language modeling, information retrieval, and document clustering. We propose a method that utilizes the information available on the Web to measure semantic similarity between a pair of words or entities. Using support vector machines, we integrate page counts for each word in the pair with lexico-syntactic patterns that occur among the top-ranking snippets for the AND query. Experimental results on the Miller-Charles benchmark data set show that the proposed measure outperforms all the existing web-based semantic similarity measures by a wide margin, achieving a correlation coefficient of 0.834. Moreover, the proposed semantic similarity measure significantly improves the accuracy (F-measure of 0.78) in a named entity clustering task, proving the capability of the proposed measure to capture semantic similarity using web content.

UR - http://www.scopus.com/inward/record.url?scp=77249092619&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77249092619&partnerID=8YFLogxK

M3 - Conference contribution

SP - 340

EP - 347

BT - NAACL HLT 2007 - Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings of the Main Conference

ER -