Measuring semantic similarity between words using web search engines

Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

397 Citations (Scopus)

Abstract

Semantic similarity measures play important roles in information retrieval and Natural Language Processing. Previous work in Semantic Web-related applications such as community mining, relation extraction, and automatic metadata extraction has used various semantic similarity measures. Despite the usefulness of semantic similarity measures in these applications, robustly measuring semantic similarity between two words (or entities) remains a challenging task. We propose a robust semantic similarity measure that uses the information available on the Web to measure similarity between words or entities. The proposed method exploits page counts and text snippets returned by a Web search engine. We define various similarity scores for two given words P and Q, using the page counts for the queries P, Q, and P AND Q. Moreover, we propose a novel approach to compute semantic similarity using lexico-syntactic patterns automatically extracted from text snippets. These different similarity scores are integrated using support vector machines to obtain a robust semantic similarity measure. Experimental results on the Miller-Charles benchmark dataset show that the proposed measure outperforms all the existing Web-based semantic similarity measures by a wide margin, achieving a correlation coefficient of 0.834. Moreover, the proposed semantic similarity measure significantly improves the accuracy (F-measure of 0.78) in a community mining task and in an entity disambiguation task, thereby verifying the capability of the proposed measure to capture semantic similarity using Web content.
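
The abstract does not name the individual page-count scores, so the following is only a minimal sketch, assuming four standard co-occurrence measures (Jaccard, Dice, overlap, and pointwise mutual information) applied to the page counts for the queries P, Q, and P AND Q. The function names, the `total_pages` parameter, and the sample counts are hypothetical illustrations; no search engine is queried, and the snippet-pattern and support vector machine stages are not covered.

```python
import math


def web_jaccard(count_p, count_q, count_pq):
    """Jaccard coefficient over page counts: hits(P AND Q) / hits(P OR Q)."""
    union = count_p + count_q - count_pq
    return count_pq / union if union > 0 else 0.0


def web_dice(count_p, count_q, count_pq):
    """Dice coefficient over page counts: 2 * hits(P AND Q) / (hits(P) + hits(Q))."""
    total = count_p + count_q
    return 2 * count_pq / total if total > 0 else 0.0


def web_overlap(count_p, count_q, count_pq):
    """Overlap coefficient: hits(P AND Q) divided by the smaller single-word count."""
    smaller = min(count_p, count_q)
    return count_pq / smaller if smaller > 0 else 0.0


def web_pmi(count_p, count_q, count_pq, total_pages):
    """Pointwise mutual information (base 2), treating counts / total_pages as probabilities.

    total_pages is an assumed estimate of the number of indexed pages.
    Returns 0.0 when any count is zero, since the logarithm is undefined there.
    """
    if min(count_p, count_q, count_pq) <= 0 or total_pages <= 0:
        return 0.0
    p_p = count_p / total_pages
    p_q = count_q / total_pages
    p_pq = count_pq / total_pages
    return math.log2(p_pq / (p_p * p_q))


# Hypothetical page counts, chosen only to exercise the functions.
hits_p, hits_q, hits_pq, index_size = 1000, 500, 100, 1_000_000
scores = {
    "jaccard": web_jaccard(hits_p, hits_q, hits_pq),
    "dice": web_dice(hits_p, hits_q, hits_pq),
    "overlap": web_overlap(hits_p, hits_q, hits_pq),
    "pmi": web_pmi(hits_p, hits_q, hits_pq, index_size),
}
```

Each function takes only the three page counts (plus an index-size estimate for PMI), so the resulting scores could serve as numeric features for a downstream classifier of the kind the abstract mentions.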

Original language: English
Title of host publication: 16th International World Wide Web Conference, WWW2007
Pages: 757-766
Number of pages: 10
DOIs: https://doi.org/10.1145/1242572.1242675
Publication status: Published - 2007
Externally published: Yes
Event: 16th International World Wide Web Conference, WWW2007 - Banff, AB
Duration: 2007 May 8 → 2007 May 12

Other

Other: 16th International World Wide Web Conference, WWW2007
City: Banff, AB
Period: 07/5/8 → 07/5/12

Fingerprint

Search engines
Semantics
Syntactics
Semantic Web
Metadata
Information retrieval
Support vector machines

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Software

Cite this

Bollegala, D., Matsuo, Y., & Ishizuka, M. (2007). Measuring semantic similarity between words using web search engines. In 16th International World Wide Web Conference, WWW2007 (pp. 757-766) https://doi.org/10.1145/1242572.1242675

@inproceedings{def17df5ba514047bd1a199f7c014f3c,
title = "Measuring semantic similarity between words using web search engines",
author = "Danushka Bollegala and Yutaka Matsuo and Mitsuru Ishizuka",
year = "2007",
doi = "10.1145/1242572.1242675",
language = "English",
isbn = "1595936548",
pages = "757--766",
booktitle = "16th International World Wide Web Conference, WWW2007",

}

TY - GEN

T1 - Measuring semantic similarity between words using web search engines

AU - Bollegala, Danushka

AU - Matsuo, Yutaka

AU - Ishizuka, Mitsuru

PY - 2007

Y1 - 2007

UR - http://www.scopus.com/inward/record.url?scp=35348903881&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=35348903881&partnerID=8YFLogxK

U2 - 10.1145/1242572.1242675

DO - 10.1145/1242572.1242675

M3 - Conference contribution

SN - 1595936548

SN - 9781595936547

SP - 757

EP - 766

BT - 16th International World Wide Web Conference, WWW2007

ER -