Automatic annotation of ambiguous personal names on the web

Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka

Research output: Contribution to journal (Article)

2 Citations (Scopus)

Abstract

Personal name disambiguation is an important task in social network extraction, evaluation and integration of ontologies, information retrieval, cross-document coreference resolution and word sense disambiguation. We propose an unsupervised method to automatically annotate people with ambiguous names on the Web using automatically extracted keywords. Given an ambiguous personal name, first, we download text snippets for the given name from a Web search engine. We then represent each instance of the ambiguous name by a term-entity model (TEM), a model that we propose to represent the Web appearance of an individual. A TEM of a person captures named entities and attribute values that are useful to disambiguate that person from his or her namesakes (i.e., different people who share the same name). We then use group average agglomerative clustering to identify the instances of an ambiguous name that belong to the same person. Ideally, each cluster must represent a different namesake. However, in practice it is not possible to know the number of namesakes for a given ambiguous personal name in advance. To circumvent this problem, we propose a novel normalized cuts-based cluster stopping criterion to determine the different people on the Web for a given ambiguous name. Finally, we annotate each person with an ambiguous name using keywords selected from the clusters. We evaluate the proposed method on a data set of over 2500 documents covering 200 different people for 20 ambiguous names. Experimental results show that the proposed method outperforms numerous baselines and previously proposed name disambiguation methods. Moreover, the extracted keywords reduce ambiguity of a name in an information retrieval task, which underscores the usefulness of the proposed method in real-world scenarios.
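
The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: a plain bag-of-words vector stands in for the term-entity model, the linkage is the mean cosine similarity over cross-cluster pairs (one common reading of group-average linkage), and a fixed similarity threshold replaces the normalized cuts-based stopping criterion, which the abstract does not specify. The function names, the `threshold` parameter, and the example snippets are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_average(c1, c2, vectors):
    """Mean pairwise similarity between the members of two clusters."""
    total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def cluster_snippets(snippets, name, threshold=0.2):
    """Agglomeratively cluster snippets that mention an ambiguous name.

    Assumption: the ambiguous name's own tokens are dropped because they
    occur in every snippet and cannot separate namesakes.
    """
    name_tokens = set(name.lower().split())
    vectors = [
        Counter(t for t in s.lower().split() if t not in name_tokens)
        for s in snippets
    ]
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        # Find the most similar pair of clusters (x < y by construction).
        best_sim, x, y = max(
            (group_average(clusters[x], clusters[y], vectors), x, y)
            for x, y in combinations(range(len(clusters)), 2)
        )
        # Stand-in stopping rule: stop once the best merge is too weak.
        if best_sim < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


snippets = [
    "jim clark formula one racing driver scotland",
    "jim clark racing driver lotus formula one champion",
    "jim clark netscape founder silicon graphics",
    "jim clark founder netscape entrepreneur silicon valley",
]
print(cluster_snippets(snippets, "jim clark"))  # [[0, 1], [2, 3]]
```

In this toy input the two racing snippets and the two Netscape snippets share no terms once the name is removed, so the merge loop stops at two clusters, one per namesake. A fixed threshold is the simplest way to avoid specifying the number of namesakes in advance; the paper's own criterion would take its place in a faithful reproduction.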

Original language: English
Pages (from-to): 398-425
Number of pages: 28
Journal: Computational Intelligence
Volume: 28
Issue number: 3
DOI: 10.1111/j.1467-8640.2012.00449.x
Publication status: Published - Aug 2012
Externally published: Yes

Keywords

  • automatic annotation
  • clustering
  • name disambiguation
  • Web mining

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computational Mathematics

Cite this

Automatic annotation of ambiguous personal names on the web. / Bollegala, Danushka; Matsuo, Yutaka; Ishizuka, Mitsuru.

In: Computational Intelligence, Vol. 28, No. 3, 08.2012, p. 398-425.

@article{6bf44a37b9ea4080bdc40f9865139461,
title = "Automatic annotation of ambiguous personal names on the web",
keywords = "automatic annotation, clustering, name disambiguation, Web mining",
author = "Danushka Bollegala and Yutaka Matsuo and Mitsuru Ishizuka",
year = "2012",
month = "8",
doi = "10.1111/j.1467-8640.2012.00449.x",
language = "English",
volume = "28",
pages = "398--425",
journal = "Computational Intelligence",
issn = "0824-7935",
publisher = "Wiley-Blackwell",
number = "3",

}

TY - JOUR

T1 - Automatic annotation of ambiguous personal names on the web

AU - Bollegala, Danushka

AU - Matsuo, Yutaka

AU - Ishizuka, Mitsuru

PY - 2012/8

Y1 - 2012/8

KW - automatic annotation

KW - clustering

KW - name disambiguation

KW - Web mining

UR - http://www.scopus.com/inward/record.url?scp=84864775282&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84864775282&partnerID=8YFLogxK

U2 - 10.1111/j.1467-8640.2012.00449.x

DO - 10.1111/j.1467-8640.2012.00449.x

M3 - Article

AN - SCOPUS:84864775282

VL - 28

SP - 398

EP - 425

JO - Computational Intelligence

JF - Computational Intelligence

SN - 0824-7935

IS - 3

ER -