Personal name disambiguation is an important task in social network extraction, evaluation and integration of ontologies, information retrieval, cross-document coreference resolution and word sense disambiguation. We propose an unsupervised method to automatically annotate people with ambiguous names on the Web using automatically extracted keywords. Given an ambiguous personal name, first, we download text snippets for the given name from a Web search engine. We then represent each instance of the ambiguous name by a term-entity model (TEM), a model that we propose to represent the Web appearance of an individual. A TEM of a person captures named entities and attribute values that are useful to disambiguate that person from his or her namesakes (i.e., different people who share the same name). We then use group average agglomerative clustering to identify the instances of an ambiguous name that belong to the same person. Ideally, each cluster must represent a different namesake. However, in practice it is not possible to know the number of namesakes for a given ambiguous personal name in advance. To circumvent this problem, we propose a novel normalized cuts-based cluster stopping criterion to determine the different people on the Web for a given ambiguous name. Finally, we annotate each person with an ambiguous name using keywords selected from the clusters. We evaluate the proposed method on a data set of over 2500 documents covering 200 different people for 20 ambiguous names. Experimental results show that the proposed method outperforms numerous baselines and previously proposed name disambiguation methods. Moreover, the extracted keywords reduce ambiguity of a name in an information retrieval task, which underscores the usefulness of the proposed method in real-world scenarios.
ASJC Scopus subject areas