Detection of mergeable Wikipedia articles utilizing multiple similarity measures

Renzhi Wang, Mizuho Iwaihara

Research output: Contribution to journal › Article

Abstract

Wikipedia is the largest online encyclopedia, and its articles are edited by volunteers with differing viewpoints and writing styles. Sometimes two or more articles have different titles but cover exactly the same or strongly similar themes. Administrators and editors are expected to detect such article pairs and decide whether they should be merged. We call an article pair mergeable if it has been put up for discussion as a possible merge, and merged if the pair has actually been merged. In this paper, we propose a method to automatically determine whether an article pair is mergeable or merged. According to the Wikipedia guidelines for merging articles, in the duplicate case the two articles cover exactly the same content, while in the overlap case they cover related subjects with significant overlap. The content of an overlapping part is similar, but the wording of the two articles can differ extensively, so methods that exploit semantic relatedness are necessary. We consider various textual similarity and semantic relatedness measures. To integrate word embeddings trained on the target dataset with those trained on a large global corpus, we propose linear and non-linear combinations of multiple embedding results, as well as rebuilding word vectors, for evaluating semantic relatedness. We clarify the differences between our method and previous research on combining multiple word embeddings. We also handle overlap cases by computing the Jaccard similarity between the articles of a pair. We combine Jaccard similarity, the common-link article count, and word embedding-based relatedness to predict whether an article pair should be merged. We explore the relationship between segment-level (paragraph-level) similarity and mergeable/merged article pairs, and then propose Multimodal Similarity-Based Merge Prediction (MSBMP), which combines the proposed features using Random Forest to predict mergeable/merged article pairs. Our evaluations are performed on real mergeable and merged article pairs. MSBMP shows a clear improvement over the WikiSearch, TFIDF, and word embedding baselines.
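
As an illustration of the three feature types named in the abstract, the following Python sketch computes Jaccard similarity, a common-link count, and an embedding-based cosine relatedness for one article pair. It is a minimal sketch under stated assumptions, not the authors' implementation: the tokenizer, the function names, the weight `alpha`, and the use of a single vector per article are all hypothetical, since the abstract does not specify how the embeddings are combined or how the features are computed in detail.

```python
import math
import re


def tokens(text):
    """Lowercase alphanumeric word tokens as a set (a simplifying assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def combine_linear(local_vec, global_vec, alpha=0.5):
    """Hypothetical linear combination of a target-dataset vector and a
    global-corpus vector of the same dimension; alpha is an assumed weight."""
    return [alpha * l + (1 - alpha) * g for l, g in zip(local_vec, global_vec)]


def pair_features(text_a, text_b, links_a, links_b, vec_a, vec_b):
    """Feature vector for one article pair:
    [Jaccard similarity, common-link count, embedding cosine relatedness]."""
    return [
        jaccard(tokens(text_a), tokens(text_b)),
        len(set(links_a) & set(links_b)),
        cosine(vec_a, vec_b),
    ]
```

A feature vector of this shape, computed for each labeled article pair, could then be passed to a classifier such as Random Forest, which is the combination step the abstract describes for MSBMP.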

Original language: English
Pages (from-to): 178-191
Number of pages: 14
Journal: Journal of Information Processing
Volume: 28
Publication status: Published - 2020

Keywords

  • Mergeable article
  • Text mining
  • Wikipedia
  • Word embedding

ASJC Scopus subject areas

  • Computer Science (all)
