Estimating reference scopes of Wikipedia article inner-links

Renzhi Wang, Mizuho Iwaihara

Research output: Contribution to journal › Article

Abstract

Wikipedia is the largest online encyclopedia and is widely utilized as a machine-readable knowledge and semantic resource. Links within Wikipedia indicate that two linked articles, or parts of them, are topically related. Existing link detection methods focus on linking to article titles, because most links in Wikipedia point to article titles. However, a number of links in Wikipedia point to specific segments, such as paragraphs, because the whole article is too general and readers would find it hard to grasp the intention of the link. We propose a method to automatically predict whether a link target is a specific segment or the whole article, and to evaluate which segment is most relevant. We combine Latent Dirichlet Allocation (LDA) and Maximum Likelihood Estimation (MLE) to represent every segment as a vector, and then compute the similarity of each segment pair. We then utilize the variance, standard deviation, and other statistical features to produce prediction results. We also apply word embeddings to embed all segments into a semantic space and calculate cosine similarities between segment pairs. Finally, we train a Random Forest classifier to predict link scopes. Evaluations on Wikipedia articles show that an ensemble of the proposed features achieves the best results.
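The pipeline the abstract describes — pairwise segment similarities summarized by statistics such as variance and standard deviation, then fed to a Random Forest classifier — can be sketched roughly as follows. The toy vectors, dimensions, and the particular feature set here are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def similarity_features(segment_vectors):
    """Cosine similarity of every segment pair, summarized as statistics."""
    X = np.asarray(segment_vectors, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = unit @ unit.T
    # keep the upper triangle only, excluding the diagonal (self-similarity)
    iu = np.triu_indices(len(X), k=1)
    pair_sims = sims[iu]
    return np.array([pair_sims.mean(), pair_sims.var(), pair_sims.std(),
                     pair_sims.max(), pair_sims.min()])

rng = np.random.default_rng(0)
# toy data: 40 "articles", each with 5 segment vectors in a 16-dim space
features = np.stack([similarity_features(rng.normal(size=(5, 16)))
                     for _ in range(40)])
labels = rng.integers(0, 2, size=40)  # 1 = link targets a specific segment

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(features, labels)
preds = clf.predict(features)
```

In practice the segment vectors would come from LDA/MLE topic distributions or word embeddings as the paper proposes; here they are random placeholders so the feature-extraction and classification steps can run end to end.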

Original language: English
Pages (from-to): 562-570
Number of pages: 9
Journal: Journal of Information Processing
Volume: 26
DOI: 10.2197/ipsjjip.26.562
Publication status: Published - 2018 Jan 1

Keywords

  • LDA
  • Link suggestion
  • PMI
  • Wikipedia
  • Word embedding

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

Estimating reference scopes of Wikipedia article inner-links. / Wang, Renzhi; Iwaihara, Mizuho.

In: Journal of Information Processing, Vol. 26, 01.01.2018, p. 562-570.

Research output: Contribution to journal › Article

@article{5558880670734eb78d323155fe06b35d,
title = "Estimating reference scopes of Wikipedia article inner-links",
abstract = "Wikipedia is the largest online encyclopedia and is widely utilized as a machine-readable knowledge and semantic resource. Links within Wikipedia indicate that two linked articles, or parts of them, are topically related. Existing link detection methods focus on linking to article titles, because most links in Wikipedia point to article titles. However, a number of links in Wikipedia point to specific segments, such as paragraphs, because the whole article is too general and readers would find it hard to grasp the intention of the link. We propose a method to automatically predict whether a link target is a specific segment or the whole article, and to evaluate which segment is most relevant. We combine Latent Dirichlet Allocation (LDA) and Maximum Likelihood Estimation (MLE) to represent every segment as a vector, and then compute the similarity of each segment pair. We then utilize the variance, standard deviation, and other statistical features to produce prediction results. We also apply word embeddings to embed all segments into a semantic space and calculate cosine similarities between segment pairs. Finally, we train a Random Forest classifier to predict link scopes. Evaluations on Wikipedia articles show that an ensemble of the proposed features achieves the best results.",
keywords = "LDA, Link suggestion, PMI, Wikipedia, Word embedding",
author = "Renzhi Wang and Mizuho Iwaihara",
year = "2018",
month = "1",
day = "1",
doi = "10.2197/ipsjjip.26.562",
language = "English",
volume = "26",
pages = "562--570",
journal = "Journal of Information Processing",
issn = "0387-5806",
publisher = "Information Processing Society of Japan",

}

TY - JOUR

T1 - Estimating reference scopes of Wikipedia article inner-links

AU - Wang, Renzhi

AU - Iwaihara, Mizuho

PY - 2018/1/1

Y1 - 2018/1/1

N2 - Wikipedia is the largest online encyclopedia and is widely utilized as a machine-readable knowledge and semantic resource. Links within Wikipedia indicate that two linked articles, or parts of them, are topically related. Existing link detection methods focus on linking to article titles, because most links in Wikipedia point to article titles. However, a number of links in Wikipedia point to specific segments, such as paragraphs, because the whole article is too general and readers would find it hard to grasp the intention of the link. We propose a method to automatically predict whether a link target is a specific segment or the whole article, and to evaluate which segment is most relevant. We combine Latent Dirichlet Allocation (LDA) and Maximum Likelihood Estimation (MLE) to represent every segment as a vector, and then compute the similarity of each segment pair. We then utilize the variance, standard deviation, and other statistical features to produce prediction results. We also apply word embeddings to embed all segments into a semantic space and calculate cosine similarities between segment pairs. Finally, we train a Random Forest classifier to predict link scopes. Evaluations on Wikipedia articles show that an ensemble of the proposed features achieves the best results.

AB - Wikipedia is the largest online encyclopedia and is widely utilized as a machine-readable knowledge and semantic resource. Links within Wikipedia indicate that two linked articles, or parts of them, are topically related. Existing link detection methods focus on linking to article titles, because most links in Wikipedia point to article titles. However, a number of links in Wikipedia point to specific segments, such as paragraphs, because the whole article is too general and readers would find it hard to grasp the intention of the link. We propose a method to automatically predict whether a link target is a specific segment or the whole article, and to evaluate which segment is most relevant. We combine Latent Dirichlet Allocation (LDA) and Maximum Likelihood Estimation (MLE) to represent every segment as a vector, and then compute the similarity of each segment pair. We then utilize the variance, standard deviation, and other statistical features to produce prediction results. We also apply word embeddings to embed all segments into a semantic space and calculate cosine similarities between segment pairs. Finally, we train a Random Forest classifier to predict link scopes. Evaluations on Wikipedia articles show that an ensemble of the proposed features achieves the best results.

KW - LDA

KW - Link suggestion

KW - PMI

KW - Wikipedia

KW - Word embedding

UR - http://www.scopus.com/inward/record.url?scp=85052398479&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85052398479&partnerID=8YFLogxK

U2 - 10.2197/ipsjjip.26.562

DO - 10.2197/ipsjjip.26.562

M3 - Article

AN - SCOPUS:85052398479

VL - 26

SP - 562

EP - 570

JO - Journal of Information Processing

JF - Journal of Information Processing

SN - 0387-5806

ER -