Extracting representative phrases from Wikipedia article sections

Shan Liu, Mizuho Iwaihara

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

Wikipedia has become one of the most important tools for finding information. Because its long articles take time to read, and section titles are sometimes too short to summarize their content comprehensively, we aim to extract informative phrases that readers can refer to. Existing work on topic labelling performs well on document categorization but is inadequate at the granularity of detailed content. Moreover, existing keyphrase construction methods perform well only on very short texts. We therefore extract phrases that represent the content of a target section well relative to the other sections of the same Wikipedia article. We also incorporate related external articles to enlarge the set of candidate phrases. We then apply FP-growth to obtain frequently co-occurring word sets. Next, we apply improved features that characterize desired properties from different aspects, and we apply gradient descent to our ranking function to obtain reasonable weights for these features. For evaluation, we combine Normalized Google Distance (NGD) and nDCG to measure the semantic relatedness between the generated phrases and the hidden original section titles.
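
The two evaluation measures named in the abstract have standard definitions, which can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the abstract does not say how NGD and nDCG are combined, so the `gain_from_ngd` mapping (relevance = 1 - NGD, clipped at 0) is a hypothetical choice, and the hit counts in the usage lines are made-up numbers rather than real search-engine results.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalized Google Distance from page-hit counts.

    fx, fy: hits for each term alone; fxy: hits for both together;
    n: assumed total number of indexed pages.
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return float("inf")  # terms never co-occur: maximally distant
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


def dcg(gains):
    """Discounted cumulative gain of a ranked list of relevance gains."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def gain_from_ngd(distance):
    """Hypothetical mapping (an assumption): smaller distance, larger gain."""
    return max(0.0, 1.0 - distance)


# Toy usage: each tuple is (hits for phrase, hits for title, joint hits)
# for one ranked candidate phrase against the hidden section title.
N = 1_000_000
ranked_counts = [(5000, 8000, 3000), (7000, 8000, 400), (900, 8000, 600)]
gains = [gain_from_ngd(ngd(fx, fy, fxy, N)) for fx, fy, fxy in ranked_counts]
score = ndcg(gains)
```

Under this sketch, a ranking that places the phrases semantically closest to the hidden title first receives a score near 1, and a ranking that places them last receives a lower score.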

Original language: English
Title of host publication: 2016 IEEE/ACIS 15th International Conference on Computer and Information Science, ICIS 2016 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781509008063
DOI: https://doi.org/10.1109/ICIS.2016.7550850
Publication status: Published - 2016 Aug 23
Event: 15th IEEE/ACIS International Conference on Computer and Information Science, ICIS 2016 - Okayama, Japan
Duration: 2016 Jun 26 to 2016 Jun 29

Other

Other: 15th IEEE/ACIS International Conference on Computer and Information Science, ICIS 2016
Country: Japan
City: Okayama
Period: 16/6/26 to 16/6/29

Fingerprint

Wikipedia
Labeling
Semantics
Ranking Function
Gradient Descent
Summarization
Categorization
Granularity
Weighting
Target
Evaluation

Keywords

  • Co-occurring Word Sets
  • Gradient Descent
  • Wikipedia

ASJC Scopus subject areas

  • Computer Science (all)
  • Energy Engineering and Power Technology
  • Control and Optimization

Cite this

Liu, S., & Iwaihara, M. (2016). Extracting representative phrases from Wikipedia article sections. In 2016 IEEE/ACIS 15th International Conference on Computer and Information Science, ICIS 2016 - Proceedings [7550850]. Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICIS.2016.7550850
