A semi-supervised approach to improve classification of infrequent discourse relations using feature vector extension

Hugo Hernault*, Danushka Bollegala, Mitsuru Ishizuka

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contribution

33 Citations (Scopus)

Abstract

Several recent discourse parsers have employed fully-supervised machine learning approaches. These methods require human annotators to beforehand create an extensive training corpus, which is a time-consuming and costly process. On the other hand, un-labeled data is abundant and cheap to collect. In this paper, we propose a novel semi-supervised method for discourse relation classification based on the analysis of co-occurring features in unlabeled data, which is then taken into account for extending the feature vectors given to a classifier. Our experimental results on the RST Discourse Tree-bank corpus and Penn Discourse Treebank indicate that the proposed method brings a significant improvement in classification accuracy and macro-average F-score when small training datasets are used. For instance, with training sets of c.a. 1000 labeled instances, the proposed method brings improvements in accuracy and macro-average F-score up to 50% compared to a baseline classifier. We believe that the proposed method is a first step towards detecting low-occurrence relations, which is useful for domains with a lack of annotated data.

Original languageEnglish
Title of host publicationEMNLP 2010 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Pages399-409
Number of pages11
Publication statusPublished - 2010
Externally publishedYes
EventConference on Empirical Methods in Natural Language Processing, EMNLP 2010 - Cambridge, MA
Duration: 2010 Oct 92010 Oct 11

Other

OtherConference on Empirical Methods in Natural Language Processing, EMNLP 2010
CityCambridge, MA
Period10/10/910/10/11

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems

Fingerprint

Dive into the research topics of 'A semi-supervised approach to improve classification of infrequent discourse relations using feature vector extension'. Together they form a unique fingerprint.

Cite this