A semi-supervised approach to improve classification of infrequent discourse relations using feature vector extension

Hugo Hernault, Danushka Bollegala, Mitsuru Ishizuka

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

25 Citations (Scopus)

Abstract

Several recent discourse parsers have employed fully supervised machine learning approaches. These methods require human annotators to create an extensive training corpus beforehand, which is a time-consuming and costly process. On the other hand, unlabeled data is abundant and cheap to collect. In this paper, we propose a novel semi-supervised method for discourse relation classification based on the analysis of co-occurring features in unlabeled data, which is then taken into account for extending the feature vectors given to a classifier. Our experimental results on the RST Discourse Treebank corpus and the Penn Discourse Treebank indicate that the proposed method brings a significant improvement in classification accuracy and macro-average F-score when small training datasets are used. For instance, with training sets of ca. 1000 labeled instances, the proposed method brings improvements in accuracy and macro-average F-score of up to 50% compared to a baseline classifier. We believe that the proposed method is a first step towards detecting low-occurrence relations, which is useful for domains with a lack of annotated data.
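
The abstract does not give the details of the extension step, so the following is only a minimal sketch of the general idea, assuming that co-occurrence is counted per unlabeled instance and that a sparse feature vector is extended with the features that co-occur most strongly with the ones it already contains. The function names, the row-normalized weighting, and the `top_k` cutoff are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(unlabeled):
    """Count how often each pair of features appears in the same unlabeled instance."""
    counts = defaultdict(lambda: defaultdict(float))
    for features in unlabeled:
        for a, b in combinations(sorted(set(features)), 2):
            counts[a][b] += 1.0
            counts[b][a] += 1.0
    return counts


def extend_vector(vector, counts, top_k=2):
    """Return a copy of `vector` with up to `top_k` co-occurring features added.

    Assumption: each candidate feature is scored by the row-normalized
    co-occurrence with the features already present, weighted by their values.
    """
    scores = defaultdict(float)
    for feature, weight in vector.items():
        row = counts.get(feature, {})
        total = sum(row.values())
        if total == 0:
            continue  # feature never seen in the unlabeled data
        for other, count in row.items():
            if other not in vector:
                scores[other] += weight * count / total
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    extended = dict(vector)
    extended.update(ranked)
    return extended


# Toy unlabeled data: each instance is a list of hypothetical lexical features.
unlabeled = [
    ["but", "however"],
    ["but", "although"],
    ["but", "however"],
    ["because", "since"],
]
counts = cooccurrence_counts(unlabeled)
extended = extend_vector({"but": 1.0}, counts, top_k=1)
```

In this toy setting, an instance that contains only the feature `but` also receives a weighted `however` feature, so a classifier trained on few labeled instances can still match test instances whose surface features it has never seen with a label.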

Original language: English
Title of host publication: EMNLP 2010 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Pages: 399-409
Number of pages: 11
Publication status: Published - 2010
Externally published: Yes
Event: Conference on Empirical Methods in Natural Language Processing, EMNLP 2010 - Cambridge, MA
Duration: 2010 Oct 9 to 2010 Oct 11

Other

Other: Conference on Empirical Methods in Natural Language Processing, EMNLP 2010
City: Cambridge, MA
Period: 10/10/9 to 10/10/11

Fingerprint

Macros
Classifiers
Learning systems

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems

Cite this

Hernault, H., Bollegala, D., & Ishizuka, M. (2010). A semi-supervised approach to improve classification of infrequent discourse relations using feature vector extension. In EMNLP 2010 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (pp. 399-409)

A semi-supervised approach to improve classification of infrequent discourse relations using feature vector extension. / Hernault, Hugo; Bollegala, Danushka; Ishizuka, Mitsuru.

EMNLP 2010 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference. 2010. p. 399-409.

Hernault, H, Bollegala, D & Ishizuka, M 2010, A semi-supervised approach to improve classification of infrequent discourse relations using feature vector extension. in EMNLP 2010 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference. pp. 399-409, Conference on Empirical Methods in Natural Language Processing, EMNLP 2010, Cambridge, MA, 10/10/9.
Hernault H, Bollegala D, Ishizuka M. A semi-supervised approach to improve classification of infrequent discourse relations using feature vector extension. In EMNLP 2010 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference. 2010. p. 399-409
Hernault, Hugo ; Bollegala, Danushka ; Ishizuka, Mitsuru. / A semi-supervised approach to improve classification of infrequent discourse relations using feature vector extension. EMNLP 2010 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference. 2010. pp. 399-409
@inproceedings{2e44d2b9f73a48a4a484808752ca0c1e,
title = "A semi-supervised approach to improve classification of infrequent discourse relations using feature vector extension",
abstract = "Several recent discourse parsers have employed fully supervised machine learning approaches. These methods require human annotators to create an extensive training corpus beforehand, which is a time-consuming and costly process. On the other hand, unlabeled data is abundant and cheap to collect. In this paper, we propose a novel semi-supervised method for discourse relation classification based on the analysis of co-occurring features in unlabeled data, which is then taken into account for extending the feature vectors given to a classifier. Our experimental results on the RST Discourse Treebank corpus and the Penn Discourse Treebank indicate that the proposed method brings a significant improvement in classification accuracy and macro-average F-score when small training datasets are used. For instance, with training sets of ca. 1000 labeled instances, the proposed method brings improvements in accuracy and macro-average F-score of up to 50{\%} compared to a baseline classifier. We believe that the proposed method is a first step towards detecting low-occurrence relations, which is useful for domains with a lack of annotated data.",
author = "Hugo Hernault and Danushka Bollegala and Mitsuru Ishizuka",
year = "2010",
language = "English",
isbn = "1932432868",
pages = "399--409",
booktitle = "EMNLP 2010 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference",

}

TY  - GEN
T1  - A semi-supervised approach to improve classification of infrequent discourse relations using feature vector extension
AU  - Hernault, Hugo
AU  - Bollegala, Danushka
AU  - Ishizuka, Mitsuru
PY  - 2010
Y1  - 2010
AB  - Several recent discourse parsers have employed fully supervised machine learning approaches. These methods require human annotators to create an extensive training corpus beforehand, which is a time-consuming and costly process. On the other hand, unlabeled data is abundant and cheap to collect. In this paper, we propose a novel semi-supervised method for discourse relation classification based on the analysis of co-occurring features in unlabeled data, which is then taken into account for extending the feature vectors given to a classifier. Our experimental results on the RST Discourse Treebank corpus and the Penn Discourse Treebank indicate that the proposed method brings a significant improvement in classification accuracy and macro-average F-score when small training datasets are used. For instance, with training sets of ca. 1000 labeled instances, the proposed method brings improvements in accuracy and macro-average F-score of up to 50% compared to a baseline classifier. We believe that the proposed method is a first step towards detecting low-occurrence relations, which is useful for domains with a lack of annotated data.
UR  - http://www.scopus.com/inward/record.url?scp=79952272600&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=79952272600&partnerID=8YFLogxK
M3  - Conference contribution
AN  - SCOPUS:79952272600
SN  - 1932432868
SN  - 9781932432862
SP  - 399
EP  - 409
BT  - EMNLP 2010 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
ER  -