Abstract
Several recent discourse parsers have employed fully-supervised machine learning approaches. These methods require human annotators to create an extensive training corpus beforehand, which is a time-consuming and costly process. On the other hand, unlabeled data is abundant and cheap to collect. In this paper, we propose a novel semi-supervised method for discourse relation classification based on the analysis of co-occurring features in unlabeled data, which is then taken into account for extending the feature vectors given to a classifier. Our experimental results on the RST Discourse Treebank corpus and Penn Discourse Treebank indicate that the proposed method brings a significant improvement in classification accuracy and macro-average F-score when small training datasets are used. For instance, with training sets of ca. 1000 labeled instances, the proposed method brings improvements in accuracy and macro-average F-score of up to 50% compared to a baseline classifier. We believe that the proposed method is a first step towards detecting low-occurrence relations, which is useful for domains with a lack of annotated data.
Original language | English |
---|---|
Title of host publication | EMNLP 2010 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference |
Pages | 399-409 |
Number of pages | 11 |
Publication status | Published - 2010 |
Externally published | Yes |
Event | Conference on Empirical Methods in Natural Language Processing, EMNLP 2010 - Cambridge, MA; Duration: 2010 Oct 9 → 2010 Oct 11 |
Other
Other | Conference on Empirical Methods in Natural Language Processing, EMNLP 2010 |
---|---|
City | Cambridge, MA |
Period | 2010 Oct 9 → 2010 Oct 11 |
ASJC Scopus subject areas
- Computational Theory and Mathematics
- Computer Science Applications
- Information Systems