SCTB-V2: the 2nd version of the Chinese treebank in the scientific domain

Chenhui Chu*, Zhuoyuan Mao, Toshiaki Nakazawa, Daisuke Kawahara, Sadao Kurohashi

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Word segmentation, part-of-speech (POS) tagging, and syntactic parsing are three fundamental tasks in Chinese language processing, and they are crucial for downstream tasks such as machine translation and information extraction. To achieve high accuracy on these tasks, treebanks that contain sentences manually annotated with word segmentation, POS tags, and phrase structures are essential. Although large-scale Chinese treebanks exist in the news domain, no such treebanks are available in the scientific domain, which significantly limits the performance of Chinese language processing on scientific text. To address this problem, we annotate the 2nd version of the Chinese treebank in the scientific domain (SCTB-V2). SCTB-V2 contains 12,175 sentences annotated with word segmentation, POS tags, and phrase structures. We conducted Chinese analysis and machine translation experiments on SCTB-V2, and the results show its effectiveness. We release this treebank to promote research on scientific Chinese language processing: http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?A%20Chinese%20Treebank%20in%20Scientific%20Domain%20%28SCTB%29
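
To illustrate how a single treebank annotation can carry all three layers named in the abstract (word segmentation, POS tags, and phrase structure), here is a minimal sketch. It assumes a Penn-Treebank-style bracketed format; the abstract does not specify the file format of SCTB-V2, and the sample sentence, its tags, and the helper names `parse`, `tagged_words`, and `phrase_labels` are hypothetical and not taken from the treebank.

```python
# Hypothetical sample in an ASSUMED Penn-Treebank-style bracketing.
# It is not a sentence from SCTB-V2.
SAMPLE = "(IP (NP (NN 细胞) (NN 培养)) (VP (VV 需要) (NP (NN 氧气))))"


def parse(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())      # nested constituent
            else:
                children.append(tokens[pos])  # surface word
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def tagged_words(tree):
    """Return (word, POS tag) pairs in sentence order."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]  # preterminal: the label is the POS tag
    pairs = []
    for child in children:
        if not isinstance(child, str):
            pairs.extend(tagged_words(child))
    return pairs


def phrase_labels(tree):
    """Return the labels of phrase-level nodes in pre-order."""
    label, children = tree
    if all(isinstance(child, str) for child in children):
        return []  # preterminal, not a phrase
    labels = [label]
    for child in children:
        if not isinstance(child, str):
            labels.extend(phrase_labels(child))
    return labels


tree = parse(SAMPLE)
words = [word for word, _ in tagged_words(tree)]
print(" ".join(words))       # segmentation layer
print(tagged_words(tree))    # POS layer
print(phrase_labels(tree))   # phrase-structure layer
```

The leaves of the tree give the segmentation, the preterminal labels give the POS tags, and the remaining node labels give the phrase structure, which is why one annotated tree can serve as training data for all three tasks.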

Original language: English
Journal: Language Resources and Evaluation
Publication status: Accepted/In press - 2022

Keywords

  • Chinese
  • Scientific domain
  • Treebank

ASJC Scopus subject areas

  • Language and Linguistics
  • Education
  • Linguistics and Language
  • Library and Information Sciences
