Robust language modeling for a small corpus of target tasks using class-combined word statistics and selective use of a general corpus

Yosuke Wada*, Norihiko Kobayashi, Tetsunori Kobayashi

*Corresponding author of this work

Research output: Article › peer-review

2 Citations (Scopus)

Abstract

In order to improve the accuracy of language models in speech recognition tasks for which collecting a large text corpus for language model training is difficult, we propose a class-combined bigram and selective use of general text. In the class-combined bigram, the word bigram and the class bigram are combined using weights that are expressed as functions of the preceding word frequency and the succeeding word-type count. An experiment has shown that the accuracy of the proposed class-combined bigram is equivalent to that of a word bigram trained with a text corpus approximately three times larger. In the selective use of general text, the language model was corrected by automatically selecting sentences that were expected to improve accuracy from a large volume of text collected without specifying the task, and by adding these sentences to a small corpus of target tasks. An experiment has shown that the recognition error rate was reduced by up to 12% compared to a case in which text was not selected. Lastly, when we created a model that uses both the class-combined bigram and text addition, further improvements were obtained: approximately 34% in adjusted perplexity and approximately 31% in the recognition error rate compared to the word bigram created from the target task text only.
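The interpolation idea described above can be sketched in code. The following is a minimal, hypothetical illustration only: the paper's actual weight functions of preceding-word frequency and succeeding word-type count are not reproduced here, so a simple Witten-Bell-style weight (trust the word bigram more when the history is frequent and has few distinct successors) stands in for them, and the toy class assignments are invented for the example.

```python
from collections import defaultdict

def train_counts(sentences, word2class):
    """Collect word-bigram, class-bigram, and class-membership counts."""
    word_bi = defaultdict(int)     # (w1, w2) -> bigram count
    word_uni = defaultdict(int)    # w1 -> count as a history
    succ_types = defaultdict(set)  # w1 -> set of distinct successor words
    class_bi = defaultdict(int)    # (c1, c2) -> class-bigram count
    class_uni = defaultdict(int)   # c1 -> count as a history
    class_mem = defaultdict(int)   # (c, w) -> times w occurred with class c
    class_tot = defaultdict(int)   # c -> total occurrences of class c
    for sent in sentences:
        for w1, w2 in zip(sent, sent[1:]):
            c1, c2 = word2class[w1], word2class[w2]
            word_bi[(w1, w2)] += 1
            word_uni[w1] += 1
            succ_types[w1].add(w2)
            class_bi[(c1, c2)] += 1
            class_uni[c1] += 1
            class_mem[(c2, w2)] += 1
            class_tot[c2] += 1
    return (word_bi, word_uni, succ_types,
            class_bi, class_uni, class_mem, class_tot)

def prob(w1, w2, word2class, counts):
    """Class-combined bigram probability P(w2 | w1)."""
    (word_bi, word_uni, succ_types,
     class_bi, class_uni, class_mem, class_tot) = counts
    c1, c2 = word2class[w1], word2class[w2]
    n1 = word_uni[w1]            # frequency of the preceding word
    t1 = len(succ_types[w1])     # succeeding word-type count
    # Stand-in weight: grows with history frequency, shrinks with
    # successor diversity (Witten-Bell style).
    lam = n1 / (n1 + t1) if n1 + t1 > 0 else 0.0
    p_word = word_bi[(w1, w2)] / n1 if n1 else 0.0
    if class_uni[c1] and class_tot[c2]:
        # P(c2 | c1) * P(w2 | c2)
        p_class = (class_bi[(c1, c2)] / class_uni[c1]) * \
                  (class_mem[(c2, w2)] / class_tot[c2])
    else:
        p_class = 0.0
    return lam * p_word + (1.0 - lam) * p_class
```

For a history seen often with only one successor, `lam` approaches 1 and the estimate stays close to the word bigram; for a rare history with many distinct successors, the smoother class bigram dominates, which is the intended robustness effect for a small corpus.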

Original language: English
Pages (from-to): 92-102
Number of pages: 11
Journal: Systems and Computers in Japan
Volume: 34
Issue number: 12
DOI
Publication status: Published - 2003 11 15

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Information Systems
  • Hardware and Architecture
  • Computational Theory and Mathematics
