GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio

Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Zhao You, Zhiyong Yan

研究成果: Conference contribution

3 被引用数 (Scopus)

抄録

This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 33,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 33,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training sub-sets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality. Baseline systems are provided for popular speech recognition toolkits, namely Athena, ESPnet, Kaldi and Pika.

本文言語English
ホスト出版物のタイトル22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
出版社International Speech Communication Association
ページ4376-4380
ページ数5
ISBN(電子版)9781713836902
DOI
出版ステータスPublished - 2021
外部発表はい
イベント22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021 - Brno, Czech Republic
継続期間: 2021 8月 302021 9月 3

出版物シリーズ

名前Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
6
ISSN(印刷版)2308-457X
ISSN(電子版)1990-9772

Conference

Conference22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
国/地域Czech Republic
CityBrno
Period21/8/3021/9/3

ASJC Scopus subject areas

  • 言語および言語学
  • 人間とコンピュータの相互作用
  • 信号処理
  • ソフトウェア
  • モデリングとシミュレーション

フィンガープリント

「GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio」の研究トピックを掘り下げます。これらがまとまってユニークなフィンガープリントを構成します。

引用スタイル