Multi-modal data augmentation for end-to-end ASR

Adithya Renduchintala, Shuoyang Ding, Matthew Wiesner, Shinji Watanabe

Research output: Conference article › peer-review

18 Citations (Scopus)

Abstract

We present a new end-to-end architecture for automatic speech recognition (ASR) that can be trained using symbolic input in addition to the traditional acoustic input. This architecture utilizes two separate encoders: one for acoustic input and another for symbolic input, both sharing the attention and decoder parameters. We call this architecture a multi-modal data augmentation network (MMDA), as it can support multi-modal (acoustic and symbolic) input and enables seamless mixing of large text datasets with significantly smaller transcribed speech corpora during training. We study different ways of transforming large text corpora into a symbolic form suitable for training our MMDA network. Our best MMDA setup obtains small improvements in character error rate (CER), and as much as 7-10% relative word error rate (WER) improvement over a baseline, both with and without an external language model.
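To make the two-encoder, shared-attention-decoder idea from the abstract concrete, here is a minimal PyTorch sketch. All module names, dimensions, and the simplified attention below are illustrative assumptions, not the authors' released implementation; start-of-sequence shifting and the full training objective are omitted for brevity.

```python
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    """Encodes acoustic feature frames (e.g. log-Mel filterbanks)."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, feats):                  # feats: (B, T, feat_dim)
        out, _ = self.rnn(feats)
        return out                             # (B, T, 2*hidden)

class SymbolicEncoder(nn.Module):
    """Encodes symbolic input derived from text-only corpora."""
    def __init__(self, vocab, hidden=256, emb=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.rnn = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)

    def forward(self, tokens):                 # tokens: (B, T) int ids
        out, _ = self.rnn(self.emb(tokens))
        return out                             # (B, T, 2*hidden)

class SharedAttentionDecoder(nn.Module):
    """Attention and decoder parameters shared across both modalities."""
    def __init__(self, vocab, enc_dim=512, hidden=256, emb=128):
        super().__init__()
        self.hidden = hidden
        self.emb = nn.Embedding(vocab, emb)
        self.attn = nn.Linear(hidden + enc_dim, 1)
        self.cell = nn.LSTMCell(emb + enc_dim, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, enc, targets):           # enc: (B,T,enc_dim); targets: (B,U)
        B, T, _ = enc.shape
        h = enc.new_zeros(B, self.hidden)
        c = enc.new_zeros(B, self.hidden)
        logits = []
        for u in range(targets.size(1)):       # teacher forcing; BOS shift omitted
            query = h.unsqueeze(1).expand(B, T, self.hidden)
            score = self.attn(torch.cat([query, enc], dim=-1))    # (B, T, 1)
            ctx = (torch.softmax(score, dim=1) * enc).sum(dim=1)  # (B, enc_dim)
            h, c = self.cell(torch.cat([self.emb(targets[:, u]), ctx], dim=-1),
                             (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)      # (B, U, vocab)

# Training alternates speech and text batches; gradients from both paths flow
# into the same attention/decoder parameters (the data-augmentation effect).
vocab = 50
acoustic_enc, symbolic_enc = AcousticEncoder(), SymbolicEncoder(vocab)
decoder = SharedAttentionDecoder(vocab)
loss_fn = nn.CrossEntropyLoss()

feats = torch.randn(4, 120, 80)               # dummy acoustic batch
speech_tgt = torch.randint(0, vocab, (4, 20))
text_in = torch.randint(0, vocab, (4, 30))    # dummy symbolic (text) batch
text_tgt = torch.randint(0, vocab, (4, 30))

for enc_out, tgt in [(acoustic_enc(feats), speech_tgt),
                     (symbolic_enc(text_in), text_tgt)]:
    logits = decoder(enc_out, tgt)
    loss_fn(logits.reshape(-1, vocab), tgt.reshape(-1)).backward()
```

Because the decoder never sees which encoder produced its input states, text-only batches can be mixed freely with transcribed speech, which is what lets the much larger text corpus regularize the decoder.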

Original language: English
Pages (from-to): 2394-2398
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2018-September
DOI
Publication status: Published - 2018
Externally published: Yes
Event: 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India
Duration: 2 Sep 2018 – 6 Sep 2018

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation
