Unified Architecture for Multichannel End-to-End Speech Recognition with Neural Beamforming

Tsubasa Ochiai*, Shinji Watanabe, Takaaki Hori, John R. Hershey, Xiong Xiao

*Corresponding author for this work

Research output: Article, peer-reviewed

50 Citations (Scopus)

Abstract

This paper proposes a unified architecture for end-to-end automatic speech recognition (ASR) to encompass microphone-array signal processing such as a state-of-the-art neural beamformer within the end-to-end framework. Recently, the end-to-end ASR paradigm has attracted great research interest as an alternative to conventional hybrid paradigms with deep neural networks and hidden Markov models. Using this novel paradigm, we simplify ASR architecture by integrating such ASR components as acoustic, phonetic, and language models with a single neural network and optimize the overall components for the end-to-end ASR objective: generating a correct label sequence. Although most existing end-to-end frameworks have mainly focused on ASR in clean environments, our aim is to build more realistic end-to-end systems in noisy environments. To handle such challenging noisy ASR tasks, we study multichannel end-to-end ASR architecture, which directly converts multichannel speech signal to text through speech enhancement. This architecture allows speech enhancement and ASR components to be jointly optimized to improve the end-to-end ASR objective and leads to an end-to-end framework that works well in the presence of strong background noise. We elaborate the effectiveness of our proposed method on the multichannel ASR benchmarks in noisy environments (CHiME-4 and AMI). The experimental results show that our proposed multichannel end-to-end system obtained performance gains over the conventional end-to-end baseline with enhanced inputs from a delay-and-sum beamformer (i.e., BeamformIT) in terms of character error rate. In addition, further analysis shows that our neural beamformer, which is optimized only with the end-to-end ASR objective, successfully learned a noise suppression function.

Original language: English
Article number: 8070987
Pages (from-to): 1274-1288
Number of pages: 15
Journal: IEEE Journal on Selected Topics in Signal Processing
Volume: 11
Issue number: 8
DOI
Publication status: Published - Dec 2017
Externally published: Yes

ASJC Scopus subject areas

  • Signal Processing
  • Electrical and Electronic Engineering
