Does speech enhancement work with end-to-end ASR objectives?

Experimental analysis of multichannel end-to-end ASR

Tsubasa Ochiai, Shinji Watanabe, Shigeru Katagiri

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

Recently we proposed a novel multichannel end-to-end speech recognition architecture that integrates multichannel speech enhancement and speech recognition into a single neural-network-based architecture, and demonstrated its fundamental utility for automatic speech recognition (ASR). However, the behavior of this integrated system remains insufficiently understood. An open question is whether the speech enhancement component actually acquires speech enhancement (noise suppression) ability, since it is optimized with end-to-end ASR objectives rather than speech enhancement objectives. In this paper, we address this question through systematic evaluation experiments on the CHiME-4 corpus. We first show that the integrated end-to-end architecture acquires speech enhancement ability superior to that of a conventional alternative (a delay-and-sum beamformer), as measured by two signal-level metrics: the signal-to-distortion ratio (SDR) and the perceptual evaluation of speech quality (PESQ). Our findings suggest that to further improve the integrated system, we must strengthen the latter-stage speech recognition component; however, only a limited amount of multichannel noisy speech data is available. Given this limitation, we next investigate the effect of using a large amount of single-channel clean speech data, e.g., the WSJ corpus, for additional training of the speech recognition component. We show that this use of clean speech significantly improves the overall performance of the multichannel end-to-end architecture on multichannel noisy ASR tasks.
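For orientation, both the conventional baseline and the main signal-level metric in this comparison are simple to sketch. The following is a minimal NumPy illustration, not the authors' implementation: the function names are ours, the SDR shown is the plain energy-ratio definition (reported SDR figures are typically computed with the BSS Eval toolkit, which additionally fits a short distortion filter to the reference), and PESQ (ITU-T P.862) is omitted because it requires a standardized reference implementation.

import numpy as np

def delay_and_sum(channels):
    # channels: (M, T) array of M microphone signals.
    # Align each channel to channel 0 by the integer delay that maximizes
    # cross-correlation, then average across microphones.
    M, T = channels.shape
    aligned = np.empty_like(channels)
    for m in range(M):
        xcorr = np.correlate(channels[0], channels[m], mode="full")
        lag = int(np.argmax(xcorr)) - (T - 1)
        # Circular shift: a rough stand-in for a proper fractional-delay filter.
        aligned[m] = np.roll(channels[m], lag)
    return aligned.mean(axis=0)

def sdr_db(reference, estimate):
    # Signal-to-distortion ratio in dB (plain energy ratio of time-aligned signals).
    distortion = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(distortion ** 2))

The paper's finding is that the jointly trained neural beamformer surpasses this classical delay-and-sum baseline on both SDR and PESQ, even though its training signal is the end-to-end ASR objective rather than any signal-level loss.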

Original language: English
Title of host publication: 2017 IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2017 - Proceedings
Publisher: IEEE Computer Society
Pages: 1-5
Number of pages: 5
Volume: 2017-September
ISBN (Electronic): 9781509063413
DOIs: 10.1109/MLSP.2017.8168188
Publication status: Published - 2017 Dec 5
Externally published: Yes
Event: 2017 IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2017 - Tokyo, Japan
Duration: 2017 Sep 25 - 2017 Sep 28

Other

Other: 2017 IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2017
Country: Japan
City: Tokyo
Period: 17/9/25 - 17/9/28

Keywords

  • Encoder-decoder network
  • Multichannel end-to-end automatic speech recognition
  • Neural beamformer

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Signal Processing

Cite this

Ochiai, T., Watanabe, S., & Katagiri, S. (2017). Does speech enhancement work with end-to-end ASR objectives? Experimental analysis of multichannel end-to-end ASR. In 2017 IEEE International Workshop on Machine Learning for Signal Processing, MLSP 2017 - Proceedings (Vol. 2017-September, pp. 1-5). IEEE Computer Society. https://doi.org/10.1109/MLSP.2017.8168188
