Improved MVDR beamforming using single-channel mask prediction networks

Hakan Erdogan, John Hershey, Shinji Watanabe, Michael Mandel, Jonathan Le Roux

Research output: Contribution to journal (Article)

65 Citations (Scopus)

Abstract

Recent studies on multi-microphone speech databases indicate that beamforming improves speech recognition accuracy, especially when the level of background noise is high. Minimum variance distortionless response (MVDR) beamforming is an important method that performs well for speech recognition, especially if the steering vector is known. However, steering the beamformer to focus on speech in unknown acoustic conditions remains a challenging problem. In this study, we use single-channel speech enhancement deep networks to form masks that can be used for noise spatial covariance estimation, which steers the MVDR beamformer toward the speech. We analyze how mask prediction affects performance and discuss various ways to use masks to obtain reliable speech and noise spatial covariance estimates. We show that using a single mask across microphones for covariance prediction with minima-limited post-masking yields the best result in terms of signal-level quality measures and speech recognition word error rates in a mismatched training condition.
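The mask-to-covariance-to-beamformer chain described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a commonly used reference-microphone MVDR formulation, w = (Φn⁻¹ Φs) u / tr(Φn⁻¹ Φs), a single time-frequency speech mask shared across microphones, and the noise mask taken as one minus the speech mask. The function names, the diagonal-loading constant, and the array layout (mics, freqs, frames) are all hypothetical choices made for illustration, and the post-masking step is not shown.

```python
import numpy as np


def masked_covariance(stft, mask, eps=1e-10):
    """Mask-weighted spatial covariance for each frequency bin.

    stft: complex array, shape (mics, freqs, frames)
    mask: real array in [0, 1], shape (freqs, frames), shared by all mics
    Returns a complex array of shape (freqs, mics, mics).
    """
    weighted = stft * mask[None, :, :]
    # Sum over frames of mask * x x^H, separately for each frequency.
    cov = np.einsum("mft,nft->fmn", weighted, stft.conj())
    norm = np.maximum(mask.sum(axis=1), eps)
    return cov / norm[:, None, None]


def mvdr_weights(cov_speech, cov_noise, ref_mic=0, diag_load=1e-6):
    """MVDR weights per frequency, shape (freqs, mics).

    Uses the reference-microphone form, which needs no explicit
    steering vector (an assumption of this sketch).
    """
    n_mics = cov_noise.shape[-1]
    # Diagonal loading keeps the noise covariance well conditioned.
    trace_n = np.trace(cov_noise, axis1=-2, axis2=-1).real
    loaded = cov_noise + (diag_load * trace_n / n_mics)[:, None, None] * np.eye(n_mics)
    numerator = np.linalg.solve(loaded, cov_speech)  # Phi_n^{-1} Phi_s
    trace = np.trace(numerator, axis1=-2, axis2=-1)
    return numerator[..., ref_mic] / (trace[:, None] + 1e-10)


def apply_beamformer(weights, stft):
    """Beamformer output y[f, t] = w[f]^H x[f, t], shape (freqs, frames)."""
    return np.einsum("fm,mft->ft", weights.conj(), stft)


def mask_based_mvdr(stft, speech_mask, ref_mic=0):
    """Full chain: mask -> covariances -> MVDR weights -> enhanced STFT."""
    cov_s = masked_covariance(stft, speech_mask)
    cov_n = masked_covariance(stft, 1.0 - speech_mask)  # assumed noise mask
    return apply_beamformer(mvdr_weights(cov_s, cov_n, ref_mic), stft)
```

In such a setup the speech mask would come from the single-channel enhancement network, for example by combining the per-microphone masks into one before calling `mask_based_mvdr`; how the masks are combined is left open here.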

Original language: English
Pages (from-to): 1981-1985
Number of pages: 5
Journal: Unknown Journal
Volume: 08-12-September-2016
DOI: https://doi.org/10.21437/Interspeech.2016-552
Publication status: Published - 2016
Externally published: Yes

Fingerprint

Minimum Variance
Beamforming
Mask
Speech Recognition
Prediction
Microphones
Covariance Estimation
Speech Enhancement
Quality Measures
Performance Prediction
Masking
Error Rate

Keywords

  • LSTM
  • Microphone arrays
  • MVDR beamforming
  • Neural networks
  • Speech enhancement

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

Erdogan, H., Hershey, J., Watanabe, S., Mandel, M., & Le Roux, J. (2016). Improved MVDR beamforming using single-channel mask prediction networks. Unknown Journal, 08-12-September-2016, 1981-1985. https://doi.org/10.21437/Interspeech.2016-552

Improved MVDR beamforming using single-channel mask prediction networks. / Erdogan, Hakan; Hershey, John; Watanabe, Shinji; Mandel, Michael; Le Roux, Jonathan.

In: Unknown Journal, Vol. 08-12-September-2016, 2016, p. 1981-1985.

Erdogan, H, Hershey, J, Watanabe, S, Mandel, M & Le Roux, J 2016, 'Improved MVDR beamforming using single-channel mask prediction networks', Unknown Journal, vol. 08-12-September-2016, pp. 1981-1985. https://doi.org/10.21437/Interspeech.2016-552
Erdogan, Hakan ; Hershey, John ; Watanabe, Shinji ; Mandel, Michael ; Le Roux, Jonathan. / Improved MVDR beamforming using single-channel mask prediction networks. In: Unknown Journal. 2016 ; Vol. 08-12-September-2016. pp. 1981-1985.
@article{c62e942cdf2d43419495cbcae24f8a6e,
title = "Improved MVDR beamforming using single-channel mask prediction networks",
keywords = "LSTM, Microphone arrays, MVDR beamforming, Neural networks, Speech enhancement",
author = "Hakan Erdogan and John Hershey and Shinji Watanabe and Michael Mandel and {Le Roux}, Jonathan",
year = "2016",
doi = "10.21437/Interspeech.2016-552",
language = "English",
volume = "08-12-September-2016",
pages = "1981--1985",
journal = "Interspeech 2016",

}

TY - JOUR
T1 - Improved MVDR beamforming using single-channel mask prediction networks
AU - Erdogan, Hakan
AU - Hershey, John
AU - Watanabe, Shinji
AU - Mandel, Michael
AU - Le Roux, Jonathan
PY - 2016
Y1 - 2016
KW - LSTM
KW - Microphone arrays
KW - MVDR beamforming
KW - Neural networks
KW - Speech enhancement
UR - http://www.scopus.com/inward/record.url?scp=84994300465&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84994300465&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2016-552
DO - 10.21437/Interspeech.2016-552
M3 - Article
VL - 08-12-September-2016
SP - 1981
EP - 1985
JO - Interspeech 2016
JF - Interspeech 2016
ER -