Beamforming networks using spatial covariance features for far-field speech recognition

Xiong Xiao, Shinji Watanabe, Eng Siong Chng, Haizhou Li

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

Recently, a deep beamforming (BF) network was proposed to predict BF weights from phase-carrying features, such as generalized cross correlation (GCC). The BF network is trained jointly with the acoustic model to minimize the automatic speech recognition (ASR) cost function. In this paper, we propose to replace GCC with features derived from the input signals' spatial covariance matrices (SCM), which contain the phase information of individual frequency bands. Experimental results on the AMI meeting transcription task show that the BF network using SCM features significantly reduces the word error rate to 44.1% from the 47.9% obtained with the conventional ASR pipeline using delay-and-sum BF. Compared with GCC features, we also observe a small but steady absolute gain of 0.6%. The use of SCM features also facilitates the implementation of more advanced BF methods within a deep learning framework, such as minimum variance distortionless response BF, which requires the speech and noise SCMs.
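To make the quantities named in the abstract concrete, the following is a minimal NumPy sketch of a per-frequency spatial covariance matrix, a real-valued feature vector derived from it, and the textbook minimum variance distortionless response (MVDR) weights. It is an illustration under stated assumptions, not the authors' implementation: the function names, the (channels, frames, frequency bins) array layout, the choice of flattening the upper triangle into real and imaginary parts, and the use of a known steering vector are all hypothetical and are not taken from the paper.

```python
import numpy as np


def spatial_covariance(stft):
    """Average x x^H over frames for every frequency bin.

    stft: complex array of shape (channels, frames, freq_bins) (assumed layout).
    Returns an array of shape (freq_bins, channels, channels).
    """
    num_frames = stft.shape[1]
    # scm[f, c, d] = (1/T) * sum_t stft[c, t, f] * conj(stft[d, t, f])
    return np.einsum("ctf,dtf->fcd", stft, stft.conj()) / num_frames


def scm_features(scm):
    """Flatten the SCMs into one real-valued vector (hypothetical feature layout).

    Each SCM is Hermitian, so its upper triangle carries all the information;
    the real and imaginary parts are concatenated so a real-valued network
    could consume them.
    """
    num_channels = scm.shape[-1]
    rows, cols = np.triu_indices(num_channels)
    upper = scm[:, rows, cols]  # shape (freq_bins, C * (C + 1) / 2)
    return np.concatenate([upper.real, upper.imag], axis=-1).reshape(-1)


def mvdr_weights(noise_scm, steering):
    """Textbook MVDR weights w = R_n^{-1} d / (d^H R_n^{-1} d) per frequency bin.

    noise_scm: (freq_bins, channels, channels) Hermitian positive-definite matrices.
    steering:  (freq_bins, channels) steering vectors, assumed known here.
    """
    # Solve R_n x = d for each bin instead of forming an explicit inverse.
    numerator = np.linalg.solve(noise_scm, steering[..., None])[..., 0]
    denominator = np.einsum("fc,fc->f", steering.conj(), numerator)
    return numerator / denominator[:, None]
```

By construction the MVDR weights satisfy the distortionless constraint w^H d = 1 in every frequency bin, which gives a simple sanity check for the sketch. In a beamforming network, the SCM-derived vector would take the place of GCC as the network input; how the paper normalizes or compresses these features is not stated in the abstract.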

Original language: English
Title of host publication: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9789881476821
DOI: https://doi.org/10.1109/APSIPA.2016.7820724
Publication status: Published - 2017 Jan 17
Externally published: Yes
Event: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016 - Jeju, Korea, Republic of
Duration: 2016 Dec 13 to 2016 Dec 16

Other

Other: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016
Country: Korea, Republic of
City: Jeju
Period: 16/12/13 to 16/12/16

Fingerprint

Beamforming
Speech recognition
Covariance matrix
Transcription
Cost functions
Frequency bands
Pipelines
Acoustics

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Science Applications
  • Information Systems
  • Signal Processing

Cite this

Xiao, X., Watanabe, S., Chng, E. S., & Li, H. (2017). Beamforming networks using spatial covariance features for far-field speech recognition. In 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016 [7820724] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/APSIPA.2016.7820724

Beamforming networks using spatial covariance features for far-field speech recognition. / Xiao, Xiong; Watanabe, Shinji; Chng, Eng Siong; Li, Haizhou.

2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016. Institute of Electrical and Electronics Engineers Inc., 2017. 7820724.

Xiao, X, Watanabe, S, Chng, ES & Li, H 2017, Beamforming networks using spatial covariance features for far-field speech recognition. in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016, 7820724, Institute of Electrical and Electronics Engineers Inc., 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016, Jeju, Korea, Republic of, 16/12/13. https://doi.org/10.1109/APSIPA.2016.7820724
Xiao X, Watanabe S, Chng ES, Li H. Beamforming networks using spatial covariance features for far-field speech recognition. In 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016. Institute of Electrical and Electronics Engineers Inc. 2017. 7820724 https://doi.org/10.1109/APSIPA.2016.7820724
Xiao, Xiong ; Watanabe, Shinji ; Chng, Eng Siong ; Li, Haizhou. / Beamforming networks using spatial covariance features for far-field speech recognition. 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016. Institute of Electrical and Electronics Engineers Inc., 2017.
@inproceedings{9ba83e08b86f4a09b802cd865ba3867c,
title = "Beamforming networks using spatial covariance features for far-field speech recognition",
abstract = "Recently, a deep beamforming (BF) network was proposed to predict BF weights from phase-carrying features, such as generalized cross correlation (GCC). The BF network is trained jointly with the acoustic model to minimize the automatic speech recognition (ASR) cost function. In this paper, we propose to replace GCC with features derived from the input signals' spatial covariance matrices (SCM), which contain the phase information of individual frequency bands. Experimental results on the AMI meeting transcription task show that the BF network using SCM features significantly reduces the word error rate to 44.1{\%} from the 47.9{\%} obtained with the conventional ASR pipeline using delay-and-sum BF. Compared with GCC features, we also observe a small but steady absolute gain of 0.6{\%}. The use of SCM features also facilitates the implementation of more advanced BF methods within a deep learning framework, such as minimum variance distortionless response BF, which requires the speech and noise SCMs.",
author = "Xiong Xiao and Shinji Watanabe and Chng, {Eng Siong} and Haizhou Li",
year = "2017",
month = "1",
day = "17",
doi = "10.1109/APSIPA.2016.7820724",
language = "English",
booktitle = "2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
address = "United States",

}

TY - GEN

T1 - Beamforming networks using spatial covariance features for far-field speech recognition

AU - Xiao, Xiong

AU - Watanabe, Shinji

AU - Chng, Eng Siong

AU - Li, Haizhou

PY - 2017/1/17

Y1 - 2017/1/17

AB - Recently, a deep beamforming (BF) network was proposed to predict BF weights from phase-carrying features, such as generalized cross correlation (GCC). The BF network is trained jointly with the acoustic model to minimize the automatic speech recognition (ASR) cost function. In this paper, we propose to replace GCC with features derived from the input signals' spatial covariance matrices (SCM), which contain the phase information of individual frequency bands. Experimental results on the AMI meeting transcription task show that the BF network using SCM features significantly reduces the word error rate to 44.1% from the 47.9% obtained with the conventional ASR pipeline using delay-and-sum BF. Compared with GCC features, we also observe a small but steady absolute gain of 0.6%. The use of SCM features also facilitates the implementation of more advanced BF methods within a deep learning framework, such as minimum variance distortionless response BF, which requires the speech and noise SCMs.

UR - http://www.scopus.com/inward/record.url?scp=85013826101&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85013826101&partnerID=8YFLogxK

U2 - 10.1109/APSIPA.2016.7820724

DO - 10.1109/APSIPA.2016.7820724

M3 - Conference contribution

AN - SCOPUS:85013826101

BT - 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016

PB - Institute of Electrical and Electronics Engineers Inc.

ER -