TY - JOUR
T1 - Audio-Visual Wake Word Spotting in MISP2021 Challenge
T2 - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022
AU - Zhou, Hengshun
AU - Du, Jun
AU - Zou, Gongzhen
AU - Nian, Zhaoxu
AU - Lee, Chin-Hui
AU - Siniscalchi, Sabato Marco
AU - Watanabe, Shinji
AU - Scharenborg, Odette
AU - Chen, Jingdong
AU - Xiong, Shifu
AU - Gao, Jianqing
N1 - Funding Information:
This work was supported by the National Natural Science Foundation of China under Grant No. 62171427 and the Strategic Priority Research Program of Chinese Academy of Sciences under Grant No. XDC08050200.
Publisher Copyright:
Copyright © 2022 ISCA.
PY - 2022
Y1 - 2022
AB - In this paper, we describe and publicly release the audio-visual wake word spotting (WWS) database of the MISP2021 Challenge, which covers a range of scenarios with audio and video data collected by near-, mid-, and far-field microphone arrays and cameras, to create a shared and publicly available database for WWS. The database and the code are released, and will be a valuable addition to the community for promoting WWS research using multi-modality information in realistic and complex conditions. Moreover, we investigated different data augmentation methods for single modalities on an end-to-end WWS network. A set of audio-visual fusion experiments and analyses was conducted to observe how visual information assists acoustic information under different audio and video field configurations. The results show that the fusion system generally improves over the single-modality (audio- or video-only) system, especially under complex noisy conditions.
KW - analysis
KW - audio-visual database
KW - data augmentation
KW - Wake word spotting
UR - http://www.scopus.com/inward/record.url?scp=85140071120&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85140071120&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2022-10650
DO - 10.21437/Interspeech.2022-10650
M3 - Conference article
AN - SCOPUS:85140071120
VL - 2022-September
SP - 1111
EP - 1115
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SN - 2308-457X
Y2 - 18 September 2022 through 22 September 2022
ER -