TY - GEN
T1 - Joint Acoustic and Class Inference for Weakly Supervised Sound Event Detection
AU - Kothinti, Sandeep
AU - Imoto, Keisuke
AU - Chakrabarty, Debmalya
AU - Sell, Gregory
AU - Watanabe, Shinji
AU - Elhilali, Mounya
N1 - Funding Information:
This research was supported in part by National Institutes of Health grants R01HL133043 and U01AG058532 and Office of Naval Research grants N000141612045 and N000141712736.
Publisher Copyright:
© 2019 IEEE.
PY - 2019/5
Y1 - 2019/5
AB - Sound event detection is a challenging task, especially for scenes with multiple simultaneous events. While event classification methods tend to be fairly accurate, event localization presents additional challenges, especially when large amounts of labeled data are not available. Task 4 of the 2018 DCASE challenge presents an event detection task that requires accuracy in both segmentation and recognition of events while providing only weakly labeled training data. Supervised methods can produce accurate event labels but are limited in event segmentation when training data lacks event timestamps. On the other hand, unsupervised methods that model the acoustic properties of the audio can produce accurate event boundaries but are not guided by the characteristics of event classes and sound categories. We present a hybrid approach that combines acoustic-driven event boundary detection with supervised label inference using a deep neural network. This framework leverages the benefits of both unsupervised and supervised methodologies and takes advantage of large amounts of unlabeled data, making it ideal for large-scale weakly labeled event detection. Compared to a baseline system, the proposed approach delivers a 15% absolute improvement in F-score, demonstrating the benefits of the hybrid bottom-up, top-down approach.
KW - Sound event detection
KW - conditional restricted Boltzmann machine
KW - restricted Boltzmann machine
KW - unsupervised learning
KW - weakly labeled data
UR - http://www.scopus.com/inward/record.url?scp=85068999213&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85068999213&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2019.8682772
DO - 10.1109/ICASSP.2019.8682772
M3 - Conference contribution
AN - SCOPUS:85068999213
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 36
EP - 40
BT - 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
Y2 - 12 May 2019 through 17 May 2019
ER -