TY - GEN
T1 - BLSTM-HMM hybrid system combined with sound activity detection network for polyphonic Sound Event Detection
AU - Hayashi, Tomoki
AU - Watanabe, Shinji
AU - Toda, Tomoki
AU - Hori, Takaaki
AU - Le Roux, Jonathan
AU - Takeda, Kazuya
PY - 2017/6/16
Y1 - 2017/6/16
N2 - This paper presents a new hybrid approach for polyphonic Sound Event Detection (SED) that combines a temporal structure modeling technique based on a hidden Markov model (HMM) with a frame-by-frame detection method based on a bidirectional long short-term memory (BLSTM) recurrent neural network (RNN). The proposed BLSTM-HMM hybrid system makes it possible to model sound event-dependent temporal structures and to perform sequence-by-sequence detection without the thresholding required by conventional frame-by-frame methods. Furthermore, to reduce insertion errors of sound events, which often occur under noisy conditions, we apply binary mask post-processing using a sound activity detection (SAD) network that identifies segments containing any sound event activity. We conduct experiments on the DCASE 2016 task 2 dataset to compare our proposed method with typical conventional methods, such as non-negative matrix factorization (NMF) and a standard BLSTM-RNN. Our proposed method outperforms the conventional methods, achieving an F1-score of 74.9% (error rate of 44.7%) on the event-based evaluation and an F1-score of 80.5% (error rate of 33.8%) on the segment-based evaluation, scores that for the most part also outperform the best reported results in the DCASE 2016 task 2 challenge.
KW - BLSTM-HMM
KW - Hybrid system
KW - Polyphonic sound event detection
KW - Sound activity detection
UR - http://www.scopus.com/inward/record.url?scp=85023781320&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85023781320&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2017.7952259
DO - 10.1109/ICASSP.2017.7952259
M3 - Conference contribution
AN - SCOPUS:85023781320
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 766
EP - 770
BT - 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017
Y2 - 5 March 2017 through 9 March 2017
ER -
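
The abstract above describes a binary mask post-processing step in which a SAD network's activity decisions suppress event detections in silent frames. The following is a minimal illustrative sketch of that idea, not code from the paper: the function name, array shapes, and the 0.5 threshold are assumptions made for illustration only.

import numpy as np

def apply_sad_mask(event_activity: np.ndarray, sad_posterior: np.ndarray,
                   sad_threshold: float = 0.5) -> np.ndarray:
    """Zero out event detections in frames the SAD network marks as silent.

    event_activity: (n_frames, n_events) binary event decisions per frame.
    sad_posterior:  (n_frames,) probability of any sound activity per frame.
    """
    sad_mask = (sad_posterior >= sad_threshold).astype(event_activity.dtype)
    # Broadcasting the (n_frames,) mask over all event classes suppresses
    # insertion errors in frames judged to contain no sound activity.
    return event_activity * sad_mask[:, np.newaxis]

# Toy usage: 5 frames, 3 event classes; frames 3 and 5 are masked out.
events = np.array([[1, 0, 0],
                   [1, 0, 1],
                   [0, 0, 1],
                   [0, 1, 0],
                   [0, 0, 0]])
sad = np.array([0.9, 0.8, 0.2, 0.7, 0.1])
print(apply_sad_mask(events, sad))

Because the mask is shared across all event classes, a single well-trained SAD network can remove spurious detections for every class at once, which is the mechanism the abstract credits for reducing insertion errors under noisy conditions.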