BLSTM-HMM hybrid system combined with sound activity detection network for polyphonic Sound Event Detection

Tomoki Hayashi, Shinji Watanabe, Tomoki Toda, Takaaki Hori, Jonathan Le Roux, Kazuya Takeda

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

This paper presents a new hybrid approach for polyphonic Sound Event Detection (SED) which combines a temporal structure modeling technique based on a hidden Markov model (HMM) with a frame-by-frame detection method based on a bidirectional long short-term memory (BLSTM) recurrent neural network (RNN). The proposed BLSTM-HMM hybrid system makes it possible to model sound event-dependent temporal structures and to perform sequence-by-sequence detection without resorting to the thresholding used in conventional frame-by-frame methods. Furthermore, to effectively reduce insertion errors of sound events, which often occur under noisy conditions, we additionally apply binary-mask post-processing using a sound activity detection (SAD) network that identifies segments containing any sound event activity. We conduct an experiment on the DCASE 2016 task 2 dataset to compare our proposed method with typical conventional methods, such as non-negative matrix factorization (NMF) and a standard BLSTM-RNN. Our proposed method outperforms the conventional methods, achieving an F1-score of 74.9% (error rate of 44.7%) on the event-based evaluation and an F1-score of 80.5% (error rate of 33.8%) on the segment-based evaluation, most of these results also outperforming the best reported in the DCASE 2016 task 2 challenge.
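The post-processing pipeline described in the abstract — smoothing frame-level network posteriors with an HMM and then masking detections using a sound activity decision — can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a simplified two-state (inactive/active) HMM per event class rather than the event-dependent topologies the paper trains, and the posteriors, transition probability `p_stay`, and SAD decisions are placeholder inputs.

```python
import numpy as np

def viterbi_binary(posteriors, p_stay=0.9):
    """Smooth frame-level activity posteriors for one sound event class
    with a two-state (inactive/active) HMM via Viterbi decoding.

    posteriors: (T,) array of P(active | frame) from the frame-level network.
    p_stay: probability of staying in the same state (a hand-set assumption here).
    Returns a (T,) 0/1 array of decoded activity.
    """
    T = len(posteriors)
    # Emission likelihoods for states [inactive, active] at each frame
    emit = np.stack([1.0 - posteriors, posteriors], axis=1)          # (T, 2)
    log_emit = np.log(np.clip(emit, 1e-10, 1.0))
    log_trans = np.log(np.array([[p_stay, 1.0 - p_stay],
                                 [1.0 - p_stay, p_stay]]))
    delta = np.full((T, 2), -np.inf)   # best log-score ending in each state
    psi = np.zeros((T, 2), dtype=int)  # backpointers
    delta[0] = np.log(0.5) + log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # (prev_state, cur_state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]
    # Backtrack the best state sequence
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path

def apply_sad_mask(event_activity, sad_activity):
    """Zero out event detections in segments the SAD network marks as silent."""
    return event_activity * sad_activity
```

The Viterbi step replaces per-frame thresholding: a frame is labeled active only if the whole decoded sequence favors it, which suppresses isolated spurious frames; the SAD mask then removes any remaining detections in silent segments.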

Original language: English
Title of host publication: 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 766-770
Number of pages: 5
ISBN (Electronic): 9781509041176
DOI: 10.1109/ICASSP.2017.7952259
Publication status: Published - 2017 Jun 16
Externally published: Yes
Event: 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017 - New Orleans, United States
Duration: 2017 Mar 5 - 2017 Mar 9



Keywords

  • BLSTM-HMM
  • Hybrid system
  • Polyphonic sound event detection
  • Sound activity detection

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering

Cite this

Hayashi, T., Watanabe, S., Toda, T., Hori, T., Le Roux, J., & Takeda, K. (2017). BLSTM-HMM hybrid system combined with sound activity detection network for polyphonic Sound Event Detection. In 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017 - Proceedings (pp. 766-770). [7952259] Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/ICASSP.2017.7952259

