Data-driven Design of Perfect Reconstruction Filterbank for DNN-based Sound Source Enhancement

Daiki Takeuchi, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

Research output: Conference contribution

3 Citations (Scopus)

Abstract

We propose a data-driven method for designing a perfect-reconstruction filterbank (PRFB) for sound-source enhancement (SSE) based on deep neural networks (DNNs). DNNs have been used to estimate a time-frequency (T-F) mask in the short-time Fourier transform (STFT) domain. Their training is more stable with a simple cost function such as mean-squared error (MSE) than with more advanced costs such as objective sound quality assessments. However, such a simple cost function carries strong assumptions about the statistics of the target and/or noise that are often not satisfied, and this mismatch degrades performance. In this paper, we propose to design the frequency scale of the PRFB from training data so that the assumptions underlying MSE are satisfied. To design the frequency scale, the warped filterbank frame (WFBF) is adopted as the PRFB. The frequency characteristic of the learned WFBF lay between those of the STFT and the wavelet transform, and its effectiveness was confirmed by comparison with a standard STFT-based DNN whose input features are compressed onto the mel scale.
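The T-F masking pipeline mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' method: an ordinary STFT with a square-root Hann window stands in for the learned WFBF, and an oracle Wiener-like mask computed from the known target and noise stands in for the DNN output. The signal, window length, hop size, and all function names are assumptions chosen for illustration only.

```python
import numpy as np

N_FFT, HOP = 256, 128  # assumed frame length and hop (50% overlap)

# Periodic sqrt-Hann window: used for both analysis and synthesis, its square
# sums to 1 across 50%-overlapped frames, which gives perfect reconstruction.
WIN = np.sqrt(0.5 - 0.5 * np.cos(2 * np.pi * np.arange(N_FFT) / N_FFT))

def stft(x):
    """Analysis: zero-pad, frame, window, and take the real FFT of each frame."""
    padded = np.pad(x, (N_FFT, N_FFT))  # ensures every sample lies in two frames
    n_frames = 1 + (len(padded) - N_FFT) // HOP
    frames = np.stack([padded[i * HOP : i * HOP + N_FFT] * WIN
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(X, length):
    """Synthesis: inverse FFT, window again, overlap-add, and remove the padding."""
    frames = np.fft.irfft(X, n=N_FFT, axis=1) * WIN
    out = np.zeros(N_FFT + HOP * (len(frames) - 1))
    for i, frame in enumerate(frames):
        out[i * HOP : i * HOP + N_FFT] += frame
    return out[N_FFT : N_FFT + length]

# Toy data (assumption): a 440 Hz tone as the target, white Gaussian noise.
rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 440 * t)
noise = 0.3 * rng.standard_normal(fs)
mixture = target + noise

S, N, X = stft(target), stft(noise), stft(mixture)

# Oracle Wiener-like mask; in DNN-based SSE a network would estimate this from X.
mask = np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + 1e-12)
S_hat = mask * X

# MSE cost in the T-F domain, before and after masking.
mse_unprocessed = np.mean(np.abs(X - S) ** 2)
mse_masked = np.mean(np.abs(S_hat - S) ** 2)

# Back to the time domain through the perfect-reconstruction synthesis.
enhanced = istft(S_hat, len(mixture))
```

Because the analysis and synthesis pair reconstructs the input exactly when no mask is applied, any change in the output comes from the mask alone. That is the property a PRFB is meant to preserve when the fixed STFT is replaced by a transform with a different frequency scale.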

Original language: English
Title of host publication: 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 596-600
Number of pages: 5
ISBN (Electronic): 9781479981311
DOI: https://doi.org/10.1109/ICASSP.2019.8683861
Publication status: Published - 1 May 2019
Event: 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Brighton, United Kingdom
Duration: 12 May 2019 to 17 May 2019

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2019-May
ISSN (Print): 1520-6149

Conference

Conference: 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
Country: United Kingdom
City: Brighton
Period: 12 May 2019 to 17 May 2019

Fingerprint

Fourier transforms
Acoustic waves
Cost functions
Wavelet transforms
Masks
Statistics
Deep neural networks
Costs

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering

Cite this

Takeuchi, D., Yatabe, K., Koizumi, Y., Oikawa, Y., & Harada, N. (2019). Data-driven Design of Perfect Reconstruction Filterbank for DNN-based Sound Source Enhancement. In 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings (pp. 596-600). [8683861] (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2019-May). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP.2019.8683861

Data-driven Design of Perfect Reconstruction Filterbank for DNN-based Sound Source Enhancement. / Takeuchi, Daiki; Yatabe, Kohei; Koizumi, Yuma; Oikawa, Yasuhiro; Harada, Noboru.

2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2019. p. 596-600 8683861 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2019-May).

Takeuchi, D, Yatabe, K, Koizumi, Y, Oikawa, Y & Harada, N 2019, Data-driven Design of Perfect Reconstruction Filterbank for DNN-based Sound Source Enhancement. in 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings., 8683861, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2019-May, Institute of Electrical and Electronics Engineers Inc., pp. 596-600, 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019, Brighton, United Kingdom, 12 May 2019. https://doi.org/10.1109/ICASSP.2019.8683861
Takeuchi D, Yatabe K, Koizumi Y, Oikawa Y, Harada N. Data-driven Design of Perfect Reconstruction Filterbank for DNN-based Sound Source Enhancement. In 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2019. p. 596-600. 8683861. (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings). https://doi.org/10.1109/ICASSP.2019.8683861
Takeuchi, Daiki ; Yatabe, Kohei ; Koizumi, Yuma ; Oikawa, Yasuhiro ; Harada, Noboru. / Data-driven Design of Perfect Reconstruction Filterbank for DNN-based Sound Source Enhancement. 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2019. pp. 596-600 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings).
@inproceedings{051b29a19c734b3cba6dd5af635eaac6,
title = "Data-driven Design of Perfect Reconstruction Filterbank for DNN-based Sound Source Enhancement",
keywords = "deep learning, frequency-warped filterbank, Learned time-frequency transform, sound source enhancement",
author = "Daiki Takeuchi and Kohei Yatabe and Yuma Koizumi and Yasuhiro Oikawa and Noboru Harada",
year = "2019",
month = "5",
day = "1",
doi = "10.1109/ICASSP.2019.8683861",
language = "English",
series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "596--600",
booktitle = "2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings",

}

TY - GEN

T1 - Data-driven Design of Perfect Reconstruction Filterbank for DNN-based Sound Source Enhancement

AU - Takeuchi, Daiki

AU - Yatabe, Kohei

AU - Koizumi, Yuma

AU - Oikawa, Yasuhiro

AU - Harada, Noboru

PY - 2019/5/1

Y1 - 2019/5/1

KW - deep learning

KW - frequency-warped filterbank

KW - Learned time-frequency transform

KW - sound source enhancement

UR - http://www.scopus.com/inward/record.url?scp=85068981711&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85068981711&partnerID=8YFLogxK

U2 - 10.1109/ICASSP.2019.8683861

DO - 10.1109/ICASSP.2019.8683861

M3 - Conference contribution

T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

SP - 596

EP - 600

BT - 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

ER -