Effect of spectrogram resolution on deep-neural-network-based speech enhancement

Daiki Takeuchi, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada

Research output: Contribution to journal › Article › peer-review

Abstract

In recent single-channel speech enhancement, deep neural networks (DNNs) have played an important role in achieving high performance. One standard use of a DNN is to construct a mask-generating function for time-frequency (T-F) masking. For applying a mask in the T-F domain, the short-time Fourier transform (STFT) is usually utilized because of its well-understood and invertible nature. While the mask-generating regression function has been studied for a long time, there is less research on the T-F transform from the viewpoint of speech enhancement. Since the performance of speech enhancement depends on both the T-F mask estimator and the T-F transform, investigating the T-F transform should be beneficial for designing a better enhancement system. In this paper, as a step toward an optimal T-F transform in terms of speech enhancement, we experimentally investigated the effect of the parameter settings of the STFT on a DNN-based mask estimator. We conducted experiments using three types of DNN architectures with three types of loss functions, and the results suggested that U-Net is robust to the parameter settings, while that is not the case for the fully connected and BLSTM networks.
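
The following is a minimal illustrative sketch, not the paper's code, of the pipeline the abstract describes: the STFT parameters (window length and hop size) determine the spectrogram resolution and redundancy, a T-F mask is applied to the complex spectrogram, and the inverse STFT returns the enhanced waveform. The mask here is a placeholder standing in for a DNN-based estimator, and all parameter values are assumptions for illustration only.

```python
# Sketch of STFT-based T-F masking (placeholder mask instead of a DNN estimator).
import numpy as np
from scipy.signal import stft, istft

fs = 16000                       # sampling rate [Hz] (assumed)
noisy = np.random.randn(fs)      # 1 s of noise as a stand-in for a noisy utterance

# STFT parameters under investigation: a longer window gives finer frequency
# resolution, and a smaller hop gives more redundancy (overlap) in time.
win_len = 512                    # window length in samples (32 ms at 16 kHz)
hop = 128                        # hop size in samples (75% overlap)

f, t, X = stft(noisy, fs=fs, window='hann',
               nperseg=win_len, noverlap=win_len - hop)

# Placeholder T-F mask in [0, 1]; in the paper this would be produced by a
# DNN (fully connected, BLSTM, or U-Net) from the noisy spectrogram.
mask = np.clip(np.abs(X) / (np.abs(X).max() + 1e-8), 0.0, 1.0)

# Apply the mask in the T-F domain and invert back to the time domain.
_, enhanced = istft(mask * X, fs=fs, window='hann',
                    nperseg=win_len, noverlap=win_len - hop)

print(X.shape)        # (frequency bins, time frames), set by win_len and hop
print(enhanced.shape)
```

Changing `win_len` and `hop` alters the spectrogram resolution seen by the mask estimator, which is the parameter sensitivity the paper studies experimentally.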

Original language: English
Pages (from-to): 769-775
Number of pages: 7
Journal: Acoustical Science and Technology
Volume: 41
Issue number: 5
Publication status: Published - 2020 Sep 1

Keywords

  • Deep learning
  • Experimental investigation
  • Redundancy
  • Speech enhancement
  • Time-frequency transform

ASJC Scopus subject areas

  • Acoustics and Ultrasonics
