TY - JOUR
T1 - Teacher-student learning for low-latency online speech enhancement using Wave-U-Net
AU - Nakaoka, Sotaro
AU - Li, Li
AU - Inoue, Shota
AU - Makino, Shoji
N1 - Funding Information:
This work was supported by JSPS KAKENHI Grant Number 19H04131 and Strategic Core Technology Advancement Program (Supporting Industry Program).
Publisher Copyright:
©2021 IEEE
PY - 2021
Y1 - 2021
N2 - In this paper, we propose a low-latency online extension of Wave-U-Net for single-channel speech enhancement, which uses teacher-student learning to reduce the system latency while keeping the enhancement performance high. Wave-U-Net is a recently proposed end-to-end source separation method that has achieved remarkable performance in singing voice separation and speech enhancement tasks. Because the enhancement is performed in the time domain, Wave-U-Net can efficiently model phase information and avoid the limitations of the domain transformation required by the commonly adopted time-frequency-domain methods. Here, we apply Wave-U-Net to face-to-face applications such as hearing aids and in-car communication systems, where a strict latency of less than 10 ms is required. To this end, we investigate online versions of Wave-U-Net and propose the use of teacher-student learning to prevent the performance degradation caused by shortening the input segment so that the system delay on a CPU is less than 10 ms. The experimental results show that the proposed model can run in real time with low latency and high performance, achieving a signal-to-distortion ratio improvement of about 8.73 dB.
AB - In this paper, we propose a low-latency online extension of Wave-U-Net for single-channel speech enhancement, which uses teacher-student learning to reduce the system latency while keeping the enhancement performance high. Wave-U-Net is a recently proposed end-to-end source separation method that has achieved remarkable performance in singing voice separation and speech enhancement tasks. Because the enhancement is performed in the time domain, Wave-U-Net can efficiently model phase information and avoid the limitations of the domain transformation required by the commonly adopted time-frequency-domain methods. Here, we apply Wave-U-Net to face-to-face applications such as hearing aids and in-car communication systems, where a strict latency of less than 10 ms is required. To this end, we investigate online versions of Wave-U-Net and propose the use of teacher-student learning to prevent the performance degradation caused by shortening the input segment so that the system delay on a CPU is less than 10 ms. The experimental results show that the proposed model can run in real time with low latency and high performance, achieving a signal-to-distortion ratio improvement of about 8.73 dB.
KW - Low-latency
KW - Real-time
KW - Single-channel speech enhancement
KW - Teacher-student learning
KW - Wave-U-Net
UR - http://www.scopus.com/inward/record.url?scp=85115151394&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85115151394&partnerID=8YFLogxK
U2 - 10.1109/ICASSP39728.2021.9414280
DO - 10.1109/ICASSP39728.2021.9414280
M3 - Conference article
AN - SCOPUS:85115151394
VL - 2021-June
SP - 661
EP - 665
JO - Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing
JF - Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing
SN - 0736-7791
T2 - 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021
Y2 - 6 June 2021 through 11 June 2021
ER -