Student-teacher learning for BLSTM mask-based speech enhancement

Aswin Shanmugam Subramanian, Szu Jui Chen, Shinji Watanabe

Research output: Contribution to journal › Conference article

2 Citations (Scopus)

Abstract

Spectral mask estimation using bidirectional long short-term memory (BLSTM) neural networks has been widely used in speech enhancement applications, and it has achieved great success when applied to multichannel enhancement with a mask-based beamformer. However, when these masks are used for single-channel speech enhancement, they severely distort the speech signal and make it unsuitable for speech recognition. This paper proposes a student-teacher learning paradigm for single-channel speech enhancement. The beamformed signal from multichannel enhancement is given as input to the teacher network to obtain soft masks. An additional cross-entropy loss term with the soft mask target is combined with the original loss, so that the student network with single-channel input is trained to mimic the soft mask obtained with multichannel input through beamforming. Experiments with the CHiME-4 challenge single-channel track data show an improvement in ASR performance.
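
The combined objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions the abstract does not confirm: the "original loss" is taken to be a binary cross-entropy against a hard (0/1) mask target, the two terms are combined by linear interpolation, and the weight `alpha` and all function names are hypothetical.

```python
import math

EPS = 1e-7  # clamp predictions so log() stays finite


def binary_cross_entropy(pred, target):
    """Mean binary cross-entropy between two time-frequency masks.

    Both arguments are lists of frames, each frame a list of per-frequency
    mask values in [0, 1]. `target` may be hard (0/1) or soft.
    """
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            p = min(max(p, EPS), 1.0 - EPS)
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
            count += 1
    return total / count


def student_loss(student_mask, hard_mask, teacher_mask, alpha=0.5):
    """Interpolate the original loss with the soft-target loss.

    Assumption: `alpha` is an illustrative interpolation weight,
    not a value taken from the paper.
    """
    original = binary_cross_entropy(student_mask, hard_mask)
    distill = binary_cross_entropy(student_mask, teacher_mask)
    return (1.0 - alpha) * original + alpha * distill


# Toy masks: 2 frames x 2 frequency bins (made-up values).
student = [[0.9, 0.2], [0.6, 0.4]]   # student output, single-channel input
hard = [[1.0, 0.0], [1.0, 0.0]]      # assumed hard target of the original loss
teacher = [[0.8, 0.1], [0.7, 0.3]]   # soft mask from the teacher (beamformed input)
loss = student_loss(student, hard, teacher, alpha=0.5)
```

With `alpha=0` the sketch reduces to the original loss alone, and with `alpha=1` the student is trained only to match the teacher's soft mask.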

Original language: English
Pages (from-to): 3249-3253
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2018-September
DOIs: 10.21437/Interspeech.2018-2440
Publication status: Published - 2018 Jan 1
Externally published: Yes
Event: 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India
Duration: 2018 Sep 2 to 2018 Sep 6

Keywords

  • BLSTM
  • Mask estimation
  • Speech enhancement
  • Speech recognition
  • Student-teacher learning

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

Student-teacher learning for BLSTM mask-based speech enhancement. / Subramanian, Aswin Shanmugam; Chen, Szu Jui; Watanabe, Shinji.

In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2018-September, 01.01.2018, p. 3249-3253.

@article{ef4042567365403c8e155992b891fd2c,
title = "Student-teacher learning for BLSTM mask-based speech enhancement",
keywords = "BLSTM, Mask estimation, Speech enhancement, Speech recognition, Student-teacher learning",
author = "Subramanian, {Aswin Shanmugam} and Chen, {Szu Jui} and Watanabe, Shinji",
year = "2018",
month = "1",
day = "1",
doi = "10.21437/Interspeech.2018-2440",
language = "English",
volume = "2018-September",
pages = "3249--3253",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",
}

TY - JOUR
T1 - Student-teacher learning for BLSTM mask-based speech enhancement
AU - Subramanian, Aswin Shanmugam
AU - Chen, Szu Jui
AU - Watanabe, Shinji
PY - 2018/1/1
Y1 - 2018/1/1
KW - BLSTM
KW - Mask estimation
KW - Speech enhancement
KW - Speech recognition
KW - Student-teacher learning
UR - http://www.scopus.com/inward/record.url?scp=85054958811&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85054958811&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2018-2440
DO - 10.21437/Interspeech.2018-2440
M3 - Conference article
AN - SCOPUS:85054958811
VL - 2018-September
SP - 3249
EP - 3253
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SN - 2308-457X
ER -