Single-channel multi-speaker separation using deep clustering

Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, John R. Hershey

Research output: Contribution to journal › Article

67 Citations (Scopus)

Abstract

Deep clustering is a recently introduced deep learning architecture that uses discriminatively trained embeddings as the basis for clustering. It was recently applied to spectrogram segmentation, resulting in impressive results on speaker-independent multi-speaker separation. In this paper we extend the baseline system with an end-to-end signal approximation objective that greatly improves performance on a challenging speech separation task. We first significantly improve upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline of 6.0 dB for two-speaker separation, as well as a 7.1 dB SDR improvement for three-speaker separation. We then extend the model to incorporate an enhancement layer to refine the signal estimates, and perform end-to-end training through both the clustering and enhancement stages to maximize signal fidelity. We evaluate the results using automatic speech recognition. The new signal approximation objective, combined with end-to-end training, produces unprecedented performance, reducing the word error rate (WER) from 89.1% down to 30.8%. This represents a major advancement towards solving the cocktail party problem.
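
Below is a minimal NumPy sketch of the deep clustering objective that this paper builds on: each time-frequency bin of the mixture spectrogram is mapped to a unit-norm embedding, and the network is trained so that the embedding affinities match the ideal speaker-assignment affinities. The function and variable names are illustrative only, and the sketch omits everything this paper adds on top (the deeper architecture, stronger regularization, the enhancement layer, and the end-to-end signal approximation objective).

```python
# Sketch of the deep clustering objective L = ||V V^T - Y Y^T||_F^2,
# expanded so that no N x N affinity matrix is ever formed explicitly.
# Names and shapes are illustrative, not taken from the paper's code:
#   embeddings : (N, D) one D-dimensional embedding per time-frequency bin
#   assignments: (N, C) one-hot ideal speaker assignment per bin
import numpy as np

def deep_clustering_loss(embeddings: np.ndarray, assignments: np.ndarray) -> float:
    V = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit-norm rows
    Y = assignments.astype(V.dtype)
    # ||V V^T - Y Y^T||_F^2 = ||V^T V||_F^2 - 2 ||V^T Y||_F^2 + ||Y^T Y||_F^2
    vtv = V.T @ V   # (D, D)
    vty = V.T @ Y   # (D, C)
    yty = Y.T @ Y   # (C, C)
    return float(np.sum(vtv ** 2) - 2.0 * np.sum(vty ** 2) + np.sum(yty ** 2))

# Toy usage: 2 speakers, 100 time-frequency bins, 40-dimensional embeddings.
rng = np.random.default_rng(0)
V = rng.standard_normal((100, 40))
Y = np.eye(2)[rng.integers(0, 2, size=100)]
print(deep_clustering_loss(V, Y))
```

In the original deep clustering approach, separation at test time proceeds by clustering the per-bin embeddings (e.g., with K-means) into one group per speaker and using the resulting binary masks to reconstruct each source; the contributions described in this paper refine those estimates and train the whole pipeline end to end.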

Original language: English
Pages (from-to): 545-549
Number of pages: 5
Journal: Interspeech 2016
Volume: 08-12-September-2016
DOIs: https://doi.org/10.21437/Interspeech.2016-1176
Publication status: Published - 2016
Externally published: Yes

Keywords

  • Deep learning
  • Embedding
  • Single-channel speech separation

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

Isik, Y., Le Roux, J., Chen, Z., Watanabe, S., & Hershey, J. R. (2016). Single-channel multi-speaker separation using deep clustering. Interspeech 2016, 545-549. https://doi.org/10.21437/Interspeech.2016-1176
