Deep clustering

Discriminative embeddings for segmentation and separation

John R. Hershey, Zhuo Chen, Jonathan Le Roux, Shinji Watanabe

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

185 Citations (Scopus)

Abstract

We address the problem of "cocktail-party" source separation in a deep learning framework called deep clustering. Previous deep network approaches to separation have shown promising performance in scenarios with a fixed number of sources, each belonging to a distinct signal class, such as speech and noise. However, "class-based" methods are not suitable when the number of sources and their classes are arbitrary. Instead, we train a deep network to assign contrastive embedding vectors to each time-frequency region of the spectrogram in order to implicitly predict the segmentation labels of the target spectrogram from the input mixtures. This yields a deep network-based analogue to spectral clustering, in that the embeddings form a low-rank pairwise affinity matrix that approximates the ideal affinity matrix, while enabling much faster performance. At test time, the clustering step "decodes" the segmentation implicit in the embeddings by optimizing K-means with respect to the unknown assignments. Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB. More dramatically, the same model does surprisingly well with three-speaker mixtures.
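
The two steps described in the abstract, fitting a low-rank affinity to the ideal affinity and then decoding with K-means, can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the squared Frobenius distance between V V^T and Y Y^T is taken as the objective, the embeddings V are random toy data with unit-norm rows rather than the output of a trained network, and the helper names affinity_loss and kmeans_labels are hypothetical.

```python
import numpy as np


def affinity_loss(V, Y):
    """Squared Frobenius distance ||V V^T - Y Y^T||_F^2.

    V: (N, D) embeddings, one row per time-frequency bin.
    Y: (N, C) one-hot source labels for the same bins.
    The expansion below only forms D x D, D x C and C x C matrices,
    so the N x N affinity matrices are never built.
    """
    VtV = V.T @ V
    VtY = V.T @ Y
    YtY = Y.T @ Y
    return np.sum(VtV ** 2) - 2.0 * np.sum(VtY ** 2) + np.sum(YtY ** 2)


def kmeans_labels(V, k, n_iter=50):
    """Plain Lloyd's K-means with farthest-point initialisation (illustrative)."""
    centers = [V[0]]
    for _ in range(1, k):
        # Pick the point farthest from all centers chosen so far.
        d = np.min([np.sum((V - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(V[np.argmax(d)])
    centers = np.stack(centers)
    for _ in range(n_iter):
        dists = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = V[labels == j].mean(axis=0)
    return labels


# Toy stand-in for network output: N bins, D-dimensional embeddings, C sources.
rng = np.random.default_rng(0)
N, D, C = 200, 4, 2
true = rng.integers(0, C, size=N)
Y = np.eye(C)[true]                      # ideal one-hot assignments
prototypes = np.eye(D)[:C]               # one orthogonal direction per source
V = prototypes[true] + 0.05 * rng.standard_normal((N, D))
V /= np.linalg.norm(V, axis=1, keepdims=True)  # assumed unit-norm embeddings

loss = affinity_loss(V, Y)
direct = np.linalg.norm(V @ V.T - Y @ Y.T) ** 2  # same value, N x N cost

labels = kmeans_labels(V, C)             # test-time "decoding" step
masks = np.eye(C)[labels]                # one binary mask per estimated source
```

The low-rank form is what makes the approach cheap: the loss costs on the order of N D^2 operations instead of N^2, and the resulting labels are only defined up to a permutation of the sources, which is why clustering rather than classification is used at test time.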

Original language: English
Title of host publication: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 31-35
Number of pages: 5
Volume: 2016-May
ISBN (Electronic): 9781479999880
DOI: https://doi.org/10.1109/ICASSP.2016.7471631
Publication status: Published - 2016 May 18
Externally published: Yes
Event: 41st IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016 - Shanghai, China
Duration: 2016 Mar 20 to 2016 Mar 25

Other

Other: 41st IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016
Country: China
City: Shanghai
Period: 16/3/20 to 16/3/25

Fingerprint

Source separation
Labels
Experiments
Deep learning

Keywords

  • clustering
  • deep learning
  • embedding
  • speech separation

ASJC Scopus subject areas

  • Signal Processing
  • Software
  • Electrical and Electronic Engineering

Cite this

Hershey, J. R., Chen, Z., Le Roux, J., & Watanabe, S. (2016). Deep clustering: Discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016 - Proceedings (Vol. 2016-May, pp. 31-35). [7471631] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP.2016.7471631

@inproceedings{e1543d973f5442ac8cfa27557d25fd97,
title = "Deep clustering: Discriminative embeddings for segmentation and separation",
keywords = "clustering, deep learning, embedding, speech separation",
author = "Hershey, {John R.} and Zhuo Chen and {Le Roux}, Jonathan and Shinji Watanabe",
year = "2016",
month = "5",
day = "18",
doi = "10.1109/ICASSP.2016.7471631",
language = "English",
volume = "2016-May",
pages = "31--35",
booktitle = "2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016 - Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
address = "United States",

}

TY - GEN

T1 - Deep clustering

T2 - Discriminative embeddings for segmentation and separation

AU - Hershey, John R.

AU - Chen, Zhuo

AU - Le Roux, Jonathan

AU - Watanabe, Shinji

PY - 2016/5/18

Y1 - 2016/5/18

KW - clustering

KW - deep learning

KW - embedding

KW - speech separation

UR - http://www.scopus.com/inward/record.url?scp=84973320590&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84973320590&partnerID=8YFLogxK

U2 - 10.1109/ICASSP.2016.7471631

DO - 10.1109/ICASSP.2016.7471631

M3 - Conference contribution

VL - 2016-May

SP - 31

EP - 35

BT - 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

ER -