Abstract
We address the problem of "cocktail-party" source separation in a deep learning framework called deep clustering. Previous deep network approaches to separation have shown promising performance in scenarios with a fixed number of sources, each belonging to a distinct signal class, such as speech and noise. However, "class-based" methods are not suitable when the number of sources and their classes are arbitrary. Instead, we train a deep network to assign contrastive embedding vectors to each time-frequency region of the spectrogram in order to implicitly predict the segmentation labels of the target spectrogram from the input mixtures. This yields a deep network-based analogue to spectral clustering, in that the embeddings form a low-rank pairwise affinity matrix that approximates the ideal affinity matrix, while enabling much faster performance. At test time, the clustering step "decodes" the segmentation implicit in the embeddings by optimizing K-means with respect to the unknown assignments. Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB. More dramatically, the same model does surprisingly well with three-speaker mixtures.
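The two steps named in the abstract, matching a low-rank affinity matrix during training and decoding with K-means at test time, can be illustrated with a minimal NumPy sketch. It assumes the training objective is the squared Frobenius distance between the embedding affinity `V @ V.T` and the ideal affinity `Y @ Y.T`; the abstract does not state the exact loss, so treat that as an assumption. The names `affinity_loss` and `kmeans_decode`, the farthest-point initialization, and the toy data are all hypothetical and are not taken from the authors' implementation.

```python
import numpy as np


def affinity_loss(V, Y):
    """Assumed objective: ||V V^T - Y Y^T||_F^2, computed in low-rank form.

    V: (N, D) embeddings, one row per time-frequency bin.
    Y: (N, C) one-hot source labels for the same bins.
    """
    # Expanding the norm gives
    #   ||V^T V||_F^2 - 2 ||V^T Y||_F^2 + ||Y^T Y||_F^2,
    # so the N x N affinity matrices are never built explicitly.
    return (np.sum((V.T @ V) ** 2)
            - 2.0 * np.sum((V.T @ Y) ** 2)
            + np.sum((Y.T @ Y) ** 2))


def kmeans_decode(V, n_sources, n_iter=50):
    """Plain Lloyd's K-means over embedding rows; returns one label per bin."""
    # Deterministic farthest-point initialization (an illustrative choice).
    centers = [V[0]]
    for _ in range(1, n_sources):
        d = np.min([np.sum((V - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(V[np.argmax(d)])
    centers = np.array(centers)

    for _ in range(n_iter):
        # Assign each bin to its nearest center.
        dists = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each non-empty center to the mean of its assigned bins.
        for k in range(n_sources):
            if np.any(labels == k):
                centers[k] = V[labels == k].mean(axis=0)
    return labels


# Toy stand-in for network output: noisy, unit-norm copies of the labels.
rng = np.random.default_rng(0)
true_labels = rng.integers(0, 2, size=200)
true_labels[:2] = [0, 1]  # make sure both sources occur
Y = np.eye(2)[true_labels]
V = Y + 0.05 * rng.standard_normal(Y.shape)
V /= np.linalg.norm(V, axis=1, keepdims=True)

loss = affinity_loss(V, Y)
labels = kmeans_decode(V, n_sources=2)
```

In this sketch the recovered `labels` identify the sources only up to a permutation, and a separation system would turn each cluster into a time-frequency mask applied to the mixture spectrogram.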
Original language | English |
---|---|
Title of host publication | 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016 - Proceedings |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 31-35 |
Number of pages | 5 |
Volume | 2016-May |
ISBN (Electronic) | 9781479999880 |
Publication status | Published - 2016 May 18 |
Externally published | Yes |
Event | 41st IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016 - Shanghai, China; Duration: 2016 Mar 20 → 2016 Mar 25 |
Other
Other | 41st IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016 |
---|---|
Country | China |
City | Shanghai |
Period | 16/3/20 → 16/3/25 |
Keywords
- clustering
- deep learning
- embedding
- speech separation
ASJC Scopus subject areas
- Signal Processing
- Software
- Electrical and Electronic Engineering