Multi-channel non-negative matrix factorization (MNMF) is a multi-channel extension of NMF and often outperforms NMF because it can deal with spatial and spectral information simultaneously. On the other hand, MNMF has a larger number of parameters and its performance heavily depends on the initial values. MNMF factorizes an observation matrix into four matrices: spatial correlation, basis, cluster-indicator latent variables, and activation matrices. This paper proposes effective initialization methods for these matrices. First, the spatial correlation matrix, which shows the largest initial value dependencies, is initialized using the cross-spectrum method from enhanced speech by binary masking. Second, when the target is speech, constructing bases from phonemes existing in an utterance can improve the performance: this paper proposes a speech bases selection by using automatic speech recognition (ASR). Third, we also propose an initialization method for the cluster-indicator latent variables that couple the spatial and spectral information, which can achieve the simultaneous optimization of above two matrices. Experiments on a noisy ASR task show that the proposed initialization significantly improves the performance of MNMF by reducing the initial value dependencies.
|ジャーナル||Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH|
|出版ステータス||Published - 2017|
|イベント||18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017 - Stockholm, Sweden|
継続期間: 2017 8 20 → 2017 8 24
ASJC Scopus subject areas