TY - GEN
T1 - Speech representation learning combining conformer CPC with deep cluster for the ZeroSpeech challenge 2021
AU - Maekaku, Takashi
AU - Chang, Xuankai
AU - Fujita, Yuya
AU - Chen, Li Wei
AU - Watanabe, Shinji
AU - Rudnicky, Alexander
N1 - Publisher Copyright:
Copyright © 2021 ISCA.
PY - 2021
Y1 - 2021
N2 - We present a system for the Zero Resource Speech Challenge 2021 that combines Contrastive Predictive Coding (CPC) with deep cluster. In deep cluster, we first prepare pseudo-labels obtained by clustering the outputs of a CPC network with k-means. Then, we train an additional autoregressive model to classify the previously obtained pseudo-labels in a supervised manner. A phoneme-discriminative representation is achieved by executing a second round of clustering on the outputs of the final layer of the autoregressive model. We show that replacing a Transformer layer with a Conformer layer leads to a further gain on a lexical metric. Experimental results show that relative improvements of 35% on a phonetic metric, 1.5% on the lexical metric, and 2.3% on a syntactic metric are achieved compared to a baseline CPC-small model trained on 460 h of LibriSpeech data. We achieve top results in this challenge on the syntactic metric.
AB - We present a system for the Zero Resource Speech Challenge 2021 that combines Contrastive Predictive Coding (CPC) with deep cluster. In deep cluster, we first prepare pseudo-labels obtained by clustering the outputs of a CPC network with k-means. Then, we train an additional autoregressive model to classify the previously obtained pseudo-labels in a supervised manner. A phoneme-discriminative representation is achieved by executing a second round of clustering on the outputs of the final layer of the autoregressive model. We show that replacing a Transformer layer with a Conformer layer leads to a further gain on a lexical metric. Experimental results show that relative improvements of 35% on a phonetic metric, 1.5% on the lexical metric, and 2.3% on a syntactic metric are achieved compared to a baseline CPC-small model trained on 460 h of LibriSpeech data. We achieve top results in this challenge on the syntactic metric.
KW - Conformer
KW - Contrastive predictive coding
KW - Deep cluster
UR - http://www.scopus.com/inward/record.url?scp=85113453540&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85113453540&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-1503
DO - 10.21437/Interspeech.2021-1503
M3 - Conference contribution
AN - SCOPUS:85113453540
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 1066
EP - 1070
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Y2 - 30 August 2021 through 3 September 2021
ER -