TY - GEN
T1 - S3PRL-VC
T2 - 47th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022
AU - Huang, Wen Chin
AU - Yang, Shu Wen
AU - Hayashi, Tomoki
AU - Lee, Hung Yi
AU - Watanabe, Shinji
AU - Toda, Tomoki
N1 - Funding Information:
We suggest different future directions for readers from different communities. From the VC perspective, it is worthwhile to continue investigating better downstream model designs. For instance, in A2A VC, a proper speaker encoder should be used instead of a fixed d-vector. Meanwhile, we encourage using VC as a probing task when designing a new S3R model, given the challenges posed by all the aspects that VC requires. Acknowledgements: We would like to thank the S3PRL/SUPERB team for the fruitful discussions. This work was partly supported by JSPS KAKENHI Grant Number 21J20920, JST CREST Grant Number JPMJCR19A3, and a project, JPNP20006, commissioned by NEDO, Japan.
Publisher Copyright:
© 2022 IEEE
PY - 2022
Y1 - 2022
AB - This paper introduces S3PRL-VC, an open-source voice conversion (VC) framework based on the S3PRL toolkit. In the context of recognition-synthesis VC, self-supervised speech representation (S3R) is valuable in its potential to replace the expensive supervised representation adopted by state-of-the-art VC systems. Moreover, we claim that VC is a good probing task for S3R analysis. In this work, we provide a series of in-depth analyses by benchmarking on the two tasks in VCC2020, namely intra-/cross-lingual any-to-one (A2O) VC, as well as an any-to-any (A2A) setting. We also compare not only different S3Rs but also the top systems in VCC2020 that use supervised representations. Systematic objective and subjective evaluations were conducted, and we show that S3R is comparable with VCC2020 top systems in the A2O setting in terms of similarity, and achieves state-of-the-art performance in S3R-based A2A VC. We believe the extensive analysis, as well as the toolkit itself, contributes to not only the S3R community but also the VC community. The codebase is now open-sourced.
KW - open-source
KW - self-supervised learning
KW - self-supervised speech representation
KW - voice conversion
UR - http://www.scopus.com/inward/record.url?scp=85131236670&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85131236670&partnerID=8YFLogxK
U2 - 10.1109/ICASSP43922.2022.9746430
DO - 10.1109/ICASSP43922.2022.9746430
M3 - Conference contribution
AN - SCOPUS:85131236670
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 6552
EP - 6556
BT - 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 23 May 2022 through 27 May 2022
ER -