TY - JOUR
T1 - Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation
AU - Berrebbi, Dan
AU - Shi, Jiatong
AU - Yan, Brian
AU - López-Francisco, Osbel
AU - Amith, Jonathan
AU - Watanabe, Shinji
N1 - Funding Information:
This work used the Extreme Science and Engineering Discovery Environment (XSEDE) [49], which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges system [50], as part of project cis210027p, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).
Publisher Copyright:
Copyright © 2022 ISCA.
PY - 2022
Y1 - 2022
N2 - Self-Supervised Learning (SSL) models have been successfully applied to various deep learning-based speech tasks, particularly those with a limited amount of data. However, the quality of SSL representations depends heavily on the relatedness between the SSL training domain(s) and the target data domain. In contrast, spectral feature (SF) extractors such as log Mel-filterbanks are hand-crafted, non-learnable components and could be more robust to domain shifts. The present work examines the assumption that combining non-learnable SF extractors with SSL models is an effective approach to low resource speech tasks. We propose a learnable and interpretable framework to combine SF and SSL representations. The proposed framework significantly outperforms both baseline and SSL models on Automatic Speech Recognition (ASR) and Speech Translation (ST) tasks on three low resource datasets. We additionally design a mixture-of-experts-based combination model. This last model reveals that the relative contribution of SSL models over conventional SF extractors is very small in the case of a domain mismatch between the SSL training set and the target language data.
AB - Self-Supervised Learning (SSL) models have been successfully applied to various deep learning-based speech tasks, particularly those with a limited amount of data. However, the quality of SSL representations depends heavily on the relatedness between the SSL training domain(s) and the target data domain. In contrast, spectral feature (SF) extractors such as log Mel-filterbanks are hand-crafted, non-learnable components and could be more robust to domain shifts. The present work examines the assumption that combining non-learnable SF extractors with SSL models is an effective approach to low resource speech tasks. We propose a learnable and interpretable framework to combine SF and SSL representations. The proposed framework significantly outperforms both baseline and SSL models on Automatic Speech Recognition (ASR) and Speech Translation (ST) tasks on three low resource datasets. We additionally design a mixture-of-experts-based combination model. This last model reveals that the relative contribution of SSL models over conventional SF extractors is very small in the case of a domain mismatch between the SSL training set and the target language data.
KW - co-Attention
KW - Low Resource
KW - Mixture of Experts
KW - Self-Supervised Learning
KW - Spectral Features
UR - http://www.scopus.com/inward/record.url?scp=85140059971&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85140059971&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2022-10796
DO - 10.21437/Interspeech.2022-10796
M3 - Conference article
AN - SCOPUS:85140059971
VL - 2022-September
SP - 3533
EP - 3537
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SN - 2308-457X
T2 - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022
Y2 - 18 September 2022 through 22 September 2022
ER -