TY - JOUR
T1 - Text-Only Domain Adaptation Based on Intermediate CTC
AU - Sato, Hiroaki
AU - Komori, Tomoyasu
AU - Mishima, Takeshi
AU - Kawai, Yoshihiko
AU - Mochizuki, Takahiro
AU - Sato, Shoei
AU - Ogawa, Tetsuji
N1 - Publisher Copyright: Copyright © 2022 ISCA.
PY - 2022
Y1 - 2022
N2 - We propose a domain adaptation method that enables connectionist temporal classification (CTC)-based end-to-end (E2E) automatic speech recognition (ASR) models to adapt to a target domain using unpaired text data. The performance of ASR models deteriorates for words and topics not present in the training data, such as the latest news. Although it is difficult to collect paired speech and text data for such subjects, unpaired text data is relatively easy to obtain. Therefore, a domain adaptation method using unpaired text data is proposed for the E2E ASR model based on the intermediate CTC. This model introduces an adaptation branch to embed acoustic and linguistic information in the same latent space, allowing for domain adaptation using unpaired text data of the target domain. Experimental comparisons for multiple out-of-domain settings demonstrate that the proposed text-only domain adaptation achieves a comparable or better performance than the existing shallow-fusion-based domain adaptation, and further performance improvement is achieved by integration with shallow fusion.
KW - domain adaptation
KW - end-to-end speech recognition
KW - non-autoregressive
KW - unpaired text
UR - http://www.scopus.com/inward/record.url?scp=85140060815&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85140060815&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2022-10114
DO - 10.21437/Interspeech.2022-10114
M3 - Conference article
AN - SCOPUS:85140060815
SN - 2308-457X
VL - 2022-September
SP - 2208
EP - 2212
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022
Y2 - 18 September 2022 through 22 September 2022
ER -