TY - JOUR
T1 - Response Timing Estimation for Spoken Dialog System using Dialog Act Estimation
AU - Sakuma, Jin
AU - Fujie, Shinya
AU - Kobayashi, Tetsunori
N1 - Funding Information:
This research is supported by NII CRIS collaborative research program operated by NII CRIS and LINE Corporation.
Publisher Copyright:
Copyright © 2022 ISCA.
PY - 2022
Y1 - 2022
N2 - We propose neural networks for predicting the response timing of spoken dialog systems. Response timing varies depending on the dialog context. This context-dependent response timing is conventionally estimated directly from acoustic event sequences and word sequences extracted from past utterances. Because these sequences vary so widely, large amounts of training data are required to build reliable models. However, there are no large dialog databases annotated with response timings. The proposed method estimates the dialog act of each utterance as an auxiliary task and uses its intermediate states for response timing estimation, in addition to acoustic and linguistic features. Since dialog acts have significantly less variation than word sequences and are closely related to response timing, we expect to be able to construct a highly reliable model even with small training data. We evaluate our approach on the HARPERVALLEYBANK corpus. The experimental results show that the proposed approach is more effective than the conventional approach, which does not use dialog act information for each utterance.
AB - We propose neural networks for predicting the response timing of spoken dialog systems. Response timing varies depending on the dialog context. This context-dependent response timing is conventionally estimated directly from acoustic event sequences and word sequences extracted from past utterances. Because these sequences vary so widely, large amounts of training data are required to build reliable models. However, there are no large dialog databases annotated with response timings. The proposed method estimates the dialog act of each utterance as an auxiliary task and uses its intermediate states for response timing estimation, in addition to acoustic and linguistic features. Since dialog acts have significantly less variation than word sequences and are closely related to response timing, we expect to be able to construct a highly reliable model even with small training data. We evaluate our approach on the HARPERVALLEYBANK corpus. The experimental results show that the proposed approach is more effective than the conventional approach, which does not use dialog act information for each utterance.
KW - dialog act estimation
KW - response timing
KW - spoken dialog system
UR - http://www.scopus.com/inward/record.url?scp=85140051063&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85140051063&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2022-746
DO - 10.21437/Interspeech.2022-746
M3 - Conference article
AN - SCOPUS:85140051063
SN - 2308-457X
VL - 2022-September
SP - 4486
EP - 4490
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022
Y2 - 18 September 2022 through 22 September 2022
ER -