TY - GEN
T1 - Analysis of Multimodal Features for Speaking Proficiency Scoring in an Interview Dialogue
AU - Saeki, Mao
AU - Matsuyama, Yoichi
AU - Kobashikawa, Satoshi
AU - Ogawa, Tetsuji
AU - Kobayashi, Tetsunori
N1 - Funding Information:
This work was partially supported by the joint research with Waseda University Academic Solutions Corporation ("Automatic Tutorial English Class Placement System Based on ...").
Publisher Copyright:
© 2021 IEEE.
PY - 2021/1/19
Y1 - 2021/1/19
N2 - This paper analyzes the effectiveness of different modalities in automated speaking proficiency scoring in an online dialogue task with non-native speakers. The conversational competence of a language learner can be assessed through multimodal behaviors such as speech content, prosody, and visual cues. Although lexical and acoustic features have been widely studied, there has been no study on the use of visual features such as facial expressions and eye gaze. To build an automated speaking proficiency scoring system using multimodal features, we first constructed an online video interview dataset of 210 Japanese learners of English with annotations of their speaking proficiency. We then examined two approaches for incorporating visual features and compared the effectiveness of each modality. Results show that the end-to-end approach with deep neural networks achieves a higher correlation with human scores than the approach based on handcrafted features. The modalities are effective in the order of lexical, acoustic, and visual features.
AB - This paper analyzes the effectiveness of different modalities in automated speaking proficiency scoring in an online dialogue task with non-native speakers. The conversational competence of a language learner can be assessed through multimodal behaviors such as speech content, prosody, and visual cues. Although lexical and acoustic features have been widely studied, there has been no study on the use of visual features such as facial expressions and eye gaze. To build an automated speaking proficiency scoring system using multimodal features, we first constructed an online video interview dataset of 210 Japanese learners of English with annotations of their speaking proficiency. We then examined two approaches for incorporating visual features and compared the effectiveness of each modality. Results show that the end-to-end approach with deep neural networks achieves a higher correlation with human scores than the approach based on handcrafted features. The modalities are effective in the order of lexical, acoustic, and visual features.
KW - BERT (Bidirectional Encoder Representations from Transformers)
KW - Speaking proficiency assessment
KW - multi-modal machine learning
UR - http://www.scopus.com/inward/record.url?scp=85103928400&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85103928400&partnerID=8YFLogxK
U2 - 10.1109/SLT48900.2021.9383590
DO - 10.1109/SLT48900.2021.9383590
M3 - Conference contribution
AN - SCOPUS:85103928400
T3 - 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
SP - 629
EP - 635
BT - 2021 IEEE Spoken Language Technology Workshop, SLT 2021 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE Spoken Language Technology Workshop, SLT 2021
Y2 - 19 January 2021 through 22 January 2021
ER -