Multi-angle lipreading with angle classification-based feature extraction and its application to audio-visual speech recognition

Shinnosuke Isobe, Satoshi Tamura, Satoru Hayamizu, Yuuto Gotoh, Masaki Nose

Research output: Contribution to journal › Article › peer-review

Abstract

Automatic speech recognition (ASR) and visual speech recognition (VSR) have recently been widely researched owing to developments in deep learning. Most VSR research focuses only on frontal face images. In real scenes, however, a VSR system should correctly recognize spoken content not only from frontal faces but also from diagonal or profile faces. In this paper, we propose a novel VSR method that is applicable to faces taken at any angle. First, view classification is carried out to estimate the face angle. Based on the result, feature extraction is then conducted using the best combination of pre-trained feature extraction models. Next, lipreading is carried out using the extracted features. We also developed audio-visual speech recognition (AVSR) that uses the VSR in addition to conventional ASR. Audio results were obtained from ASR, and the audio and visual results were then combined in a decision fusion manner. We evaluated our methods using OuluVS2, a multi-angle audio-visual database. We confirmed that our approach achieved the best performance compared with conventional VSR schemes in a phrase classification task. In addition, we found that our AVSR results are better than both the ASR and VSR results.
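
The decision fusion step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the fusion formula, so the weighted sum of log-probabilities below is an assumption, as are the function name `fuse_decisions`, the parameter `audio_weight`, and the example probability values; none of them are taken from the paper.

```python
import math


def fuse_decisions(audio_probs, visual_probs, audio_weight=0.5):
    """Combine per-class ASR and VSR probabilities (hypothetical helper).

    Assumption: fusion is a weighted sum of log-probabilities; the paper's
    actual rule may differ. Returns (best_class_index, fused_scores).
    """
    if len(audio_probs) != len(visual_probs):
        raise ValueError("audio and visual outputs must cover the same classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")

    eps = 1e-12  # avoids log(0) for classes with zero probability
    scores = [
        audio_weight * math.log(a + eps)
        + (1.0 - audio_weight) * math.log(v + eps)
        for a, v in zip(audio_probs, visual_probs)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Illustrative values only: three phrase classes, made-up probabilities.
audio = [0.6, 0.3, 0.1]
visual = [0.2, 0.7, 0.1]
best_class, fused = fuse_decisions(audio, visual, audio_weight=0.5)
```

In a sketch like this, `audio_weight` controls how much the audio stream is trusted: a value of 1.0 reduces to the ASR decision and 0.0 to the VSR decision.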

Original language: English
Article number: 182
Journal: Future Internet
Volume: 13
Issue number: 7
Publication status: Published - 2021 Jul
Externally published: Yes

Keywords

  • Audiovisual speech recognition
  • Automatic speech recognition
  • Deep learning
  • Multi-angle lipreading
  • View classification
  • Visual speech recognition

ASJC Scopus subject areas

  • Computer Networks and Communications
