Lipreading using convolutional neural network

Kuniaki Noda*, Yuki Yamaguchi, Kazuhiro Nakadai, Hiroshi G. Okuno, Tetsuya Ogata

*Corresponding author for this work

Research output: Conference article, peer-reviewed

61 Citations (Scopus)

Abstract

In recent automatic speech recognition studies, deep learning architecture applications for acoustic modeling have eclipsed conventional sound features such as Mel-frequency cepstral coefficients. However, for visual speech recognition (VSR) studies, handcrafted visual feature extraction mechanisms are still widely utilized. In this paper, we propose to apply a convolutional neural network (CNN) as a visual feature extraction mechanism for VSR. By training a CNN with images of a speaker's mouth area in combination with phoneme labels, the CNN acquires multiple convolutional filters, used to extract visual features essential for recognizing phonemes. Further, by modeling the temporal dependencies of the generated phoneme label sequences, a hidden Markov model in our proposed system recognizes multiple isolated words. Our proposed system is evaluated on an audio-visual speech dataset comprising 300 Japanese words with six different speakers. The evaluation results of our isolated word recognition experiment demonstrate that the visual features acquired by the CNN significantly outperform those acquired by conventional dimensionality compression approaches, including principal component analysis.
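The second stage described in the abstract, a hidden Markov model that turns per-frame phoneme outputs into an isolated-word decision, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes one left-to-right HMM state per phoneme, a fixed self-transition probability (stay_prob), and it uses per-frame phoneme probabilities directly as emission scores, which is a simplification. The lexicon, the frame values, and all function names are hypothetical and chosen only for illustration.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def word_log_likelihood(frames, phonemes, stay_prob=0.6):
    """Forward algorithm for a left-to-right HMM with one state per phoneme.

    frames:    list of dicts mapping phoneme label -> probability for one
               frame (a stand-in for per-frame CNN outputs; invented values).
    phonemes:  the word's phoneme sequence, one HMM state each.
    stay_prob: assumed self-transition probability (not from the paper).
    """
    log_stay = math.log(stay_prob)
    log_move = math.log(1.0 - stay_prob)
    floor = 1e-6  # avoids log(0) when a frame gives a phoneme no mass

    def emit(t, s):
        return math.log(max(frames[t].get(phonemes[s], 0.0), floor))

    n = len(phonemes)
    # Every path must start in the first state.
    alpha = [-math.inf] * n
    alpha[0] = emit(0, 0)
    for t in range(1, len(frames)):
        new_alpha = []
        for s in range(n):
            # Either stay in state s or advance from state s - 1.
            candidates = [alpha[s] + log_stay]
            if s > 0:
                candidates.append(alpha[s - 1] + log_move)
            new_alpha.append(log_sum_exp(candidates) + emit(t, s))
        alpha = new_alpha
    # Every path must end in the last state.
    return alpha[-1]


def recognize(frames, lexicon):
    """Return the lexicon word whose HMM scores the frame sequence highest."""
    return max(lexicon, key=lambda w: word_log_likelihood(frames, lexicon[w]))


# Toy lexicon and frame probabilities, invented for illustration only.
lexicon = {"aka": ["a", "k", "a"], "aki": ["a", "k", "i"]}
frames = [
    {"a": 0.8, "k": 0.1, "i": 0.1},
    {"a": 0.7, "k": 0.2, "i": 0.1},
    {"a": 0.1, "k": 0.8, "i": 0.1},
    {"a": 0.2, "k": 0.1, "i": 0.7},
    {"a": 0.1, "k": 0.1, "i": 0.8},
]
best_word = recognize(frames, lexicon)
```

With these toy values the two candidate words differ only in their final phoneme, and the last frames favor "i", so recognize selects "aki". A real system would estimate the transition and emission parameters from training data rather than fixing them by hand.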

Original language: English
Pages (from-to): 1149-1153
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - Jan 1, 2014
Event: 15th Annual Conference of the International Speech Communication Association: Celebrating the Diversity of Spoken Languages, INTERSPEECH 2014 - Singapore, Singapore
Duration: Sep 14, 2014 to Sep 18, 2014

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation
