Audio-visual speech recognition using deep bottleneck features and high-performance lipreading

Satoshi Tamura, Hiroshi Ninomiya, Norihide Kitaoka, Shin Osuga, Yurie Iribe, Kazuya Takeda, Satoru Hayamizu


30 Citations (Scopus)

Abstract

This paper develops an Audio-Visual Speech Recognition (AVSR) method by (1) exploring high-performance visual features, (2) applying audio and visual deep bottleneck features to improve AVSR performance, and (3) investigating the effectiveness of voice activity detection (VAD) in the visual modality. In our approach, many kinds of visual features are incorporated and subsequently converted into bottleneck features using deep learning. Using the proposed features, we achieved 73.66% lipreading accuracy in a speaker-independent open condition, and about 90% AVSR accuracy on average in noisy environments. In addition, we extracted speech segments from visual features, resulting in 77.80% lipreading accuracy. We found that VAD is useful in both the audio and visual modalities for better lipreading and AVSR.
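The core idea of deep bottleneck features described above is to forward-propagate input features through a trained network and take the activations of a deliberately narrow hidden layer as a compact feature vector. The following is a minimal sketch of that extraction step, not the authors' actual network: the layer sizes, activation function, and random weights are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes (assumption, not from the paper): input feature
# vector -> wide hidden layers -> narrow 40-dim "bottleneck" layer.
sizes = [120, 512, 512, 40, 512]
weights = [rng.standard_normal((a, b)) * 0.1
           for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

def bottleneck_features(x, bottleneck_index=3):
    """Forward-propagate x and return the bottleneck-layer activations.

    In a real system the network would first be trained (e.g. for phoneme
    classification); here the weights are random, so only the mechanics of
    the extraction are shown.
    """
    h = x
    for i, (W, b) in enumerate(zip(weights, biases), start=1):
        h = np.tanh(h @ W + b)
        if i == bottleneck_index:
            return h  # narrow-layer activations = deep bottleneck features

frame = rng.standard_normal(120)  # one frame of input features
dbnf = bottleneck_features(frame)
print(dbnf.shape)  # (40,)
```

The same procedure applies to either modality: audio frames and visual (lip-region) frames each pass through their own trained network, and the two bottleneck vectors are then used as the recognizer's features.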

Original language: English
Host publication title: 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 575-582
Number of pages: 8
ISBN (Electronic): 9789881476807
DOI
Publication status: Published - 19 Feb 2016
Externally published: Yes
Event: 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015 - Hong Kong, Hong Kong
Duration: 16 Dec 2015 - 19 Dec 2015

Publication series

Name: 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015

Other

Other: 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2015
Country/Territory: Hong Kong
City: Hong Kong
Period: 15/12/16 - 15/12/19

ASJC Scopus subject areas

  • Artificial Intelligence
  • Modeling and Simulation
  • Signal Processing

