Audio-visual voice conversion using deep canonical correlation analysis for deep bottleneck features

Satoshi Tamura, Kento Horio, Hajime Endo, Satoru Hayamizu, Tomoki Toda

Research output: Conference article, peer-reviewed

2 Citations (Scopus)

Abstract

This paper proposes Audio-Visual Voice Conversion (AVVC) methods using Deep BottleNeck Features (DBNF) and Deep Canonical Correlation Analysis (DCCA). DBNF has been adopted in several speech applications to obtain better feature representations. DCCA can generate highly correlated features across two views, enhancing the features of one modality based on the other; ideally, it projects the different views into the same vector space. In this work, we first enhance our conventional AVVC scheme by employing the DBNF technique in the visual modality. Second, we apply DCCA to the DBNFs to obtain new, effective visual features. Third, we build a cross-modal voice conversion model that accepts both audio and visual DCCA features. To clarify the effectiveness of these frameworks, we carried out subjective and objective evaluations and compared the proposed methods with conventional ones. Experimental results show that our DBNF- and DCCA-based AVVC successfully improves the quality of converted speech waveforms.
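DCCA, as described above, trains two networks so that their output views are maximally correlated; the training objective is the total canonical correlation between the projected views. Below is a minimal NumPy sketch of that correlation measure, computed on already-projected features. This is an illustrative assumption, not the authors' implementation: the function name, the regularizer `eps`, and the use of plain linear algebra in place of the deep projection networks are all choices made here for clarity.

```python
import numpy as np

def canonical_correlation(H1, H2, eps=1e-8):
    """Total canonical correlation between two views (rows = samples).

    Mirrors the DCCA objective: center each view, then sum the
    singular values of T = S11^{-1/2} @ S12 @ S22^{-1/2}, where
    Sij are the (cross-)covariance matrices of the two views.
    """
    n = H1.shape[0]
    H1 = H1 - H1.mean(axis=0)          # center view 1
    H2 = H2 - H2.mean(axis=0)          # center view 2
    S11 = H1.T @ H1 / (n - 1) + eps * np.eye(H1.shape[1])
    S22 = H2.T @ H2 / (n - 1) + eps * np.eye(H2.shape[1])
    S12 = H1.T @ H2 / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    T = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    # Singular values of T are the canonical correlations (each in [0, 1])
    return np.linalg.svd(T, compute_uv=False).sum()

# Toy usage: an identical pair of 3-D views is perfectly correlated,
# so the total correlation approaches the dimensionality (here, 3).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
print(canonical_correlation(X, X))   # close to 3.0
```

In the DCCA training loop, the two views would be the network outputs for the audio and visual modalities, and the negative of this quantity would serve as the loss to minimize.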

Original language: English
Pages (from-to): 2469-2473
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2018-September
DOI
Publication status: Published - 2018
Externally published: Yes
Event: 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India
Duration: 2 Sept 2018 - 6 Sept 2018

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation
