Audio-visual speech translation with automatic lip synchronization and face tracking based on 3-D head model

Shigeo Morishima, Shin Ogata, Kazumasa Murai, Satoshi Nakamura

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

8 Citations (Scopus)

Abstract

Speech-to-speech translation has been studied to realize natural human communication beyond language barriers. Toward more natural multi-modal communication, visual information such as face and lip movements will be necessary. In this paper, we introduce a multi-modal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's speech motion while synchronizing it to the translated speech. To retain the speaker's facial expression, we replace only the image of the speech organs with a synthesized one, generated from a three-dimensional wire-frame model that is adaptable to any speaker. Our approach enables image synthesis and translation with an extremely small database. We conduct a subjective evaluation by connected-digit discrimination using data with and without audio-visual lip-synchronicity. The results confirm that the proposed audio-visual translation system achieves sufficient quality.

Original language: English
Title of host publication: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2
Publication status: Published - 2002
Externally published: Yes
Event: 2002 IEEE International Conference on Acoustics, Speech and Signal Processing - Orlando, FL, United States
Duration: 2002 May 13 → 2002 May 17

Other

Other: 2002 IEEE International Conference on Acoustics, Speech and Signal Processing
Country: United States
City: Orlando, FL
Period: 02/5/13 → 02/5/17

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Signal Processing
  • Acoustics and Ultrasonics

Cite this

Morishima, S., Ogata, S., Murai, K., & Nakamura, S. (2002). Audio-visual speech translation with automatic lip synchronization and face tracking based on 3-D head model. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings (Vol. 2).

Other files and links

Link to publication in Scopus
http://www.scopus.com/inward/record.url?scp=0036295865&partnerID=8YFLogxK

Link to the citations in Scopus
http://www.scopus.com/inward/citedby.url?scp=0036295865&partnerID=8YFLogxK