Abstract
Speech-to-speech translation has been studied to realize natural human communication across language barriers. Toward more natural multi-modal communication, visual information such as facial and lip movements will also be necessary. We introduce a multi-modal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's speech motion, synchronizing it to the translated speech. To retain the speaker's facial expression, we replace only the image of the speech organs with a synthesized one, generated by a three-dimensional wire-frame model that can be adapted to any speaker. Our approach enables image synthesis and translation with an extremely small database. We conducted subjective evaluation tests using a connected-digit discrimination test on data with and without audio-visual lip synchronization. The results confirm the high quality of the proposed audio-visual translation system and the importance of lip synchronization.
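To make the lip-synchronization step concrete, the sketch below shows one way to time-align mouth shapes with the translated speech: sample the phoneme timeline of the translated utterance at the video frame rate and emit one mouth-shape (viseme) label per frame, which would then drive the synthesized speech-organ image composited onto the face. This is a minimal illustration only; `PHONEME_TO_VISEME`, `PhonemeSegment`, and `viseme_track` are hypothetical names, and the paper's actual system renders the mouth region with a speaker-adapted three-dimensional wire-frame model rather than a lookup table.

```python
from dataclasses import dataclass

# Hypothetical phoneme-to-viseme table (illustrative labels only; the
# paper's system drives a 3-D wire-frame mouth model, not these names).
PHONEME_TO_VISEME = {
    "a": "open", "i": "spread", "u": "rounded",
    "m": "closed", "s": "narrow", "sil": "neutral",
}

@dataclass
class PhonemeSegment:
    phoneme: str
    start: float  # seconds, within the translated speech
    end: float

def viseme_track(segments, fps=30.0):
    """Sample the phoneme timeline at the video frame rate, yielding one
    viseme label per frame so the synthesized mouth region stays in sync
    with the translated audio."""
    duration = max(seg.end for seg in segments)
    n_frames = int(duration * fps)
    track = []
    for f in range(n_frames):
        t = f / fps
        label = "neutral"  # default mouth shape between phonemes
        for seg in segments:
            if seg.start <= t < seg.end:
                label = PHONEME_TO_VISEME.get(seg.phoneme, "neutral")
                break
        track.append(label)
    return track

if __name__ == "__main__":
    # Toy translated utterance /a-m-i/ with leading silence.
    segs = [
        PhonemeSegment("sil", 0.0, 0.1),
        PhonemeSegment("a", 0.1, 0.3),
        PhonemeSegment("m", 0.3, 0.45),
        PhonemeSegment("i", 0.45, 0.7),
    ]
    for frame, viseme in enumerate(viseme_track(segs)):
        print(f"frame {frame:02d}: render mouth shape '{viseme}'")
```

Sampling at the video frame rate rather than per phoneme keeps the mouth shapes locked to the translated audio regardless of how the translation changes the utterance length.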
Original language | English |
---|---|
Title of host publication | Proceedings - 4th IEEE International Conference on Multimodal Interfaces, ICMI 2002 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 241-246 |
Number of pages | 6 |
ISBN (Print) | 0769518346, 9780769518343 |
DOIs | |
Publication status | Published - 2002 |
Externally published | Yes |
Event | 4th IEEE International Conference on Multimodal Interfaces, ICMI 2002 - Pittsburgh, United States. Duration: 2002 Oct 14 → 2002 Oct 16
Other
Other | 4th IEEE International Conference on Multimodal Interfaces, ICMI 2002 |
---|---|
Country/Territory | United States |
City | Pittsburgh |
Period | 2002 Oct 14 → 2002 Oct 16
ASJC Scopus subject areas
- Artificial Intelligence
- Computer Graphics and Computer-Aided Design
- Computer Vision and Pattern Recognition
- Hardware and Architecture