Multimodal translation system using texture-mapped lip-sync images for video mail and automatic dubbing applications

Shigeo Morishima, Satoshi Nakamura

    Research output: Contribution to journal › Article

    4 Citations (Scopus)

    Abstract

    We introduce a multimodal English-to-Japanese and Japanese-to-English translation system that also translates the speaker's lip motion by synchronizing it to the translated speech. The system combines a face synthesis technique that can generate any viseme lip shape with a face tracking technique that can estimate the position and rotation of the speaker's face in an image sequence. To retain the speaker's facial expression, we replace only the image of the speech organs with a synthesized one, generated from a 3D wire-frame model that can be adapted to any speaker. This approach achieves translated image synthesis with an extremely small database. Face motion is tracked in the video by template matching: the translation and rotation of the face are estimated using a 3D personal face model whose texture is captured from a video frame. We also propose a method for customizing the personal face model with a GUI tool. By combining these techniques with translated voice synthesis, automatic multimodal translation suitable for video mail or for automatic dubbing into other languages can be achieved.
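
    As a minimal illustrative sketch of the compositing step the abstract describes, the Python/OpenCV fragment below locates the speaker's mouth region in each video frame and pastes a synthesized viseme image over it. This is not the authors' implementation: the paper estimates the face's translation and rotation with a full 3D personal face model, whereas this sketch simplifies tracking to 2D normalized cross-correlation, and the viseme images, timing schedule, and helper names are all hypothetical.

        # Sketch only: a 2D stand-in for the paper's 3D-model-based tracking.
        import cv2
        import numpy as np

        def track_mouth(frame_gray, template_gray):
            # Normalized cross-correlation; max_loc is the top-left corner
            # of the best match, so the template always fits in the frame.
            res = cv2.matchTemplate(frame_gray, template_gray,
                                    cv2.TM_CCOEFF_NORMED)
            _, _, _, max_loc = cv2.minMaxLoc(res)
            return max_loc

        def composite_viseme(frame, viseme_img, top_left):
            # A feathered elliptical alpha mask hides the seam between the
            # synthesized mouth and the original face texture.
            h, w = viseme_img.shape[:2]
            x, y = top_left
            mask = np.zeros((h, w), np.float32)
            cv2.ellipse(mask, (w // 2, h // 2), (w // 2 - 2, h // 2 - 2),
                        0, 0, 360, 1.0, -1)
            mask = cv2.GaussianBlur(mask, (15, 15), 0)[:, :, None]
            roi = frame[y:y + h, x:x + w].astype(np.float32)
            blend = mask * viseme_img.astype(np.float32) + (1.0 - mask) * roi
            frame[y:y + h, x:x + w] = blend.astype(np.uint8)
            return frame

        def dub_video(in_path, out_path, mouth_template, visemes, schedule):
            # visemes: viseme label -> BGR mouth image, each the same size
            # as mouth_template. schedule: frame index -> viseme label, as
            # derived from the translated speech timing (hypothetical).
            cap = cv2.VideoCapture(in_path)
            fps = cap.get(cv2.CAP_PROP_FPS)
            size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
                    int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
            out = cv2.VideoWriter(out_path,
                                  cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
            tmpl = cv2.cvtColor(mouth_template, cv2.COLOR_BGR2GRAY)
            i = 0
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                if i in schedule:  # other frames keep the original mouth
                    loc = track_mouth(cv2.cvtColor(frame,
                                                   cv2.COLOR_BGR2GRAY), tmpl)
                    frame = composite_viseme(frame, visemes[schedule[i]], loc)
                out.write(frame)
                i += 1
            cap.release()
            out.release()

    In the actual system, the pasted mouth image would come from rendering the texture-mapped 3D wire-frame model at the tracked pose, so the synthesized region follows the head's translation and rotation rather than a fixed 2D window, and the translated audio track would be muxed in afterwards.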

    Original language: English
    Pages (from-to): 1637-1647
    Number of pages: 11
    Journal: EURASIP Journal on Applied Signal Processing
    Volume: 2004
    Issue number: 11
    DOIs: https://doi.org/10.1155/S1110865704404259
    Publication status: Published - 2004 Sep 1

    Fingerprint

    Textures
    Speech synthesis
    Template matching
    Graphical user interfaces
    Wire-frame models

    Keywords

    • Audio-visual speech translation
    • Face tracking with 3D template
    • Lip-sync talking head
    • Personal face model
    • Texture-mapped facial animation
    • Video mail and automatic dubbing

    ASJC Scopus subject areas

    • Electrical and Electronic Engineering
    • Hardware and Architecture
    • Signal Processing

    Cite this

    Multimodal translation system using texture-mapped lip-sync images for video mail and automatic dubbing applications. / Morishima, Shigeo; Nakamura, Satoshi.

    In: EURASIP Journal on Applied Signal Processing, Vol. 2004, No. 11, 01.09.2004, p. 1637-1647.
