Construction of audio-visual speech corpus using motion-capture system and corpus based facial animation

Tatsuo Yotsukura, Shigeo Morishima, Satoshi Nakamura

    Research output: Contribution to journal › Article

    3 Citations (Scopus)

    Abstract

    An accurate audio-visual speech corpus is indispensable for talking-head research. This paper presents our audio-visual speech corpus collection and proposes a head-movement normalization method and a facial motion generation method. The audio-visual corpus contains speech data, facial video data, and the positions and movements of facial organs. The corpus consists of Japanese phoneme-balanced sentences uttered by a female native speaker. Accurate facial capture is realized with an optical motion-capture system: we captured high-resolution 3D data by arranging many markers on the speaker's face. In addition, we propose a method for acquiring facial movements while removing head movements, using an affine transformation to compute the displacements of the facial organs alone. Finally, to make it easy to create facial animation from this motion data, we propose a technique for assigning the captured data to the facial polygon model. Evaluation results demonstrate the effectiveness of the proposed facial motion generation method and show the relationship between the number of markers and the resulting error.
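
    To make the head-movement normalization concrete, the sketch below shows one way the affine-based scheme described above could work: a per-frame affine transform is estimated from markers on rigid parts of the head and inverted, so that only the facial deformation remains; each mesh vertex then follows its nearest marker. This is a minimal illustration under assumed conventions (NumPy arrays, a hand-picked set of rigid markers, nearest-neighbor vertex assignment), not the paper's actual implementation.

    import numpy as np

    def estimate_affine(src, dst):
        """Least-squares 4x4 affine transform mapping src points onto dst.

        src, dst: (N, 3) corresponding 3D marker positions; at least four
        non-coplanar points are needed for a well-posed fit.
        """
        n = src.shape[0]
        src_h = np.hstack([src, np.ones((n, 1))])        # homogeneous (N, 4)
        # Solve src_h @ M.T ~= dst for the 3x4 matrix M in a least-squares sense.
        m, *_ = np.linalg.lstsq(src_h, dst, rcond=None)  # m: (4, 3)
        affine = np.eye(4)
        affine[:3, :] = m.T
        return affine

    def remove_head_motion(frames, rigid_idx, neutral):
        """Cancel per-frame head pose, returning pure facial displacements.

        frames:    (T, N, 3) captured marker trajectories.
        rigid_idx: indices of markers on rigid head regions (assumed known,
                   e.g. forehead and nose bridge).
        neutral:   (N, 3) marker positions in the neutral reference pose.
        """
        displacements = np.empty_like(frames)
        for t, frame in enumerate(frames):
            # Head pose = affine transform carrying the neutral rigid markers
            # onto this frame's rigid markers.
            head_pose = estimate_affine(neutral[rigid_idx], frame[rigid_idx])
            inv = np.linalg.inv(head_pose)
            frame_h = np.hstack([frame, np.ones((frame.shape[0], 1))])
            stabilized = (frame_h @ inv.T)[:, :3]    # undo head motion
            displacements[t] = stabilized - neutral  # facial deformation only
        return displacements

    def retarget_to_mesh(vertices, markers_neutral, frame_displacement):
        """Move each mesh vertex with its nearest marker: a crude stand-in
        for the paper's marker-to-polygon-model assignment."""
        dists = np.linalg.norm(vertices[:, None, :] - markers_neutral[None, :, :],
                               axis=2)               # (V, N) vertex-marker distances
        nearest = np.argmin(dists, axis=1)           # (V,) marker index per vertex
        return vertices + frame_displacement[nearest]

    A production pipeline would blend several nearby markers per vertex rather than snapping to one, but the overall structure (estimate the head pose from rigid markers, invert it, drive the mesh with the residual displacements) is the same.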

    Original language: English
    Pages (from-to): 2477-2483
    Number of pages: 7
    Journal: IEICE Transactions on Information and Systems
    Volume: E88-D
    Issue number: 11
    DOI: 10.1093/ietisy/e88-d.11.2477
    Publication status: Published - Nov 2005

    Keywords

    • Audio-visual corpus
    • Facial animation
    • Motion capture
    • Talking head

    ASJC Scopus subject areas

    • Information Systems
    • Computer Graphics and Computer-Aided Design
    • Software

    Cite this

    Construction of audio-visual speech corpus using motion-capture system and corpus based facial animation. / Yotsukura, Tatsuo; Morishima, Shigeo; Nakamura, Satoshi.

    In: IEICE Transactions on Information and Systems, Vol. E88-D, No. 11, Nov. 2005, p. 2477-2483.

    @article{92090fb8e7124c4081cdc90ed2135434,
    title = "Construction of audio-visual speech corpus using motion-capture system and corpus based facial animation",
    abstract = "An accurate audio-visual speech corpus is indispensable for talking-head research. This paper presents our audio-visual speech corpus collection and proposes a head-movement normalization method and a facial motion generation method. The audio-visual corpus contains speech data, facial video data, and the positions and movements of facial organs. The corpus consists of Japanese phoneme-balanced sentences uttered by a female native speaker. Accurate facial capture is realized with an optical motion-capture system: we captured high-resolution 3D data by arranging many markers on the speaker's face. In addition, we propose a method for acquiring facial movements while removing head movements, using an affine transformation to compute the displacements of the facial organs alone. Finally, to make it easy to create facial animation from this motion data, we propose a technique for assigning the captured data to the facial polygon model. Evaluation results demonstrate the effectiveness of the proposed facial motion generation method and show the relationship between the number of markers and the resulting error.",
    keywords = "Audio-visual corpus, Facial animation, Motion capture, Talking head",
    author = "Tatsuo Yotsukura and Shigeo Morishima and Satoshi Nakamura",
    year = "2005",
    month = nov,
    doi = "10.1093/ietisy/e88-d.11.2477",
    language = "English",
    volume = "E88-D",
    pages = "2477--2483",
    journal = "IEICE Transactions on Information and Systems",
    issn = "0916-8532",
    publisher = "Maruzen Co., Ltd/Maruzen Kabushikikaisha",
    number = "11",
    }
