Lipreading using convolutional neural network

Kuniaki Noda, Yuki Yamaguchi, Kazuhiro Nakadai, Hiroshi G. Okuno, Tetsuya Ogata

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    29 Citations (Scopus)

    Abstract

    In recent automatic speech recognition studies, deep learning architecture applications for acoustic modeling have eclipsed conventional sound features such as Mel-frequency cepstral coefficients. However, for visual speech recognition (VSR) studies, handcrafted visual feature extraction mechanisms are still widely utilized. In this paper, we propose to apply a convolutional neural network (CNN) as a visual feature extraction mechanism for VSR. By training a CNN with images of a speaker's mouth area in combination with phoneme labels, the CNN acquires multiple convolutional filters, used to extract visual features essential for recognizing phonemes. Further, by modeling the temporal dependencies of the generated phoneme label sequences, a hidden Markov model in our proposed system recognizes multiple isolated words. Our proposed system is evaluated on an audio-visual speech dataset comprising 300 Japanese words with six different speakers. The evaluation results of our isolated word recognition experiment demonstrate that the visual features acquired by the CNN significantly outperform those acquired by conventional dimensionality compression approaches, including principal component analysis.
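    To make the two-stage pipeline described in the abstract concrete, below is a minimal PyTorch sketch of the CNN stage only: a small network that maps a grayscale mouth-area frame to phoneme-label posteriors, whose frame-wise outputs would then be decoded by per-word hidden Markov models. The layer sizes, the 32x32 input resolution, and the 40-class phoneme inventory are illustrative assumptions, not the authors' exact settings.

    # Minimal sketch (assumptions noted above); the HMM decoding stage is omitted.
    import torch
    import torch.nn as nn

    NUM_PHONEMES = 40  # assumed size of the phoneme inventory

    class MouthPhonemeCNN(nn.Module):
        """CNN visual feature extractor: grayscale mouth image -> phoneme logits."""
        def __init__(self, num_phonemes: int = NUM_PHONEMES):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=5, padding=2),  # learned visual filters
                nn.ReLU(),
                nn.MaxPool2d(2),                             # 32x32 -> 16x16
                nn.Conv2d(16, 32, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.MaxPool2d(2),                             # 16x16 -> 8x8
            )
            self.classifier = nn.Linear(32 * 8 * 8, num_phonemes)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = self.features(x)
            return self.classifier(h.flatten(1))

    # One video of T mouth-area frames -> a T x NUM_PHONEMES feature sequence.
    # In the paper's setup, this sequence of phoneme-label posteriors would be
    # passed to a hidden Markov model for isolated word recognition.
    frames = torch.randn(75, 1, 32, 32)       # T=75 dummy grayscale frames
    logits = MouthPhonemeCNN()(frames)        # shape: (75, 40)
    posteriors = logits.softmax(dim=-1)       # frame-wise phoneme posteriors
    print(posteriors.shape)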

    Original language: English
    Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
    Publisher: International Speech Communication Association
    Pages: 1149-1153
    Number of pages: 5
    Publication status: Published - 2014
    Event: 15th Annual Conference of the International Speech Communication Association: Celebrating the Diversity of Spoken Languages, INTERSPEECH 2014 - Singapore, Singapore
    Duration: 2014 Sep 14 – 2014 Sep 18

    Other

    Other: 15th Annual Conference of the International Speech Communication Association: Celebrating the Diversity of Spoken Languages, INTERSPEECH 2014
    Country: Singapore
    City: Singapore
    Period: 14/9/14 – 14/9/18


    Keywords

    • Convolutional neural network
    • Lipreading
    • Visual feature extraction

    ASJC Scopus subject areas

    • Language and Linguistics
    • Human-Computer Interaction
    • Signal Processing
    • Software
    • Modelling and Simulation

    Cite this

    Noda, K., Yamaguchi, Y., Nakadai, K., Okuno, H. G., & Ogata, T. (2014). Lipreading using convolutional neural network. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (pp. 1149-1153). International Speech Communication Association.
