Automatic lip reading by using multimodal visual features

Shohei Takahashi, Jun Ohya

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Speech recognition has been studied for a long time, but it does not work well in noisy environments such as cars or trains. In addition, people who are deaf or hard of hearing cannot benefit from audio-based speech recognition. Visual information is therefore also important for recognizing speech automatically: people understand speech not only from audio but also from visual cues such as temporal changes in lip shape. A vision-based speech recognition method could work well in noisy places and could also be useful for people with hearing disabilities. In this paper, we propose an automatic lip-reading method that recognizes speech using multimodal visual information alone, without any audio information. First, an Active Shape Model (ASM) is used to detect and track the face and lips in a video sequence. Second, shape, optical-flow, and spatial-frequency features are extracted from the lip region located by the ASM. Next, the extracted multimodal features are ordered chronologically, and a Support Vector Machine (SVM) is trained to learn and classify the spoken words. Experiments on classifying several words show promising results for the proposed method.
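
    The pipeline described in the abstract (ASM lip tracking; shape, optical-flow, and spatial-frequency features; chronological ordering; SVM classification) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes per-frame lip landmarks are already available from an ASM-style tracker, uses OpenCV's Farneback dense flow in place of whichever optical-flow method the paper employs, and takes a low-frequency 2-D DCT block as the spatial-frequency feature. All names, parameters, and data variables are illustrative.

    import cv2
    import numpy as np
    from sklearn.svm import SVC

    ROI_SIZE = (64, 64)  # lip region resized to a fixed, even size (cv2.dct needs even dimensions)

    def lip_frame_features(prev_roi, roi, landmarks):
        """Per-frame feature vector: lip shape + optical flow + spatial frequency.

        prev_roi, roi : consecutive grayscale (uint8) lip regions, both resized to ROI_SIZE
        landmarks     : (N, 2) array of lip contour points from the tracker
        """
        # Shape feature: lip landmarks, translation-normalized by their centroid.
        shape = (landmarks - landmarks.mean(axis=0)).ravel()

        # Optical-flow feature: dense Farneback flow between consecutive lip ROIs,
        # reduced here to the mean motion vector (an illustrative summary).
        flow = cv2.calcOpticalFlowFarneback(prev_roi, roi, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flow_feat = flow.reshape(-1, 2).mean(axis=0)

        # Spatial-frequency feature: low-frequency 8x8 block of the 2-D DCT of the ROI.
        dct = cv2.dct(np.float32(roi) / 255.0)
        freq_feat = dct[:8, :8].ravel()

        return np.concatenate([shape, flow_feat, freq_feat])

    def word_descriptor(per_frame_features):
        """Concatenate per-frame features in chronological order, one vector per word.
        Assumes each clip has been resampled to a fixed number of frames."""
        return np.concatenate(per_frame_features)

    # Training and classification with a standard SVM (hypothetical data):
    # X = np.stack([word_descriptor(f) for f in training_clips])  # one row per spoken word
    # clf = SVC(kernel="rbf").fit(X, labels)
    # predicted = clf.predict(np.stack([word_descriptor(f) for f in test_clips]))

    Concatenating frames chronologically gives the SVM a fixed-length input that preserves temporal order, matching the abstract's description; in practice the clips would need to be length-normalized before concatenation.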

    Original language: English
    Title of host publication: Proceedings of SPIE - The International Society for Optical Engineering
    Volume: 9025
    DOIs: 10.1117/12.2038375
    ISBN (Print): 9780819499424
    Publication status: Published - 2014
    Event: Intelligent Robots and Computer Vision XXXI: Algorithms and Techniques - San Francisco, CA
    Duration: 2014 Feb 4 - 2014 Feb 6

    Other

    Other: Intelligent Robots and Computer Vision XXXI: Algorithms and Techniques
    City: San Francisco, CA
    Period: 14/2/4 - 14/2/6

    Keywords

    • active shape model
    • face detection
    • lip-reading
    • multimodal features
    • support vector machine

    ASJC Scopus subject areas

    • Applied Mathematics
    • Computer Science Applications
    • Electrical and Electronic Engineering
    • Electronic, Optical and Magnetic Materials
    • Condensed Matter Physics

    Cite this

    Takahashi, S., & Ohya, J. (2014). Automatic lip reading by using multimodal visual features. In Proceedings of SPIE - The International Society for Optical Engineering (Vol. 9025). [902508] https://doi.org/10.1117/12.2038375
