Audio-visual speech recognition using deep learning

Kuniaki Noda, Yuki Yamaguchi, Kazuhiro Nakadai, Hiroshi G. Okuno, Tetsuya Ogata

    Research output: Contribution to journal › Article

    118 Citations (Scopus)

    Abstract

    Audio-visual speech recognition (AVSR) is considered one of the most promising solutions for reliable speech recognition, particularly when the audio is corrupted by noise. However, careful selection of sensory features is crucial for attaining high recognition performance. In the machine-learning community, deep learning approaches have recently attracted increasing attention because deep neural networks can effectively extract robust latent features that enable various recognition algorithms to demonstrate revolutionary generalization capabilities under diverse application conditions. This study introduces a connectionist-hidden Markov model (HMM) system for noise-robust AVSR. First, a deep denoising autoencoder is used to acquire noise-robust audio features. By preparing the training data as pairs of multiple consecutive frames of noise-corrupted audio features and the corresponding clean features, the network is trained to output denoised audio features from noise-corrupted inputs. Second, a convolutional neural network (CNN) is used to extract visual features from raw mouth-area images. By preparing the training data for the CNN as pairs of raw images and the corresponding phoneme labels, the network is trained to predict phoneme labels from mouth-area input images. Finally, a multi-stream HMM (MSHMM) is applied to integrate the audio and visual HMMs, which are trained independently on the respective features. Comparing normal and denoised mel-frequency cepstral coefficients (MFCCs) as audio features for the HMM, our unimodal isolated word recognition results demonstrate that a word recognition rate gain of approximately 65 % is attained with denoised MFCCs under a 10 dB signal-to-noise ratio (SNR) condition for the audio input. Moreover, our multimodal isolated word recognition results using the MSHMM with denoised MFCCs and the acquired visual features demonstrate that an additional word recognition rate gain is attained for SNR conditions below 10 dB.
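    For readers who want a concrete picture of the pipeline outlined above, the following is a minimal sketch, not the authors' code: a deep denoising autoencoder trained on pairs of noise-corrupted and clean MFCC windows, plus the standard multi-stream HMM score combination in which audio and visual log-likelihoods are mixed with weights that sum to one. The feature dimensions (N_MFCC, WINDOW), layer sizes, and the stream weight of 0.7 are illustrative assumptions, and PyTorch is used purely for convenience.

```python
# Illustrative sketch only (not the authors' implementation).
import torch
import torch.nn as nn

N_MFCC = 39   # assumed per-frame feature size (13 MFCCs + deltas + delta-deltas)
WINDOW = 11   # assumed number of consecutive frames stacked into one input
DIM = N_MFCC * WINDOW


class DenoisingAutoencoder(nn.Module):
    """Maps a stacked window of noisy MFCC frames to an estimate of the clean frames."""

    def __init__(self, dim: int = DIM):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(dim, 1024), nn.Sigmoid(),
            nn.Linear(1024, 512), nn.Sigmoid(),
            nn.Linear(512, 256), nn.Sigmoid(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(256, 512), nn.Sigmoid(),
            nn.Linear(512, 1024), nn.Sigmoid(),
            nn.Linear(1024, dim),  # linear output: denoised MFCC window
        )

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(noisy))


def train_step(model, optimizer, noisy_batch, clean_batch):
    # Training pairs: (noise-corrupted MFCC window, clean MFCC window).
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(noisy_batch), clean_batch)
    loss.backward()
    optimizer.step()
    return loss.item()


def mshmm_log_observation(log_b_audio, log_b_visual, lam_audio=0.7):
    # Multi-stream HMM state observation score: a weighted sum of the audio and
    # visual stream log-likelihoods, with the two weights summing to one.
    # The value 0.7 is a placeholder; in practice the weight is tuned per SNR condition.
    return lam_audio * log_b_audio + (1.0 - lam_audio) * log_b_visual
```

    The CNN visual front-end described in the abstract would analogously be trained as a phoneme classifier on mouth-area images, with its outputs serving as the visual stream fed to the MSHMM; that part is omitted here for brevity.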

    Original language: English
    Pages (from-to): 722-737
    Number of pages: 16
    Journal: Applied Intelligence
    Volume: 42
    Issue number: 4
    DOI: 10.1007/s10489-014-0629-7
    Publication status: Published - 1 Jun 2015

    Fingerprint

    Hidden Markov models
    Speech recognition
    Labels
    Signal to noise ratio
    Neural networks
    Acoustic noise
    Learning systems
    Deep learning

    Keywords

    • Audio-visual speech recognition
    • Deep learning
    • Feature extraction
    • Multi-stream HMM

    ASJC Scopus subject areas

    • Artificial Intelligence

    Cite this

    Audio-visual speech recognition using deep learning. / Noda, Kuniaki; Yamaguchi, Yuki; Nakadai, Kazuhiro; Okuno, Hiroshi G.; Ogata, Tetsuya.

    In: Applied Intelligence, Vol. 42, No. 4, 01.06.2015, p. 722-737.

    @article{6924b363d7284b6aa53c04efbf70fd59,
    title = "Audio-visual speech recognition using deep learning",
    keywords = "Audio-visual speech recognition, Deep learning, Feature extraction, Multi-stream HMM",
    author = "Kuniaki Noda and Yuki Yamaguchi and Kazuhiro Nakadai and Okuno, {Hiroshi G.} and Tetsuya Ogata",
    year = "2015",
    month = "6",
    day = "1",
    doi = "10.1007/s10489-014-0629-7",
    language = "English",
    volume = "42",
    pages = "722--737",
    journal = "Applied Intelligence",
    issn = "0924-669X",
    publisher = "Springer Netherlands",
    number = "4",

    }
