Autoencoder based multi-stream combination for noise robust speech recognition

Sri Harish Mallidi, Tetsuji Ogawa, Karel Vesely, Phani S. Nidadavolu, Hynek Hermansky

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    12 Citations (Scopus)

    Abstract

    Performance of automatic speech recognition (ASR) systems degrades rapidly when there is a mismatch between training and test acoustic conditions. Performance can be improved using a multi-stream framework, which combines posterior probabilities from several classifiers (often deep neural networks (DNNs)) trained on different features/streams. Knowledge of the confidence of each of these classifiers on a noisy test utterance can help in devising better techniques for posterior combination than the simple sum and product rules [1]. In this work, we propose to use autoencoders, which are multilayer feed-forward neural networks, to estimate this confidence measure. During the training phase, an autoencoder is trained for each stream on TANDEM features extracted from the corresponding DNN. During the testing phase, we show that the reconstruction error of the autoencoder is correlated with the robustness of the corresponding stream. These error estimates are then used as confidence measures to combine the posterior probabilities generated by each of the streams. Experiments on the Aurora4 and BABEL databases show significant improvements, especially under mismatch between training and test acoustic conditions.
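    The combination scheme the abstract describes — per-stream autoencoder reconstruction error turned into confidence weights for posterior combination — can be sketched as follows. This is a minimal illustration under assumptions, not the paper's exact implementation: the inverse-error weighting rule and all function names here are hypothetical, and the real system operates on TANDEM features from each stream's DNN.

    ```python
    import numpy as np

    def reconstruction_error(autoencoder, features):
        # Mean squared reconstruction error over an utterance.
        # 'autoencoder' is any callable mapping a (frames, dims) feature
        # matrix back into feature space (a stand-in for the trained
        # per-stream autoencoder).
        return float(np.mean((features - autoencoder(features)) ** 2))

    def stream_confidence_weights(errors):
        # Map per-stream reconstruction errors to combination weights:
        # lower error -> higher confidence. Inverse-error weighting is one
        # plausible mapping; the paper may use a different one.
        inv = 1.0 / (np.asarray(errors, dtype=float) + 1e-8)
        return inv / inv.sum()

    def combine_posteriors(posteriors, errors):
        # posteriors: list of (frames, classes) arrays, one per stream.
        # Weighted sum of stream posteriors, renormalized per frame so
        # each row remains a valid probability distribution.
        w = stream_confidence_weights(errors)
        combined = sum(wi * p for wi, p in zip(w, posteriors))
        return combined / combined.sum(axis=1, keepdims=True)
    ```

    In this sketch, a stream whose autoencoder reconstructs the test features poorly (i.e., features far from the training distribution) receives a small weight, so its posteriors contribute less than under a plain sum rule.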

    Original language: English
    Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
    Publisher: International Speech Communication Association
    Pages: 3551-3555
    Number of pages: 5
    Volume: 2015-January
    Publication status: Published - 2015
    Event: 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015 - Dresden, Germany
    Duration: 2015 Sep 6 - 2015 Sep 10

    Other

    Other: 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015
    Country: Germany
    City: Dresden
    Period: 15/9/6 - 15/9/10


    Keywords

    • Computational paralinguistics
    • Human-computer interaction
    • Speech recognition

    ASJC Scopus subject areas

    • Language and Linguistics
    • Human-Computer Interaction
    • Signal Processing
    • Software
    • Modelling and Simulation

    Cite this

    Mallidi, S. H., Ogawa, T., Vesely, K., Nidadavolu, P. S., & Hermansky, H. (2015). Autoencoder based multi-stream combination for noise robust speech recognition. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (Vol. 2015-January, pp. 3551-3555). International Speech Communication Association.
