Stream selection and integration in multistream ASR using GMM-based performance monitoring

Tetsuji Ogawa, Feipeng Li, Hynek Hermansky

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    5 Citations (Scopus)

    Abstract

    A moderately deep and rather wide artificial neural net is applied to phoneme recognition of noisy speech. The net is formed by first estimating posterior probabilities of phonemes in 21 band-limited streams that cover the whole speech spectrum. These 21 streams are divided into three subsets of seven band-limited streams each, by sub-sampling the original 21 streams in different ways. In the second processing stage, all 2^7 − 1 = 127 non-empty combinations of the seven band-limited streams in each subset are formed as inputs to 127 artificial neural nets, which are again trained to yield phoneme posteriors. In this way, 127 × 3 = 381 processing streams are formed. A novel technique for finding the best combination of the resulting 381 parallel processing streams, which uses the likelihood of a single-state Gaussian mixture model of the final classifier output, is applied to select the most efficient streams. The technique is efficient in phoneme recognition of speech that is corrupted by realistic additive noise.
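
The stream counts in the abstract can be checked with a short sketch. The abstract does not say how the 21 streams are sub-sampled, so the interleaved scheme below (every third band, with a different starting offset per subset) is only an assumption, and the function names are hypothetical. The sketch covers the enumeration of streams only, not the neural nets or the GMM-based selection.

```python
from itertools import combinations

NUM_BANDS = 21    # band-limited streams covering the speech spectrum
NUM_SUBSETS = 3   # subsets obtained by sub-sampling the 21 streams


def subsample_bands(num_bands=NUM_BANDS, num_subsets=NUM_SUBSETS):
    """Split band indices into interleaved subsets.

    ASSUMPTION: subset k takes bands k, k+3, k+6, ... The abstract only
    says the subsets are formed by "differently sub-sampling" the bands.
    """
    return [list(range(offset, num_bands, num_subsets))
            for offset in range(num_subsets)]


def nonempty_combinations(bands):
    """Return every non-empty combination of the given bands."""
    return [combo
            for size in range(1, len(bands) + 1)
            for combo in combinations(bands, size)]


def build_streams():
    """Enumerate (subset index, band combination) for every second-stage stream."""
    streams = []
    for subset_id, bands in enumerate(subsample_bands()):
        for combo in nonempty_combinations(bands):
            streams.append((subset_id, combo))
    return streams


if __name__ == "__main__":
    streams = build_streams()
    per_subset = len(nonempty_combinations(subsample_bands()[0]))
    print(per_subset, len(streams))  # 127 381
```

Each subset of seven bands yields 2^7 − 1 = 127 non-empty combinations, so three subsets give the 381 processing streams stated in the abstract, whichever sub-sampling scheme is actually used.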

    Original language: English
    Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
    Publisher: International Speech and Communication Association
    Pages: 3332-3336
    Number of pages: 5
    Publication status: Published - 2013
    Event: 14th Annual Conference of the International Speech Communication Association, INTERSPEECH 2013 - Lyon, France
    Duration: 2013 Aug 25 to 2013 Aug 29

    Fingerprint

    Performance Monitoring
    Monitoring
    Processing
    Neural Nets
    Neural networks
    Additive noise
    Set theory
    Classifiers
    Stream Processing
    Subsampling
    Subset
    Sampling
    Gaussian Mixture Model
    Posterior Probability
    Additive Noise
    Parallel Processing
    Likelihood
    Covering
    Classifier
    Phoneme

    Keywords

    • Gaussian mixture model
    • Multilayer perceptron
    • Multistream speech recognition
    • Performance monitoring

    ASJC Scopus subject areas

    • Language and Linguistics
    • Human-Computer Interaction
    • Signal Processing
    • Software
    • Modelling and Simulation

    Cite this

    Ogawa, T., Li, F., & Hermansky, H. (2013). Stream selection and integration in multistream ASR using GMM-based performance monitoring. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (pp. 3332-3336). International Speech and Communication Association.

    @inproceedings{311559f4a670430d905d97bff2b90aff,
    title = "Stream selection and integration in multistream ASR using GMM-based performance monitoring",
    abstract = "A moderately deep and rather wide artificial neural net is applied in phoneme recognition of noisy speech. The net is formed by first estimating posterior probabilities of phonemes in 21 band-limited streams covering the whole speech spectrum. These 21 band-limited streams are subdivided into three seven band-limited stream subsets, by differently sub-sampling the original 21 band-limited streams. In the second processing stage, all non-empty combinations of seven band-limited streams from each subset are formed as inputs to 127 artificial neural nets that are again trained to yield phoneme posteriors. In this way, 127 × 3 = 381 processing streams are formed. A novel technique for finding the best combination of the resulting 381 parallel processing streams, which uses the likelihood of a single-state Gaussian mixture model of the final classifier output is applied to selecting the most efficient streams. The technique is efficient in phoneme recognition of speech that is corrupted by realistic additive noise.",
    keywords = "Gaussian mixture model, Multilayer perceptron, Multistream speech recognition, Performance monitoring",
    author = "Tetsuji Ogawa and Feipeng Li and Hynek Hermansky",
    year = "2013",
    language = "English",
    pages = "3332--3336",
    booktitle = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
    publisher = "International Speech and Communication Association",
    }

    TY - GEN

    T1 - Stream selection and integration in multistream ASR using GMM-based performance monitoring

    AU - Ogawa, Tetsuji

    AU - Li, Feipeng

    AU - Hermansky, Hynek

    PY - 2013

    Y1 - 2013

    N2 - A moderately deep and rather wide artificial neural net is applied in phoneme recognition of noisy speech. The net is formed by first estimating posterior probabilities of phonemes in 21 band-limited streams covering the whole speech spectrum. These 21 band-limited streams are subdivided into three seven band-limited stream subsets, by differently sub-sampling the original 21 band-limited streams. In the second processing stage, all non-empty combinations of seven band-limited streams from each subset are formed as inputs to 127 artificial neural nets that are again trained to yield phoneme posteriors. In this way, 127 × 3 = 381 processing streams are formed. A novel technique for finding the best combination of the resulting 381 parallel processing streams, which uses the likelihood of a single-state Gaussian mixture model of the final classifier output is applied to selecting the most efficient streams. The technique is efficient in phoneme recognition of speech that is corrupted by realistic additive noise.

    KW - Gaussian mixture model

    KW - Multilayer perceptron

    KW - Multistream speech recognition

    KW - Performance monitoring

    UR - http://www.scopus.com/inward/record.url?scp=84906283768&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84906283768&partnerID=8YFLogxK

    M3 - Conference contribution

    SP - 3332

    EP - 3336

    BT - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

    PB - International Speech and Communication Association

    ER -