Bilinear map of filter-bank outputs for DNN-based speech recognition

Tetsuji Ogawa, Kenshiro Ueda, Kouichi Katsurada, Tetsunori Kobayashi, Tsuneo Nitta

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    1 Citation (Scopus)

    Abstract

    Filter-bank outputs are extended into tensors to yield more precise acoustic features for speech recognition with deep neural networks (DNNs). Filter-bank outputs with temporal context form a time-frequency pattern of speech and have proven effective as input features for DNN-based acoustic models. We project the filter-bank outputs onto a tensor product space by applying decorrelation followed by a bilinear map, aiming to improve acoustic separability in feature extraction. Because the bilinear map yields higher-order correlations among features, this extension makes it possible to capture a more precise structure of the time-frequency pattern. Experimental comparisons on phoneme recognition show that the tensor feature performs comparably to the filter-bank feature, and that fusing the two features improves on either feature alone.
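
    The abstract describes the pipeline only at a high level, so the following is a minimal sketch of one plausible reading: decorrelate a context-stacked filter-bank vector, apply a bilinear map as the outer product of the decorrelated vector with itself, and keep the unique second-order terms as the tensor feature. The PCA-whitening step, the feature sizes (40 channels, +/-5 frames of context), the concatenation-based fusion, and all function names below are illustrative assumptions, not the authors' exact procedure.

    import numpy as np

    def decorrelate(X, dim):
        # PCA whitening as a stand-in for the unspecified decorrelation step (assumption).
        X = X - X.mean(axis=0, keepdims=True)
        cov = np.cov(X, rowvar=False)
        eigval, eigvec = np.linalg.eigh(cov)
        order = np.argsort(eigval)[::-1][:dim]
        W = eigvec[:, order] / np.sqrt(eigval[order] + 1e-8)
        return X @ W  # decorrelated (whitened) features, shape (frames, dim)

    def bilinear_map(z):
        # Outer product z z^T gives all pairwise (second-order) feature products;
        # keeping the upper triangle drops the duplicated symmetric entries.
        outer = np.outer(z, z)
        iu = np.triu_indices(len(z))
        return outer[iu]

    # Toy example: 40 log filter-bank channels stacked over 11 frames (assumed sizes).
    frames = np.random.randn(100, 40 * 11)       # context-stacked filter-bank vectors
    Z = decorrelate(frames, dim=30)              # decorrelate/reduce before the bilinear map
    tensor_feats = np.stack([bilinear_map(z) for z in Z])
    fused = np.hstack([frames, tensor_feats])    # one possible feature-level fusion
    print(tensor_feats.shape, fused.shape)       # (100, 465) (100, 905)

    In this reading, the DNN acoustic model would be trained on either the tensor feature alone or the fused vector; the abstract does not state whether the reported fusion is feature-level or score-level.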

    Original language: English
    Title of host publication: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
    Publisher: International Speech Communication Association
    Pages: 16-20
    Number of pages: 5
    Volume: 2015-January
    Publication status: Published - 2015
    Event: 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015 - Dresden, Germany
    Duration: 2015 Sep 6 - 2015 Sep 10

    Other

    Other: 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015
    Country: Germany
    City: Dresden
    Period: 15/9/6 - 15/9/10

    Keywords

    • Bilinear map
    • Deep neural network
    • Feature extraction
    • Speech recognition
    • Tensor

    ASJC Scopus subject areas

    • Language and Linguistics
    • Human-Computer Interaction
    • Signal Processing
    • Software
    • Modelling and Simulation

    Cite this

    Ogawa, T., Ueda, K., Katsurada, K., Kobayashi, T., & Nitta, T. (2015). Bilinear map of filter-bank outputs for DNN-based speech recognition. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH (Vol. 2015-January, pp. 16-20). International Speech Communication Association.
