Sound source separation for robot audition using deep learning

Kuniaki Noda, Naoya Hashimoto, Kazuhiro Nakadai, Tetsuya Ogata

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    1 Citation (Scopus)

    Abstract

    Noise-robust speech recognition is crucial for effective human-machine interaction in real-world environments. Sound source separation (SSS) is one of the most widely used approaches to noise-robust speech recognition: it extracts a target speaker's speech signal while suppressing simultaneous unintended signals. However, conventional SSS algorithms, such as independent component analysis or nonlinear principal component analysis, are limited in their ability to model complex projections scalably. Moreover, conventional systems require a separately designed noise reduction (NR) subsystem in addition to SSS. To overcome these issues, we propose a deep neural network (DNN) framework for modeling the separation function (SF) of an SSS system. By training a DNN to predict the clean sound features of a target sound from the corresponding multichannel corrupted sound features, we enable the DNN to model the SF for extracting the target sound without prior knowledge of the acoustic properties of the surrounding environment. Moreover, the same DNN is trained to function simultaneously as an NR filter. Our proposed SSS system is evaluated on an isolated word recognition task and a large-vocabulary continuous speech recognition task in which either nondirectional or directional noise is mixed into the target speech. The evaluation results demonstrate that the DNN performs noticeably better than the baseline approach, especially when directional noise is mixed in at a low signal-to-noise ratio.
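
    The sketch below illustrates the general idea described in the abstract: a single DNN trained by regression to map multichannel corrupted sound features to the clean features of the target source, so that source separation and noise reduction are learned jointly by one network. This is a minimal PyTorch sketch, not the paper's implementation; the channel count, feature dimension, context window, layer sizes, and optimizer are illustrative assumptions, since this record does not specify the actual configuration.

        import torch
        import torch.nn as nn

        # Assumed setup (not from the paper): 8 microphones, 40 spectral
        # features per channel, 11 frames of temporal context on the input.
        N_CHANNELS, N_MELS, CONTEXT = 8, 40, 11
        IN_DIM = N_CHANNELS * N_MELS * CONTEXT  # multichannel corrupted input
        OUT_DIM = N_MELS                        # clean single-source target

        # A plain feedforward network standing in for the separation function (SF).
        model = nn.Sequential(
            nn.Linear(IN_DIM, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, OUT_DIM),
        )
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
        # Regressing to clean features makes one network handle SSS and NR together.
        criterion = nn.MSELoss()

        def train_step(noisy, clean):
            """One gradient step: multichannel noisy features -> clean target features."""
            optimizer.zero_grad()
            loss = criterion(model(noisy), clean)
            loss.backward()
            optimizer.step()
            return loss.item()

        # Toy call with random tensors standing in for real (noisy, clean) feature pairs.
        print(train_step(torch.randn(32, IN_DIM), torch.randn(32, OUT_DIM)))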

    Original language: English
    Title of host publication: IEEE-RAS International Conference on Humanoid Robots
    Publisher: IEEE Computer Society
    Pages: 389-394
    Number of pages: 6
    Volume: 2015-December
    ISBN (Print): 9781479968855
    DOI: 10.1109/HUMANOIDS.2015.7363579
    Publication status: Published - 2015 Dec 22
    Event: 15th IEEE RAS International Conference on Humanoid Robots, Humanoids 2015 - Seoul, Korea, Republic of
    Duration: 2015 Nov 3 – 2015 Nov 5

    Other

    Other: 15th IEEE RAS International Conference on Humanoid Robots, Humanoids 2015
    Country: Korea, Republic of
    City: Seoul
    Period: 15/11/3 – 15/11/5

    Fingerprint

    • Source separation
    • Audition
    • Acoustic waves
    • Robots
    • Noise abatement
    • Speech recognition
    • Acoustic noise
    • Continuous speech recognition
    • Deep learning
    • Acoustic properties
    • Independent component analysis
    • Principal component analysis
    • Scalability
    • Signal to noise ratio
    • Deep neural networks

    Keywords

    • Feature extraction
    • Microphones
    • Neural networks
    • Robots
    • Speech
    • Speech recognition
    • Training

    ASJC Scopus subject areas

    • Artificial Intelligence
    • Computer Vision and Pattern Recognition
    • Hardware and Architecture
    • Human-Computer Interaction
    • Electrical and Electronic Engineering

    Cite this

    Noda, K., Hashimoto, N., Nakadai, K., & Ogata, T. (2015). Sound source separation for robot audition using deep learning. In IEEE-RAS International Conference on Humanoid Robots (Vol. 2015-December, pp. 389-394). [7363579] IEEE Computer Society. https://doi.org/10.1109/HUMANOIDS.2015.7363579
