Speech Enhancement Based on Bayesian Low-Rank and Sparse Decomposition of Multichannel Magnitude Spectrograms

Yoshiaki Bando, Katsutoshi Itoyama, Masashi Konyo, Satoshi Tadokoro, Kazuhiro Nakadai, Kazuyoshi Yoshii, Tatsuya Kawahara, Hiroshi G. Okuno

    Research output: Contribution to journal (Article)

    6 Citations (Scopus)

    Abstract

    This paper presents a blind multichannel speech enhancement method that can deal with a time-varying layout of microphones and sound sources. Because nonnegative tensor factorization (NTF) separates a multichannel magnitude (or power) spectrogram into source spectrograms without using phase information, it is robust against a time-varying mixing system. NTF, however, requires prior information such as the spectral bases (templates) of each source spectrogram. To solve this problem, we develop a Bayesian model called Bayesian robust NTF (Bayesian RNTF) that decomposes a multichannel magnitude spectrogram into target speech and noise spectrograms based on their sparseness and low-rankness. Bayesian RNTF is applied to the challenging task of speech enhancement for a microphone array distributed on a hose-shaped rescue robot. When the robot searches for victims under collapsed buildings, the layout of the microphones changes over time, and some of them often fail to capture the target speech. Our method works robustly in such situations because it tolerates a time-varying mixing system. Experiments using a 3-m hose-shaped rescue robot with eight microphones show that the proposed method outperforms conventional blind methods by 1.03 dB in signal-to-noise ratio.
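The low-rank plus sparse idea in the abstract can be illustrated with a much simpler, non-Bayesian stand-in. The sketch below is not the authors' Bayesian RNTF: it assumes a Euclidean loss, an L1 penalty on the sparse part, and multiplicative updates, and it flattens the channel axis into the frequency axis instead of modeling a tensor. The function name `robust_nmf`, its parameters (`rank`, `lam`, `n_iter`), and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np

def robust_nmf(X, rank=2, lam=0.5, n_iter=200, eps=1e-12, seed=0):
    """Approximate a nonnegative matrix X as W @ H (low rank) + S (sparse).

    Minimizes 0.5 * ||X - W H - S||_F^2 + lam * sum(S) subject to
    W, H, S >= 0, using multiplicative updates (an illustrative choice,
    not the inference scheme of the paper).
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    W = rng.random((n_rows, rank)) + eps
    H = rng.random((rank, n_cols)) + eps
    S = rng.random((n_rows, n_cols)) + eps
    losses = []
    for _ in range(n_iter):
        R = W @ H + S                      # current reconstruction
        W *= (X @ H.T) / (R @ H.T + eps)   # update spectral bases
        R = W @ H + S
        H *= (W.T @ X) / (W.T @ R + eps)   # update activations
        R = W @ H + S
        S *= X / (R + lam + eps)           # L1 penalty shrinks the sparse part
        R = W @ H + S
        losses.append(0.5 * np.sum((X - R) ** 2) + lam * np.sum(S))
    return W, H, S, losses

# Synthetic multichannel magnitude spectrogram (M channels, F bins, T frames):
# rank-2 "noise" scaled by a per-channel gain, plus a few large "speech" bins.
rng = np.random.default_rng(1)
M, F, T = 4, 32, 60
noise = rng.random((F, 2)) @ rng.random((2, T))
gains = rng.random(M) + 0.5
speech = np.zeros((F, T))
mask = rng.random((F, T)) < 0.05
speech[mask] = 5.0 * rng.random(mask.sum())
X = gains[:, None, None] * noise + speech      # shape (M, F, T)

# Stack channels along the frequency axis and decompose.
W, H, S, losses = robust_nmf(X.reshape(M * F, T), rank=2)
speech_est = S.reshape(M, F, T)                # sparse component per channel
print(f"objective: first iteration {losses[0]:.2f}, last iteration {losses[-1]:.2f}")
```

Working on magnitudes only, as here, sidesteps phase differences between microphones, which is the property the abstract credits for robustness to a changing mixing system. A fixed per-channel gain is assumed in the synthetic data purely to keep the example short.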

    Original language: English
    Pages (from-to): 215-230
    Number of pages: 16
    Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
    Volume: 26
    Issue number: 2
    DOI: 10.1109/TASLP.2017.2772340
    Publication status: Published - 2018 Feb 1

    Keywords

    • Bayesian signal processing
    • low-rank and sparse decomposition
    • Multichannel speech enhancement

    ASJC Scopus subject areas

    • Signal Processing
    • Media Technology
    • Instrumentation
    • Acoustics and Ultrasonics
    • Linguistics and Language
    • Electrical and Electronic Engineering
    • Speech and Hearing

    Cite this

    Speech Enhancement Based on Bayesian Low-Rank and Sparse Decomposition of Multichannel Magnitude Spectrograms. / Bando, Yoshiaki; Itoyama, Katsutoshi; Konyo, Masashi; Tadokoro, Satoshi; Nakadai, Kazuhiro; Yoshii, Kazuyoshi; Kawahara, Tatsuya; Okuno, Hiroshi G.

    In: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 26, No. 2, 01.02.2018, p. 215-230.

    Bando, Yoshiaki ; Itoyama, Katsutoshi ; Konyo, Masashi ; Tadokoro, Satoshi ; Nakadai, Kazuhiro ; Yoshii, Kazuyoshi ; Kawahara, Tatsuya ; Okuno, Hiroshi G. / Speech Enhancement Based on Bayesian Low-Rank and Sparse Decomposition of Multichannel Magnitude Spectrograms. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2018 ; Vol. 26, No. 2. pp. 215-230.
    @article{c53be4deb24e44a4843adf1818c3979c,
    title = "Speech Enhancement Based on Bayesian Low-Rank and Sparse Decomposition of Multichannel Magnitude Spectrograms",
    abstract = "This paper presents a blind multichannel speech enhancement method that can deal with a time-varying layout of microphones and sound sources. Because nonnegative tensor factorization (NTF) separates a multichannel magnitude (or power) spectrogram into source spectrograms without using phase information, it is robust against a time-varying mixing system. NTF, however, requires prior information such as the spectral bases (templates) of each source spectrogram. To solve this problem, we develop a Bayesian model called Bayesian robust NTF (Bayesian RNTF) that decomposes a multichannel magnitude spectrogram into target speech and noise spectrograms based on their sparseness and low-rankness. Bayesian RNTF is applied to the challenging task of speech enhancement for a microphone array distributed on a hose-shaped rescue robot. When the robot searches for victims under collapsed buildings, the layout of the microphones changes over time, and some of them often fail to capture the target speech. Our method works robustly in such situations because it tolerates a time-varying mixing system. Experiments using a 3-m hose-shaped rescue robot with eight microphones show that the proposed method outperforms conventional blind methods by 1.03 dB in signal-to-noise ratio.",
    keywords = "Bayesian signal processing, low-rank and sparse decomposition, Multichannel speech enhancement",
    author = "Yoshiaki Bando and Katsutoshi Itoyama and Masashi Konyo and Satoshi Tadokoro and Kazuhiro Nakadai and Kazuyoshi Yoshii and Tatsuya Kawahara and Okuno, {Hiroshi G.}",
    year = "2018",
    month = "2",
    day = "1",
    doi = "10.1109/TASLP.2017.2772340",
    language = "English",
    volume = "26",
    pages = "215--230",
    journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing",
    issn = "2329-9290",
    publisher = "IEEE",
    number = "2",

    }

    TY - JOUR

    T1 - Speech Enhancement Based on Bayesian Low-Rank and Sparse Decomposition of Multichannel Magnitude Spectrograms

    AU - Bando, Yoshiaki

    AU - Itoyama, Katsutoshi

    AU - Konyo, Masashi

    AU - Tadokoro, Satoshi

    AU - Nakadai, Kazuhiro

    AU - Yoshii, Kazuyoshi

    AU - Kawahara, Tatsuya

    AU - Okuno, Hiroshi G.

    PY - 2018/2/1

    Y1 - 2018/2/1

    N2 - This paper presents a blind multichannel speech enhancement method that can deal with a time-varying layout of microphones and sound sources. Because nonnegative tensor factorization (NTF) separates a multichannel magnitude (or power) spectrogram into source spectrograms without using phase information, it is robust against a time-varying mixing system. NTF, however, requires prior information such as the spectral bases (templates) of each source spectrogram. To solve this problem, we develop a Bayesian model called Bayesian robust NTF (Bayesian RNTF) that decomposes a multichannel magnitude spectrogram into target speech and noise spectrograms based on their sparseness and low-rankness. Bayesian RNTF is applied to the challenging task of speech enhancement for a microphone array distributed on a hose-shaped rescue robot. When the robot searches for victims under collapsed buildings, the layout of the microphones changes over time, and some of them often fail to capture the target speech. Our method works robustly in such situations because it tolerates a time-varying mixing system. Experiments using a 3-m hose-shaped rescue robot with eight microphones show that the proposed method outperforms conventional blind methods by 1.03 dB in signal-to-noise ratio.

    AB - This paper presents a blind multichannel speech enhancement method that can deal with a time-varying layout of microphones and sound sources. Because nonnegative tensor factorization (NTF) separates a multichannel magnitude (or power) spectrogram into source spectrograms without using phase information, it is robust against a time-varying mixing system. NTF, however, requires prior information such as the spectral bases (templates) of each source spectrogram. To solve this problem, we develop a Bayesian model called Bayesian robust NTF (Bayesian RNTF) that decomposes a multichannel magnitude spectrogram into target speech and noise spectrograms based on their sparseness and low-rankness. Bayesian RNTF is applied to the challenging task of speech enhancement for a microphone array distributed on a hose-shaped rescue robot. When the robot searches for victims under collapsed buildings, the layout of the microphones changes over time, and some of them often fail to capture the target speech. Our method works robustly in such situations because it tolerates a time-varying mixing system. Experiments using a 3-m hose-shaped rescue robot with eight microphones show that the proposed method outperforms conventional blind methods by 1.03 dB in signal-to-noise ratio.

    KW - Bayesian signal processing

    KW - low-rank and sparse decomposition

    KW - Multichannel speech enhancement

    UR - http://www.scopus.com/inward/record.url?scp=85034267108&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=85034267108&partnerID=8YFLogxK

    U2 - 10.1109/TASLP.2017.2772340

    DO - 10.1109/TASLP.2017.2772340

    M3 - Article

    AN - SCOPUS:85034267108

    VL - 26

    SP - 215

    EP - 230

    JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing

    JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing

    SN - 2329-9290

    IS - 2

    ER -