A purely end-to-end system for multi-speaker speech recognition

Hiroshi Seki, Takaaki Hori, Shinji Watanabe, Jonathan Le Roux, John R. Hershey

Research output: Conference contribution

9 Citations (Scopus)

Abstract

Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner. We further propose a new objective function to improve the contrast between the hidden vectors to avoid generating similar hypotheses. Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1% relative improvement compared to a model trained without the proposed objective. Interestingly, the results are comparable to those produced by previous end-to-end works featuring explicit separation and recognition modules.

Original language: English
Title of host publication: ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)
Publisher: Association for Computational Linguistics (ACL)
Pages: 2620-2630
Number of pages: 11
ISBN (Electronic): 9781948087322
Publication status: Published - 1 Jan 2018
Externally published: Yes
Event: 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018 - Melbourne, Australia
Duration: 15 Jul 2018 to 20 Jul 2018

Publication series

Name: ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)
Volume: 1

Conference

Conference: 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018
Country: Australia
City: Melbourne
Period: 15 Jul 2018 to 20 Jul 2018

Fingerprint

Speech recognition
Labels
Source separation

ASJC Scopus subject areas

  • Software
  • Computational Theory and Mathematics

Cite this

Seki, H., Hori, T., Watanabe, S., Le Roux, J., & Hershey, J. R. (2018). A purely end-to-end system for multi-speaker speech recognition. In ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers) (pp. 2620-2630). (ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers); Vol. 1). Association for Computational Linguistics (ACL).

@inproceedings{7b765a2aa5594b449e781d6dba8b34e9,
title = "A purely end-to-end system for multi-speaker speech recognition",
abstract = "Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner. We further propose a new objective function to improve the contrast between the hidden vectors to avoid generating similar hypotheses. Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1{\%} relative improvement compared to a model trained without the proposed objective. Interestingly, the results are comparable to those produced by previous end-to-end works featuring explicit separation and recognition modules.",
author = "Hiroshi Seki and Takaaki Hori and Shinji Watanabe and {Le Roux}, Jonathan and Hershey, {John R.}",
year = "2018",
month = "1",
day = "1",
language = "English",
series = "ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)",
publisher = "Association for Computational Linguistics (ACL)",
pages = "2620--2630",
booktitle = "ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)",

}

TY - GEN

T1 - A purely end-to-end system for multi-speaker speech recognition

AU - Seki, Hiroshi

AU - Hori, Takaaki

AU - Watanabe, Shinji

AU - Le Roux, Jonathan

AU - Hershey, John R.

PY - 2018/1/1

Y1 - 2018/1/1

AB - Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner. We further propose a new objective function to improve the contrast between the hidden vectors to avoid generating similar hypotheses. Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1% relative improvement compared to a model trained without the proposed objective. Interestingly, the results are comparable to those produced by previous end-to-end works featuring explicit separation and recognition modules.

UR - http://www.scopus.com/inward/record.url?scp=85063098725&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85063098725&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85063098725

T3 - ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)

SP - 2620

EP - 2630

BT - ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers)

PB - Association for Computational Linguistics (ACL)

ER -