An End-to-End Language-Tracking Speech Recognizer for Mixed-Language Speech

Hiroshi Seki, Shinji Watanabe, Takaaki Hori, Jonathan Le Roux, John R. Hershey

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

7 Citations (Scopus)

Abstract

End-to-end automatic speech recognition (ASR) can significantly reduce the burden of developing ASR systems for new languages, by eliminating the need for linguistic information such as pronunciation dictionaries. This also creates an opportunity to build a monolithic multilingual ASR system with a language-independent neural network architecture. In our previous work, we proposed a monolithic neural network architecture that can recognize multiple languages, and showed its effectiveness compared with conventional language-dependent models. However, the model is not guaranteed to properly handle switches in language within an utterance, thus lacking the flexibility to recognize mixed-language speech such as code-switching. In this paper, we extend our model to enable dynamic tracking of the language within an utterance, and propose a training procedure that takes advantage of a newly created mixed-language speech corpus. Experimental results show that the extended model outperforms both language-dependent models and our previous model without suffering from performance degradation that could be associated with language switching.
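The abstract describes training on a newly created mixed-language corpus so that a single character-level model can track language switches within an utterance. A minimal sketch of that data-construction idea follows, assuming mixed-language training examples are built by concatenating utterances from different languages and inserting a language-ID token into the target sequence at each switch point. The token format (`[EN]`, `[JP]`, …), function name, and corpus layout are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: build a synthetic mixed-language training example by
# concatenating utterances from different languages and inserting a
# language-ID token into the character-level target at each switch point.
# Token format and data layout are assumptions for illustration only.
import random

def make_mixed_utterance(corpus, num_segments=2, seed=None):
    """Concatenate utterances sampled from distinct languages.

    corpus: dict mapping language code -> list of (features, transcript)
            pairs, where features is a list of acoustic frame vectors.
    Returns (features, target) for one synthetic mixed-language utterance.
    """
    rng = random.Random(seed)
    langs = rng.sample(sorted(corpus), k=num_segments)
    features, target = [], []
    for lang in langs:
        feats, text = rng.choice(corpus[lang])
        features.extend(feats)      # concatenate acoustic frames
        target.append(f"[{lang}]")  # language-ID token marks the switch
        target.extend(list(text))   # character-level targets follow
    return features, target

# Toy two-language corpus with dummy one-dimensional "frames".
corpus = {
    "EN": [([[0.1], [0.2]], "hi")],
    "JP": [([[0.3]], "こんにちは")],
}
feats, target = make_mixed_utterance(corpus, seed=0)
```

Training on such concatenated examples is what lets the extended model learn to emit a new language ID mid-utterance rather than committing to a single language per utterance.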

Original language: English
Title of host publication: 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 4919-4923
Number of pages: 5
Volume: 2018-April
ISBN (Print): 9781538646588
DOI: 10.1109/ICASSP.2018.8462180
Publication status: Published - 10 Sep 2018
Externally published: Yes
Event: 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Calgary, Canada
Duration: 15 Apr 2018 - 20 Apr 2018



Keywords

  • End-to-end ASR
  • Hybrid attention/CTC
  • Language identification
  • Language-independent architecture
  • Multilingual ASR

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering

Cite this

Seki, H., Watanabe, S., Hori, T., Le Roux, J., & Hershey, J. R. (2018). An End-to-End Language-Tracking Speech Recognizer for Mixed-Language Speech. In 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2018 - Proceedings (Vol. 2018-April, pp. 4919-4923). [8462180] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP.2018.8462180

