Language independent end-to-end architecture for joint language identification and speech recognition

Shinji Watanabe, Takaaki Hori, John R. Hershey

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

21 Citations (Scopus)

Abstract

End-to-end automatic speech recognition (ASR) can significantly reduce the burden of developing ASR systems for new languages by eliminating the need for linguistic information such as pronunciation dictionaries. This also creates an opportunity, which we fully exploit in this paper, to build a monolithic multilingual ASR system with a language-independent neural network architecture. We present a model that can recognize speech in 10 different languages by directly performing grapheme (character/chunked-character) based speech recognition. The model is based on our hybrid attention/connectionist temporal classification (CTC) architecture, which has previously been shown to achieve state-of-the-art performance on several ASR benchmarks. Here we augment its set of output symbols to include the union of the character sets appearing in all the target languages: the Roman and Cyrillic alphabets, Arabic numerals, simplified Chinese characters, and Japanese Kanji/Hiragana/Katakana (5,500 characters in all). This allows training of a single multilingual model whose parameters are shared across all the languages. The model can jointly identify the language and recognize the speech, automatically formatting the recognized text in the appropriate character set. Experiments on speech databases comprising Wall Street Journal (English), the Corpus of Spontaneous Japanese, HKUST Mandarin CTS, and Voxforge (German, Spanish, French, Italian, Dutch, Portuguese, Russian) demonstrate performance comparable or superior to that of language-dependent end-to-end ASR systems.
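The character-set-union idea described in the abstract can be sketched as follows. This is a minimal illustration only: the corpora, symbol names, and helper function below are assumptions for demonstration, not details taken from the paper.

```python
# Sketch: building a language-independent output vocabulary as the
# union of per-language grapheme (character) sets, plus the special
# symbols a hybrid CTC/attention decoder typically needs.
# Corpus contents here are illustrative, not from the paper.
corpora = {
    "en": ["the cat sat"],
    "ru": ["привет мир"],
    "ja": ["音声認識"],
}

def build_union_vocab(corpora, specials=("<blank>", "<sos>", "<eos>")):
    """Map the union of all characters across languages, together with
    special symbols, to integer ids shared by every language."""
    chars = sorted({ch for texts in corpora.values()
                       for text in texts
                       for ch in text})
    symbols = list(specials) + chars
    return {sym: i for i, sym in enumerate(symbols)}

vocab = build_union_vocab(corpora)
```

Because every language's graphemes live in one shared id space, a single model can emit text in whichever script matches the utterance, which is what lets it identify the language and transcribe it jointly.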

Original language: English
Title of host publication: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 265-271
Number of pages: 7
Volume: 2018-January
ISBN (Electronic): 9781509047888
DOI: 10.1109/ASRU.2017.8268945
Publication status: Published - 2018 Jan 24
Externally published: Yes
Event: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Okinawa, Japan
Duration: 2017 Dec 16 - 2017 Dec 20

Other

Other: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017
Country: Japan
City: Okinawa
Period: 17/12/16 - 17/12/20

Fingerprint

Speech recognition
Character sets
Glossaries
Network architecture
Linguistics
Neural networks
Experiments

Keywords

  • End-to-end ASR
  • hybrid attention/CTC
  • language identification
  • language-independent architecture
  • multilingual ASR

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
  • Human-Computer Interaction

Cite this

Watanabe, S., Hori, T., & Hershey, J. R. (2018). Language independent end-to-end architecture for joint language identification and speech recognition. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings (Vol. 2018-January, pp. 265-271). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ASRU.2017.8268945

@inproceedings{93a203270a2f4b0daeed4fa179823a41,
title = "Language independent end-to-end architecture for joint language identification and speech recognition",
abstract = "End-to-end automatic speech recognition (ASR) can significantly reduce the burden of developing ASR systems for new languages, by eliminating the need for linguistic information such as pronunciation dictionaries. This also creates an opportunity, which we fully exploit in this paper, to build a monolithic multilingual ASR system with a language-independent neural network architecture. We present a model that can recognize speech in 10 different languages, by directly performing grapheme (character/chunked-character) based speech recognition. The model is based on our hybrid attention/connectionist temporal classification (CTC) architecture which has previously been shown to achieve the state-of-the-art performance in several ASR benchmarks. Here we augment its set of output symbols to include the union of character sets appearing in all the target languages. These include Roman and Cyrillic Alphabets, Arabic numbers, simplified Chinese, and Japanese Kanji/Hiragana/Katakana characters (5,500 characters in all). This allows training of a single multilingual model, whose parameters are shared across all the languages. The model can jointly identify the language and recognize the speech, automatically formatting the recognized text in the appropriate character set. The experiments, which used speech databases composed of Wall Street Journal (English), Corpus of Spontaneous Japanese, HKUST Mandarin CTS, and Voxforge (German, Spanish, French, Italian, Dutch, Portuguese, Russian), demonstrate comparable/superior performance relative to language-dependent end-to-end ASR systems.",
keywords = "End-to-end ASR, hybrid attention/CTC, language identification, language-independent architecture, multilingual ASR",
author = "Shinji Watanabe and Takaaki Hori and Hershey, {John R.}",
year = "2018",
month = "1",
day = "24",
doi = "10.1109/ASRU.2017.8268945",
language = "English",
volume = "2018-January",
pages = "265--271",
booktitle = "2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - GEN

T1 - Language independent end-to-end architecture for joint language identification and speech recognition

AU - Watanabe, Shinji

AU - Hori, Takaaki

AU - Hershey, John R.

PY - 2018/1/24

Y1 - 2018/1/24

AB - End-to-end automatic speech recognition (ASR) can significantly reduce the burden of developing ASR systems for new languages, by eliminating the need for linguistic information such as pronunciation dictionaries. This also creates an opportunity, which we fully exploit in this paper, to build a monolithic multilingual ASR system with a language-independent neural network architecture. We present a model that can recognize speech in 10 different languages, by directly performing grapheme (character/chunked-character) based speech recognition. The model is based on our hybrid attention/connectionist temporal classification (CTC) architecture which has previously been shown to achieve the state-of-the-art performance in several ASR benchmarks. Here we augment its set of output symbols to include the union of character sets appearing in all the target languages. These include Roman and Cyrillic Alphabets, Arabic numbers, simplified Chinese, and Japanese Kanji/Hiragana/Katakana characters (5,500 characters in all). This allows training of a single multilingual model, whose parameters are shared across all the languages. The model can jointly identify the language and recognize the speech, automatically formatting the recognized text in the appropriate character set. The experiments, which used speech databases composed of Wall Street Journal (English), Corpus of Spontaneous Japanese, HKUST Mandarin CTS, and Voxforge (German, Spanish, French, Italian, Dutch, Portuguese, Russian), demonstrate comparable/superior performance relative to language-dependent end-to-end ASR systems.

KW - End-to-end ASR

KW - hybrid attention/CTC

KW - language identification

KW - language-independent architecture

KW - multilingual ASR

UR - http://www.scopus.com/inward/record.url?scp=85050560111&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85050560111&partnerID=8YFLogxK

U2 - 10.1109/ASRU.2017.8268945

DO - 10.1109/ASRU.2017.8268945

M3 - Conference contribution

AN - SCOPUS:85050560111

VL - 2018-January

SP - 265

EP - 271

BT - 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

ER -