Improving End-to-end Speech Recognition with Pronunciation-assisted Sub-word Modeling

Hainan Xu, Shuoyang Ding, Shinji Watanabe

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Most end-to-end speech recognition systems model text directly as a sequence of characters or sub-words. Current approaches to sub-word extraction only consider character sequence frequencies, which at times produce inferior sub-word segmentation that might lead to erroneous speech recognition output. We propose pronunciation-assisted sub-word modeling (PASM), a sub-word extraction method that leverages the pronunciation information of a word. Experiments show that the proposed method can greatly improve upon the character-based baseline, and also outperform commonly used byte-pair encoding methods.
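
To make the abstract's point about frequency-only extraction concrete, here is a minimal sketch of generic byte-pair encoding (BPE) merge learning. It is not the paper's PASM method, whose details are not given on this page. The function name `learn_bpe` and the toy word counts are hypothetical and chosen only for illustration.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: count} dict (illustrative sketch)."""
    # Start with every word split into single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # The merge decision depends on pair frequency alone.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


# Hypothetical toy corpus: word -> occurrence count.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_counts, num_merges=2)
print(merges)
print(list(segmented))
```

Nothing in this loop consults how a word is pronounced. The merges are driven purely by character-sequence counts, which is the limitation the abstract attributes to current sub-word extraction approaches.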

Original language: English
Title of host publication: 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 7110-7114
Number of pages: 5
ISBN (Electronic): 9781479981311
DOIs: https://doi.org/10.1109/ICASSP.2019.8682494
Publication status: Published - 2019 May 1
Event: 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Brighton, United Kingdom
Duration: 2019 May 12 → 2019 May 17

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2019-May
ISSN (Print): 1520-6149

Conference

Conference: 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
Country: United Kingdom
City: Brighton
Period: 19/5/12 → 19/5/17

Keywords

  • end-to-end models
  • speech recognition
  • sub-word modeling

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering

Cite this

Xu, H., Ding, S., & Watanabe, S. (2019). Improving End-to-end Speech Recognition with Pronunciation-assisted Sub-word Modeling. In 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings (pp. 7110-7114). [8682494] (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2019-May). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP.2019.8682494

@inproceedings{887de62071f740f48d36bbb7aa5991d7,
title = "Improving End-to-end Speech Recognition with Pronunciation-assisted Sub-word Modeling",
abstract = "Most end-to-end speech recognition systems model text directly as a sequence of characters or sub-words. Current approaches to sub-word extraction only consider character sequence frequencies, which at times produce inferior sub-word segmentation that might lead to erroneous speech recognition output. We propose pronunciation-assisted sub-word modeling (PASM), a sub-word extraction method that leverages the pronunciation information of a word. Experiments show that the proposed method can greatly improve upon the character-based baseline, and also outperform commonly used byte-pair encoding methods.",
keywords = "end-to-end models, speech recognition, sub-word modeling",
author = "Hainan Xu and Shuoyang Ding and Shinji Watanabe",
year = "2019",
month = "5",
day = "1",
doi = "10.1109/ICASSP.2019.8682494",
language = "English",
series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "7110--7114",
booktitle = "2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings",
}

TY - GEN
T1 - Improving End-to-end Speech Recognition with Pronunciation-assisted Sub-word Modeling
AU - Xu, Hainan
AU - Ding, Shuoyang
AU - Watanabe, Shinji
PY - 2019/5/1
Y1 - 2019/5/1
N2 - Most end-to-end speech recognition systems model text directly as a sequence of characters or sub-words. Current approaches to sub-word extraction only consider character sequence frequencies, which at times produce inferior sub-word segmentation that might lead to erroneous speech recognition output. We propose pronunciation-assisted sub-word modeling (PASM), a sub-word extraction method that leverages the pronunciation information of a word. Experiments show that the proposed method can greatly improve upon the character-based baseline, and also outperform commonly used byte-pair encoding methods.
KW - end-to-end models
KW - speech recognition
KW - sub-word modeling
UR - http://www.scopus.com/inward/record.url?scp=85068978138&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85068978138&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.2019.8682494
DO - 10.1109/ICASSP.2019.8682494
M3 - Conference contribution
AN - SCOPUS:85068978138
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 7110
EP - 7114
BT - 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
ER -