Memory-Efficient Training of RNN-Transducer with Sampled Softmax

Jaesong Lee, Lukas Lee, Shinji Watanabe

Research output: Contribution to journal › Conference article › peer-review

1 Citation (Scopus)


RNN-Transducer has been one of the promising architectures for end-to-end automatic speech recognition. Although RNN-Transducer has many advantages, including strong accuracy and suitability for streaming, its high memory consumption during training has been a critical obstacle to development. In this work, we propose applying sampled softmax to RNN-Transducer, which requires only a small subset of the vocabulary during training and thus reduces memory consumption. We further extend sampled softmax to optimize memory consumption across a minibatch, and use the distributions of auxiliary CTC losses to sample the vocabulary, which improves model accuracy. We present experimental results on LibriSpeech, AISHELL-1, and CSJ-APS, where sampled softmax greatly reduces memory consumption while maintaining the accuracy of the baseline model.
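As a rough illustration of why restricting the output layer to a vocabulary subset saves memory, the sketch below computes the transducer output distribution over a sampled vocabulary of size V' instead of the full size V, so the largest tensor shrinks from (B, T, U+1, V) to (B, T, U+1, V'). This is not the authors' implementation. The function names, the single shared subset per minibatch, and the uniform sampling of negative tokens are assumptions made for illustration only; the abstract states that the paper samples from auxiliary CTC distributions instead. Targets are assumed to be padded with the blank id.

```python
import numpy as np

def build_sampled_vocab(targets, vocab_size, num_sampled, blank_id=0, rng=None):
    """Return sorted token ids: blank, every token in the minibatch targets,
    plus uniformly sampled negatives (uniform sampling is an assumption)."""
    rng = np.random.default_rng() if rng is None else rng
    required = np.union1d(np.asarray([blank_id]), np.asarray(targets).ravel())
    candidates = np.setdiff1d(np.arange(vocab_size), required)
    n_extra = min(num_sampled, candidates.size)
    negatives = rng.choice(candidates, size=n_extra, replace=False)
    return np.sort(np.concatenate([required, negatives]))

def sampled_log_softmax(joint_hidden, out_weight, out_bias, sampled_ids):
    """Log-softmax over the sampled vocabulary only.

    joint_hidden: (B, T, U+1, H) joint-network activations
    out_weight:   (V, H) full output projection, out_bias: (V,)
    returns:      (B, T, U+1, V') with V' = len(sampled_ids)
    """
    w = out_weight[sampled_ids]              # (V', H): only the rows needed
    b = out_bias[sampled_ids]                # (V',)
    logits = joint_hidden @ w.T + b          # (B, T, U+1, V')
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

def remap_targets(targets, sampled_ids):
    """Map original token ids to their positions in the sorted sampled vocabulary."""
    return np.searchsorted(sampled_ids, targets)
```

The remapped targets and the reduced log-probabilities could then be passed to an ordinary transducer loss that treats V' as its vocabulary size. With, for example, V = 1000 and V' = 24, the output tensor is roughly 40 times smaller.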

Original language: English
Pages (from-to): 4441-4445
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Publication status: Published - 2022
Externally published: Yes
Event: 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Korea, Republic of
Duration: 2022 Sept 18 → 2022 Sept 22


  • auxiliary CTC loss
  • end-to-end speech recognition
  • RNN-Transducer
  • sampled softmax

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

