Abstract
This paper proposes streaming automatic speech recognition (ASR) with re-blocking processing based on integrated voice activity detection (VAD). End-to-end (E2E) ASR models are promising for practical ASR. One of the key issues in realizing such a system is the detection of voice segments to handle streaming input. Speech segmentation in streaming applications faces three challenges: 1) a separate VAD module in addition to the ASR model increases system complexity and the number of parameters, 2) inappropriate segmentation of speech for block-based streaming methods degrades recognition performance, and 3) non-voice segments that are not discarded incur unnecessary computational cost. This paper proposes a model that integrates a VAD branch into a block processing-based streaming ASR system, together with a re-blocking technique that avoids inappropriate isolation of utterances. Experiments show that the proposed method reduces the detection error rate (ER) by 25.8% on the AMI dataset with less than a 1% increase in the number of parameters. Furthermore, the proposed method shows a 7.5% relative improvement in character error rate (CER) on the CSJ dataset with a 27.3% reduction in real-time factor (RTF).
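To make the re-blocking idea concrete, the following is a minimal sketch, not the authors' implementation: a toy energy-based stand-in for the integrated VAD branch classifies each fixed-size block, non-voice blocks are dropped, and contiguous voice blocks are re-blocked into one segment so an utterance is not cut at an arbitrary block boundary. The function names (`vad_branch`, `asr_encoder_block`), block length, and threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

BLOCK_LEN = 40       # frames per block (assumed)
VAD_THRESHOLD = 0.5  # voice / non-voice decision threshold (assumed)


def vad_branch(block: np.ndarray) -> float:
    """Toy stand-in for the VAD branch: returns a speech probability
    (here a simple energy test instead of a learned classifier)."""
    return float(np.mean(block ** 2) > 0.01)


def asr_encoder_block(block: np.ndarray) -> np.ndarray:
    """Toy stand-in for one block of streaming ASR encoding."""
    return block  # a real system would return encoder states here


def stream_with_reblocking(frames: np.ndarray):
    """Split the stream into fixed-size blocks, run the VAD branch on each,
    discard non-voice blocks, and merge contiguous voice blocks into one
    segment (re-blocking) so utterances are not split at block boundaries."""
    segments, current = [], []
    for start in range(0, len(frames), BLOCK_LEN):
        block = frames[start:start + BLOCK_LEN]
        if vad_branch(block) >= VAD_THRESHOLD:
            current.append(asr_encoder_block(block))   # voice: keep encoding
        elif current:
            segments.append(np.concatenate(current))   # non-voice: close segment
            current = []
    if current:
        segments.append(np.concatenate(current))
    return segments


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # toy stream: silence, speech-like noise, silence, speech-like noise
    stream = np.concatenate([
        np.zeros(80),
        0.5 * rng.standard_normal(120),
        np.zeros(80),
        0.5 * rng.standard_normal(60),
    ])
    for i, seg in enumerate(stream_with_reblocking(stream)):
        print(f"segment {i}: {len(seg)} frames")
```

In this sketch the computational saving comes from never calling the encoder on non-voice blocks, and the re-blocking step is simply the concatenation of consecutive voice blocks into a single segment before decoding.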
Original language | English |
---|---|
Pages (from-to) | 4641-4645 |
Number of pages | 5 |
Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
Volume | 2022-September |
DOIs | |
Publication status | Published - 2022 |
Externally published | Yes |
Event | 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022 - Incheon, Republic of Korea. Duration: 18-22 September 2022 |
ASJC Scopus subject areas
- Language and Linguistics
- Human-Computer Interaction
- Signal Processing
- Software
- Modelling and Simulation