Multiple index combination for Japanese spoken term detection with optimum index selection based on OOV-region classifier

Naoyuki Kanda, Katsutoshi Itoyama, Hiroshi G. Okuno

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

This paper proposes a novel index combination method for spoken term detection. Outputs from four different recognizers (word, syllable, word-syllable, and fragment recognizers) are combined into a single confusion network. Because combining indexes enlarges the index, a novel index-selection method is then applied to suppress the growth in index size. Two selection methods are proposed: (1) arc selection and (2) unit selection, both based on the score of an out-of-vocabulary (OOV) region classifier. Experiments on 39 hours of Japanese lecture recordings showed that index selection reduced the index size of the best confusion network by 22% while maintaining its high accuracy. Compared with the best phoneme-based index from a single recognizer, the proposed method achieved relative error reductions of 25.0% and 14.8% for in-vocabulary (IV) and OOV queries, respectively, without increasing the index size.
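
The abstract does not say how the classifier score drives arc selection, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a confusion network is a list of slots, each slot a list of arcs, and that each slot has one OOV-region score in [0, 1]. The names `Arc`, `select_arcs`, `prune_network`, `index_size`, the `unit` tags, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    label: str        # recognized unit, e.g. a word or a syllable
    unit: str         # hypothetical tag: "word" or "subword"
    posterior: float  # arc posterior within its confusion-network slot


def select_arcs(slot, oov_score, threshold=0.5):
    """Keep subword arcs where the region looks OOV, word arcs otherwise.

    Assumed rule, for illustration only. If the preferred unit type is
    absent, the whole slot is kept so that no slot is ever emptied.
    """
    wanted = "subword" if oov_score >= threshold else "word"
    kept = [arc for arc in slot if arc.unit == wanted]
    return kept if kept else list(slot)


def prune_network(network, oov_scores, threshold=0.5):
    """Apply arc selection slot by slot."""
    return [
        select_arcs(slot, score, threshold)
        for slot, score in zip(network, oov_scores)
    ]


def index_size(network):
    """Index size measured as the total number of arcs (an assumption)."""
    return sum(len(slot) for slot in network)


# Toy network with made-up arcs and scores, not data from the paper.
network = [
    [Arc("tokyo", "word", 0.7), Arc("to", "subword", 0.3)],
    [Arc("kanda", "word", 0.4), Arc("ka", "subword", 0.6)],
]
oov_scores = [0.1, 0.9]  # second slot is assumed to lie in an OOV region

pruned = prune_network(network, oov_scores)
print(index_size(network), index_size(pruned))
```

In this toy case the first slot keeps only its word arc and the second keeps only its subword arc, which illustrates how selecting arcs by region can shrink a combined index while keeping subword evidence where OOV terms are likely.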

Original language: English
Title of host publication: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Pages: 8540-8544
Number of pages: 5
Publication status: Published - 2013 Oct 18
Externally published: Yes
Event: 2013 38th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2013 - Vancouver, BC
Duration: 2013 May 26 to 2013 May 31

Keywords

  • keyword spotting
  • out-of-vocabulary detection
  • spoken term detection

ASJC Scopus subject areas

  • Signal Processing
  • Software
  • Electrical and Electronic Engineering
