Structural classification methods based on weighted finite-state transducers for automatic speech recognition

Yotaro Kubo, Shinji Watanabe, Takaaki Hori, Atsushi Nakamura

Research output: Contribution to journal › Article

7 Citations (Scopus)

Abstract

The potential of structural classification methods for automatic speech recognition (ASR) has attracted the attention of the speech community, since they can realize unified modeling of the acoustic and linguistic aspects of recognizers. However, structural classification approaches involve a well-known tradeoff between the richness of features and the computational efficiency of decoders. If we are to employ, for example, a frame-synchronous one-pass decoding technique, the features used to calculate the likelihood of each hypothesis must be restricted to the same form as in conventional acoustic and language models. This paper tackles this limitation directly by exploiting the structure of the weighted finite-state transducers (WFSTs) used for decoding. Although WFST arcs provide rich contextual information, close integration with a computationally efficient decoding technique is still possible, since most decoding techniques only require that their likelihood functions be factorizable for each decoder arc and time frame. In this paper, we compare two methods for structural classification with the WFST-based features: the structured perceptron and conditional random field (CRF) techniques. To analyze the advantages of these two classifiers, we present experimental results for the TIMIT continuous phoneme recognition task, the WSJ transcription task, and the MIT lecture transcription task. We confirmed that the proposed approach improved ASR performance without sacrificing the computational efficiency of the decoders, even though the baseline systems were already trained with discriminative training techniques (e.g., MPE).
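The abstract's central constraint — that the likelihood function must factorize over decoder arcs and time frames — can be illustrated with a toy structured-perceptron sketch. Everything here is invented for illustration (the label set, the feature functions, and the brute-force decoder); it is not the paper's WFST feature set or decoder, only a minimal demonstration of a perceptron update over per-arc, per-frame features.

```python
# Toy structured perceptron with features that factorize per arc
# (prev_label, label transition) and per time frame, mirroring the
# factorization constraint described in the abstract. Hypothetical
# example only; a real system would decode over a WFST with Viterbi.
from itertools import product

LABELS = ["a", "b"]

def arc_features(prev, label, frame):
    # Each feature depends only on one arc and one frame observation.
    return [("emit", label, frame), ("trans", prev, label)]

def score(weights, seq, frames):
    s, prev = 0.0, "<s>"
    for label, frame in zip(seq, frames):
        for f in arc_features(prev, label, frame):
            s += weights.get(f, 0.0)
        prev = label
    return s

def decode(weights, frames):
    # Brute-force argmax over label sequences; kept simple on purpose.
    best = max(product(LABELS, repeat=len(frames)),
               key=lambda seq: score(weights, seq, frames))
    return list(best)

def perceptron_update(weights, frames, gold, lr=1.0):
    # Standard structured-perceptron step: reward gold-path features,
    # penalize predicted-path features, arc by arc.
    pred = decode(weights, frames)
    if pred == gold:
        return pred
    prev_g = prev_p = "<s>"
    for g, p, frame in zip(gold, pred, frames):
        for f in arc_features(prev_g, g, frame):
            weights[f] = weights.get(f, 0.0) + lr
        for f in arc_features(prev_p, p, frame):
            weights[f] = weights.get(f, 0.0) - lr
        prev_g, prev_p = g, p
    return pred
```

Because every feature is local to one arc and one frame, the scoring loop can run inside a frame-synchronous decoder without changing its search strategy — which is the efficiency property the abstract emphasizes.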

Original language: English
Article number: 6198870
Pages (from-to): 2240-2251
Number of pages: 12
Journal: IEEE Transactions on Audio, Speech, and Language Processing
Volume: 20
Issue number: 8
DOI: 10.1109/TASL.2012.2199112
Publication status: Published - 2012
Externally published: Yes

Fingerprint

  • speech recognition
  • decoding
  • decoders
  • transducers
  • transcription
  • computational efficiency
  • arcs
  • acoustics
  • phonemes
  • linguistics
  • self-organizing systems
  • lectures
  • tradeoffs
  • classifiers

Keywords

  • Automatic speech recognition (ASR)
  • structural classification
  • weighted finite-state transducers (WFST)

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Acoustics and Ultrasonics

Cite this

Structural classification methods based on weighted finite-state transducers for automatic speech recognition. / Kubo, Yotaro; Watanabe, Shinji; Hori, Takaaki; Nakamura, Atsushi.

In: IEEE Transactions on Audio, Speech and Language Processing, Vol. 20, No. 8, 6198870, 2012, p. 2240-2251.

Research output: Contribution to journal › Article

@article{5e344fcbcb8c40eabb213da1d3bc71ed,
title = "Structural classification methods based on weighted finite-state transducers for automatic speech recognition",
abstract = "The potential of structural classification methods for automatic speech recognition (ASR) has been attracting the speech community since they can realize the unified modeling of acoustic and linguistic aspects of recognizers. However, the structural classification approaches involve well-known tradeoffs between the richness of features and the computational efficiency of decoders. If we are to employ, for example, a frame-synchronous one-pass decoding technique, features considered to calculate the likelihood of each hypothesis must be restricted to the same form as the conventional acoustic and language models. This paper tackles this limitation directly by exploiting the structure of the weighted finite-state transducers (WFSTs) used for decoding. Although WFST arcs provide rich contextual information, close integration with a computationally efficient decoding technique is still possible since most decoding techniques only require that their likelihood functions are factorizable for each decoder arc and time frame. In this paper, we compare two methods for structural classification with the WFST-based features; the structured perceptron and conditional random field (CRF) techniques. To analyze the advantages of these two classifiers, we present experimental results for the TIMIT continuous phoneme recognition task, the WSJ transcription task, and the MIT lecture transcription task. We confirmed that the proposed approach improved the ASR performance without sacrificing the computational efficiency of the decoders, even though the baseline systems are already trained with discriminative training techniques (e.g., MPE).",
keywords = "Automatic speech recognition (ASR), structural classification, weighted finite-state transducers (WFST)",
author = "Yotaro Kubo and Shinji Watanabe and Takaaki Hori and Atsushi Nakamura",
year = "2012",
doi = "10.1109/TASL.2012.2199112",
language = "English",
volume = "20",
pages = "2240--2251",
journal = "IEEE Transactions on Audio, Speech, and Language Processing",
issn = "1558-7916",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "8",

}

TY - JOUR

T1 - Structural classification methods based on weighted finite-state transducers for automatic speech recognition

AU - Kubo, Yotaro

AU - Watanabe, Shinji

AU - Hori, Takaaki

AU - Nakamura, Atsushi

PY - 2012

Y1 - 2012

N2 - The potential of structural classification methods for automatic speech recognition (ASR) has been attracting the speech community since they can realize the unified modeling of acoustic and linguistic aspects of recognizers. However, the structural classification approaches involve well-known tradeoffs between the richness of features and the computational efficiency of decoders. If we are to employ, for example, a frame-synchronous one-pass decoding technique, features considered to calculate the likelihood of each hypothesis must be restricted to the same form as the conventional acoustic and language models. This paper tackles this limitation directly by exploiting the structure of the weighted finite-state transducers (WFSTs) used for decoding. Although WFST arcs provide rich contextual information, close integration with a computationally efficient decoding technique is still possible since most decoding techniques only require that their likelihood functions are factorizable for each decoder arc and time frame. In this paper, we compare two methods for structural classification with the WFST-based features; the structured perceptron and conditional random field (CRF) techniques. To analyze the advantages of these two classifiers, we present experimental results for the TIMIT continuous phoneme recognition task, the WSJ transcription task, and the MIT lecture transcription task. We confirmed that the proposed approach improved the ASR performance without sacrificing the computational efficiency of the decoders, even though the baseline systems are already trained with discriminative training techniques (e.g., MPE).

AB - The potential of structural classification methods for automatic speech recognition (ASR) has been attracting the speech community since they can realize the unified modeling of acoustic and linguistic aspects of recognizers. However, the structural classification approaches involve well-known tradeoffs between the richness of features and the computational efficiency of decoders. If we are to employ, for example, a frame-synchronous one-pass decoding technique, features considered to calculate the likelihood of each hypothesis must be restricted to the same form as the conventional acoustic and language models. This paper tackles this limitation directly by exploiting the structure of the weighted finite-state transducers (WFSTs) used for decoding. Although WFST arcs provide rich contextual information, close integration with a computationally efficient decoding technique is still possible since most decoding techniques only require that their likelihood functions are factorizable for each decoder arc and time frame. In this paper, we compare two methods for structural classification with the WFST-based features; the structured perceptron and conditional random field (CRF) techniques. To analyze the advantages of these two classifiers, we present experimental results for the TIMIT continuous phoneme recognition task, the WSJ transcription task, and the MIT lecture transcription task. We confirmed that the proposed approach improved the ASR performance without sacrificing the computational efficiency of the decoders, even though the baseline systems are already trained with discriminative training techniques (e.g., MPE).

KW - Automatic speech recognition (ASR)

KW - structural classification

KW - weighted finite-state transducers (WFST)

UR - http://www.scopus.com/inward/record.url?scp=84865227975&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84865227975&partnerID=8YFLogxK

U2 - 10.1109/TASL.2012.2199112

DO - 10.1109/TASL.2012.2199112

M3 - Article

VL - 20

SP - 2240

EP - 2251

JO - IEEE Transactions on Audio, Speech, and Language Processing

JF - IEEE Transactions on Audio, Speech, and Language Processing

SN - 1558-7916

IS - 8

M1 - 6198870

ER -