ESPNet: End-to-end speech processing toolkit

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, Tsubasa Ochiai

Research output: Contribution to journal › Conference article

57 Citations (Scopus)

Abstract

This paper introduces a new open source platform for end-to-end speech processing named ESPnet. ESPnet mainly focuses on end-to-end automatic speech recognition (ASR), and adopts widely-used dynamic neural network toolkits, Chainer and PyTorch, as a main deep learning engine. ESPnet also follows the Kaldi ASR toolkit style for data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments. This paper explains a major architecture of this software platform, several important functionalities, which differentiate ESPnet from other open source ASR toolkits, and experimental results with major ASR benchmarks.

Original language: English
Pages (from-to): 2207-2211
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2018-September
DOI: https://doi.org/10.21437/Interspeech.2018-1456
Publication status: Published - 2018 Jan 1
Externally published: Yes
Event: 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India
Duration: 2018 Sep 2 → 2018 Sep 6

Fingerprint

Speech Processing
Automatic Speech Recognition
Speech Recognition
Open Source
Dynamic Neural Networks
Differentiate
Feature Extraction
Engine
Benchmark
Software
Toolkit
Experimental Results
Neural Networks
Experiment

Keywords

  • Dynamical neural network
  • End-to-end
  • Kaldi
  • Open source software
  • Speech recognition

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

ESPNet: End-to-end speech processing toolkit. / Watanabe, Shinji; Hori, Takaaki; Karita, Shigeki; Hayashi, Tomoki; Nishitoba, Jiro; Unno, Yuya; Soplin, Nelson Enrique Yalta; Heymann, Jahn; Wiesner, Matthew; Chen, Nanxin; Renduchintala, Adithya; Ochiai, Tsubasa.

In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2018-September, 01.01.2018, p. 2207-2211.

Watanabe, S, Hori, T, Karita, S, Hayashi, T, Nishitoba, J, Unno, Y, Soplin, NEY, Heymann, J, Wiesner, M, Chen, N, Renduchintala, A & Ochiai, T 2018, 'ESPNet: End-to-end speech processing toolkit', Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2018-September, pp. 2207-2211. https://doi.org/10.21437/Interspeech.2018-1456
@article{a91e2e35e06c43c9a9a3781e3a3ec130,
title = "ESPNet: End-to-end speech processing toolkit",
abstract = "This paper introduces a new open source platform for end-to-end speech processing named ESPnet. ESPnet mainly focuses on end-to-end automatic speech recognition (ASR), and adopts widely-used dynamic neural network toolkits, Chainer and PyTorch, as a main deep learning engine. ESPnet also follows the Kaldi ASR toolkit style for data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments. This paper explains a major architecture of this software platform, several important functionalities, which differentiate ESPnet from other open source ASR toolkits, and experimental results with major ASR benchmarks.",
keywords = "Dynamical neural network, End-to-end, Kaldi, Open source software, Speech recognition",
author = "Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Soplin, {Nelson Enrique Yalta} and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai",
year = "2018",
month = "1",
day = "1",
doi = "10.21437/Interspeech.2018-1456",
language = "English",
volume = "2018-September",
pages = "2207--2211",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",

}

TY - JOUR
T1 - ESPNet
T2 - End-to-end speech processing toolkit
AU - Watanabe, Shinji
AU - Hori, Takaaki
AU - Karita, Shigeki
AU - Hayashi, Tomoki
AU - Nishitoba, Jiro
AU - Unno, Yuya
AU - Soplin, Nelson Enrique Yalta
AU - Heymann, Jahn
AU - Wiesner, Matthew
AU - Chen, Nanxin
AU - Renduchintala, Adithya
AU - Ochiai, Tsubasa
PY - 2018/1/1
Y1 - 2018/1/1
AB - This paper introduces a new open source platform for end-to-end speech processing named ESPnet. ESPnet mainly focuses on end-to-end automatic speech recognition (ASR), and adopts widely-used dynamic neural network toolkits, Chainer and PyTorch, as a main deep learning engine. ESPnet also follows the Kaldi ASR toolkit style for data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments. This paper explains a major architecture of this software platform, several important functionalities, which differentiate ESPnet from other open source ASR toolkits, and experimental results with major ASR benchmarks.
KW - Dynamical neural network
KW - End-to-end
KW - Kaldi
KW - Open source software
KW - Speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85054997993&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85054997993&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2018-1456
DO - 10.21437/Interspeech.2018-1456
M3 - Conference article
AN - SCOPUS:85054997993
VL - 2018-September
SP - 2207
EP - 2211
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SN - 2308-457X
ER -