TY - GEN
T1 - ESPnet-ST IWSLT 2021 Offline Speech Translation System
AU - Inaguma, Hirofumi
AU - Yan, Brian
AU - Dalmia, Siddharth
AU - Guo, Pengcheng
AU - Shi, Jiatong
AU - Duh, Kevin
AU - Watanabe, Shinji
N1 - Funding Information:
This work was partly supported by ASAPP and JHU HLTCOE. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) (Towns et al., 2014), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges system (Nystrom et al., 2015), which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).
Publisher Copyright:
© 2021 Association for Computational Linguistics.
PY - 2021
Y1 - 2021
AB - This paper describes the ESPnet-ST group’s IWSLT 2021 submission in the offline speech translation track. This year we made various efforts on training data, architecture, and audio segmentation. On the data side, we investigated sequence-level knowledge distillation (SeqKD) for end-to-end (E2E) speech translation. Specifically, we used multi-referenced SeqKD from multiple teachers trained on different amounts of bitext. On the architecture side, we adopted the Conformer encoder and the Multi-Decoder architecture, which equips dedicated decoders for speech recognition and translation tasks in a unified encoder-decoder model and enables search in both source and target language spaces during inference. We also significantly improved audio segmentation by using the pyannote.audio toolkit and merging multiple short segments for long context modeling. Experimental evaluations showed that each of them contributed to large improvements in translation performance. Our best E2E system combined all the above techniques with model ensembling and achieved 31.4 BLEU on the 2-ref of tst2021 and 21.2 BLEU and 19.3 BLEU on the two single references of tst2021.
UR - http://www.scopus.com/inward/record.url?scp=85115728607&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85115728607&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85115728607
T3 - IWSLT 2021 - 18th International Conference on Spoken Language Translation, Proceedings
SP - 100
EP - 109
BT - IWSLT 2021 - 18th International Conference on Spoken Language Translation, Proceedings
PB - Association for Computational Linguistics (ACL)
T2 - 18th International Conference on Spoken Language Translation, IWSLT 2021
Y2 - 5 August 2021 through 6 August 2021
ER -