Speech synthesis for conversational news contents delivery

Hiroaki Takatsu, Ishin Fukuoka, Shinya Fujie, Kazuhiko Iwata, Tetsunori Kobayashi

Research output: Contribution to journal (Article)

Abstract

We have been developing a speech-based “news-delivery system” that conveys news content through spoken dialogue. Such a system needs a speech synthesis subsystem that can flexibly adjust the prosodic features of utterances: it should highlight spoken phrases that contain noteworthy information in an article, and it should insert properly controlled pauses between utterances so that users can react interactively, for example by asking questions. To achieve these goals, we incorporated the position of the utterance in the paragraph and the role of the utterance in the discourse structure into the feature set for speech synthesis. A thorough investigation of news-telling speech data uttered by a voice actress showed that these features are crucial for meeting the above requirements. Specifically, these features indicate the importance of the information carried by spoken phrases, and should therefore be used to synthesize prosodically adequate utterances. Based on these findings, we devised a deep neural network-based speech synthesis model that takes the role and position features as input. In addition, we designed a neural network model that estimates an adequate pause length between utterances. Experimental results showed that adding these features to the input yields speech better suited to information delivery. Furthermore, we confirmed that inserting pauses properly makes it easier for users to ask questions during system utterances.
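
As a rough illustration of how position and role information might be appended to a synthesis model's input, the sketch below builds an augmented feature vector. It is not the authors' implementation: the role inventory `ROLES`, the relative-position encoding, and the function names are all assumptions made for this example, since the abstract does not specify them.

```python
from typing import List, Sequence

# Hypothetical discourse-role inventory; the abstract does not list the actual roles.
ROLES = ("main", "elaboration", "background")


def position_feature(index: int, paragraph_len: int) -> float:
    """Relative position of an utterance in its paragraph, in [0, 1] (assumed encoding)."""
    if paragraph_len < 1 or not 0 <= index < paragraph_len:
        raise ValueError("index must lie inside the paragraph")
    if paragraph_len == 1:
        return 0.0
    return index / (paragraph_len - 1)


def role_feature(role: str) -> List[float]:
    """One-hot encoding of the utterance's discourse role."""
    if role not in ROLES:
        raise ValueError(f"unknown role: {role}")
    return [1.0 if r == role else 0.0 for r in ROLES]


def augment(linguistic: Sequence[float], index: int,
            paragraph_len: int, role: str) -> List[float]:
    """Append position and role features to a linguistic feature vector."""
    return (list(linguistic)
            + [position_feature(index, paragraph_len)]
            + role_feature(role))


if __name__ == "__main__":
    # Second of three utterances in a paragraph, tagged with a hypothetical role.
    print(augment([0.2, 0.7], index=1, paragraph_len=3, role="elaboration"))
```

The augmented vector would then be fed to the acoustic model in place of the plain linguistic features; a pause-length estimator could consume the same position and role features for each utterance boundary.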

Original language: English
Article number: B-I65_1-15
Journal: Transactions of the Japanese Society for Artificial Intelligence
Volume: 34
Issue number: 2
DOI: 10.1527/tjsai.B-I65
Publication status: Published - 2019 Jan 1

Fingerprint

Speech synthesis
Neural networks

Keywords

  • Conversational speech synthesis
  • DNN-based speech synthesis
  • Paragraph-based speech synthesis
  • Pause length estimation
  • Prominence

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence

Cite this

Speech synthesis for conversational news contents delivery. / Takatsu, Hiroaki; Fukuoka, Ishin; Fujie, Shinya; Iwata, Kazuhiko; Kobayashi, Tetsunori.

In: Transactions of the Japanese Society for Artificial Intelligence, Vol. 34, No. 2, B-I65_1-15, 01.01.2019.

@article{134cf74eea1341dab50e220ff32e7348,
title = "Speech synthesis for conversational news contents delivery",
abstract = "We have been developing a speech-based “news-delivery system”, which can transmit news contents via spoken dialogues. In such a system, a speech synthesis sub system that can flexibly adjust the prosodic features in utterances is highly vital: the system should be able to highlight spoken phrases containing noteworthy information in an article; it should also provide properly controlled pauses between utterances to facilitate user’s interactive reactions including questions. To achieve these goals, we have decided to incorporate the position of the utterance in the paragraph and the role of the utterance in the discourse structure into the bundle of features for speech synthesis. These features were found to be crucially important in fulfilling the above-mentioned requirements for the spoken utterances by the thorough investigation into the news-telling speech data uttered by a voice actress. Specifically, these features dictate the importance of information carried by spoken phrases, and hence should be effectively utilized in synthesizing prosodically adequate utterances. Based on these investigations, we devised a deep neural network-based speech synthesis model that takes as input the role and position features. In addition, we designed a neural network model that can estimate an adequate pause length between utterances. Experimental results showed that by adding these features to the input, it becomes more proper speech for information delivery. Furthermore, we confirmed that by inserting pauses properly, it becomes easier for users to ask questions during system utterances.",
keywords = "Conversational speech synthesis, DNN-based speech synthesis, Paragraph-based speech synthesis, Pause length estimation, Prominence",
author = "Hiroaki Takatsu and Ishin Fukuoka and Shinya Fujie and Kazuhiko Iwata and Tetsunori Kobayashi",
year = "2019",
month = "1",
day = "1",
doi = "10.1527/tjsai.B-I65",
language = "English",
volume = "34",
journal = "Transactions of the Japanese Society for Artificial Intelligence",
issn = "1346-0714",
publisher = "Japanese Society for Artificial Intelligence",
number = "2",

}

TY  - JOUR
T1  - Speech synthesis for conversational news contents delivery
AU  - Takatsu, Hiroaki
AU  - Fukuoka, Ishin
AU  - Fujie, Shinya
AU  - Iwata, Kazuhiko
AU  - Kobayashi, Tetsunori
PY  - 2019/1/1
Y1  - 2019/1/1
AB  - We have been developing a speech-based “news-delivery system”, which can transmit news contents via spoken dialogues. In such a system, a speech synthesis sub system that can flexibly adjust the prosodic features in utterances is highly vital: the system should be able to highlight spoken phrases containing noteworthy information in an article; it should also provide properly controlled pauses between utterances to facilitate user’s interactive reactions including questions. To achieve these goals, we have decided to incorporate the position of the utterance in the paragraph and the role of the utterance in the discourse structure into the bundle of features for speech synthesis. These features were found to be crucially important in fulfilling the above-mentioned requirements for the spoken utterances by the thorough investigation into the news-telling speech data uttered by a voice actress. Specifically, these features dictate the importance of information carried by spoken phrases, and hence should be effectively utilized in synthesizing prosodically adequate utterances. Based on these investigations, we devised a deep neural network-based speech synthesis model that takes as input the role and position features. In addition, we designed a neural network model that can estimate an adequate pause length between utterances. Experimental results showed that by adding these features to the input, it becomes more proper speech for information delivery. Furthermore, we confirmed that by inserting pauses properly, it becomes easier for users to ask questions during system utterances.
KW  - Conversational speech synthesis
KW  - DNN-based speech synthesis
KW  - Paragraph-based speech synthesis
KW  - Pause length estimation
KW  - Prominence
UR  - http://www.scopus.com/inward/record.url?scp=85065556732&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85065556732&partnerID=8YFLogxK
U2  - 10.1527/tjsai.B-I65
DO  - 10.1527/tjsai.B-I65
M3  - Article
VL  - 34
JO  - Transactions of the Japanese Society for Artificial Intelligence
JF  - Transactions of the Japanese Society for Artificial Intelligence
SN  - 1346-0714
IS  - 2
M1  - B-I65_1-15
ER  - 