Prosody control of utterance sequence for information delivering

Research output: Contribution to journal > Conference article

1 Citation (Scopus)

Abstract

We propose a conversational speech synthesis system in which the prosodic features of each utterance are controlled throughout the entire input text. We have developed a "news-telling system," which delivers news articles through spoken language. A speech synthesis system for news-telling should be able to highlight the utterances that contain noteworthy information in an article by speaking them in a distinctive way, so as to impress them on the users. To achieve this, we introduced the role and position features of the individual utterances in the article into the control parameters for prosody generation throughout the text. We defined three categories for the role feature: a nucleus (assigned to the utterance that includes the noteworthy information), a front satellite (which precedes the nucleus), and a rear satellite (which follows the nucleus). We investigated how the prosodic features differed depending on the role and position features by analyzing news-telling speech data uttered by a voice actress. We designed the speech synthesis system on the basis of a deep neural network with the role and position features added to its input layer. Objective and subjective evaluation results showed that introducing these features was effective in speech synthesis for information delivery.
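The role and position features described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the features are encoded, so the one-hot role encoding, the normalized position value, the single-nucleus assumption, and the names `assign_roles`, `context_features`, and `augment_input` are all hypothetical choices made for illustration.

```python
from typing import List

# Role categories named in the abstract; the ordering here is an assumption.
ROLES = ("front_satellite", "nucleus", "rear_satellite")


def assign_roles(num_utterances: int, nucleus_index: int) -> List[str]:
    """Label each utterance relative to one nucleus (single nucleus assumed)."""
    if not 0 <= nucleus_index < num_utterances:
        raise ValueError("nucleus_index must refer to an existing utterance")
    roles = []
    for i in range(num_utterances):
        if i < nucleus_index:
            roles.append("front_satellite")  # precedes the nucleus
        elif i == nucleus_index:
            roles.append("nucleus")  # carries the noteworthy information
        else:
            roles.append("rear_satellite")  # follows the nucleus
    return roles


def context_features(num_utterances: int, nucleus_index: int) -> List[List[float]]:
    """Build [one-hot role, normalized position] for every utterance.

    Position is encoded as index / (N - 1); this encoding is an assumption.
    """
    features = []
    for i, role in enumerate(assign_roles(num_utterances, nucleus_index)):
        one_hot = [1.0 if role == r else 0.0 for r in ROLES]
        position = i / (num_utterances - 1) if num_utterances > 1 else 0.0
        features.append(one_hot + [position])
    return features


def augment_input(linguistic: List[float], context: List[float]) -> List[float]:
    """Append the utterance-level features to a frame's linguistic features,
    mimicking 'role and position features added to the input layer'."""
    return list(linguistic) + list(context)


if __name__ == "__main__":
    ctx = context_features(num_utterances=3, nucleus_index=1)
    for row in ctx:
        print(row)
    # Hypothetical two-dimensional linguistic feature vector for one frame.
    print(augment_input([0.2, 0.7], ctx[1]))
```

In this sketch every frame of an utterance would receive the same four extra input values, so a network trained on the augmented vectors could in principle learn role-dependent and position-dependent prosody.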

Original language: English
Pages (from-to): 774-778
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2017-August
DOIs: 10.21437/Interspeech.2017-708
Publication status: Published - 2017 Jan 1
Event: 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017 - Stockholm, Sweden
Duration: 2017 Aug 20 to 2017 Aug 24

Fingerprint

Prosody
Speech Synthesis
Nucleus
Satellites
Subjective Evaluation
Control Parameter
Neural Networks
Utterance
News
Text

Keywords

  • Conversational speech
  • Discourse analysis
  • Neural network
  • Prosody
  • Speech synthesis

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

@article{9d621b786a2548d7a20b615dc852ac1e,
title = "Prosody control of utterance sequence for information delivering",
abstract = "We propose a conversational speech synthesis system in which the prosodic features of each utterance are controlled throughout the entire input text. We have developed a {"}news-telling system,{"} which delivered news articles through spoken language. The speech synthesis system for the news-telling should be able to highlight utterances containing noteworthy information in the article with a particular way of speaking so as to impress them on the users. To achieve this, we introduced role and position features of the individual utterances in the article into the control parameters for prosody generation throughout the text. We defined three categories for the role feature: a nucleus (which is assigned to the utterance including the noteworthy information), a front satellite (which precedes the nucleus) and a rear satellite (which follows the nucleus). We investigated how the prosodic features differed depending on the role and position features through an analysis of news-telling speech data uttered by a voice actress. We designed the speech synthesis system on the basis of a deep neural network having the role and position features added to its input layer. Objective and subjective evaluation results showed that introducing those features was effective in the speech synthesis for the information delivering.",
keywords = "Conversational speech, Discourse analysis, Neural network, Prosody, Speech synthesis",
author = "Ishin Fukuoka and Kazuhiko Iwata and Tetsunori Kobayashi",
year = "2017",
month = "1",
day = "1",
doi = "10.21437/Interspeech.2017-708",
language = "English",
volume = "2017-August",
pages = "774--778",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",
}

TY - JOUR

T1 - Prosody control of utterance sequence for information delivering

AU - Fukuoka, Ishin

AU - Iwata, Kazuhiko

AU - Kobayashi, Tetsunori

PY - 2017/1/1

Y1 - 2017/1/1

AB - We propose a conversational speech synthesis system in which the prosodic features of each utterance are controlled throughout the entire input text. We have developed a "news-telling system," which delivered news articles through spoken language. The speech synthesis system for the news-telling should be able to highlight utterances containing noteworthy information in the article with a particular way of speaking so as to impress them on the users. To achieve this, we introduced role and position features of the individual utterances in the article into the control parameters for prosody generation throughout the text. We defined three categories for the role feature: a nucleus (which is assigned to the utterance including the noteworthy information), a front satellite (which precedes the nucleus) and a rear satellite (which follows the nucleus). We investigated how the prosodic features differed depending on the role and position features through an analysis of news-telling speech data uttered by a voice actress. We designed the speech synthesis system on the basis of a deep neural network having the role and position features added to its input layer. Objective and subjective evaluation results showed that introducing those features was effective in the speech synthesis for the information delivering.

KW - Conversational speech

KW - Discourse analysis

KW - Neural network

KW - Prosody

KW - Speech synthesis

UR - http://www.scopus.com/inward/record.url?scp=85039148476&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85039148476&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2017-708

DO - 10.21437/Interspeech.2017-708

M3 - Conference article

AN - SCOPUS:85039148476

VL - 2017-August

SP - 774

EP - 778

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SN - 2308-457X

ER -