SpeakBySinging: Converting singing voices to speaking voices while retaining voice timbre

Shimpei Aso, Takeshi Saitou, Masataka Goto, Katsutoshi Itoyama, Toru Takahashi, Kazunori Komatani, Tetsuya Ogata, Hiroshi G. Okuno

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

6 Citations (Scopus)

Abstract

This paper describes a singing-to-speaking synthesis system called "SpeakBySinging" that can synthesize a speaking voice from an input singing voice and the song lyrics. The system controls three acoustic features that determine the difference between speaking and singing voices: the fundamental frequency (F0), phoneme duration, and power (volume). By changing these features of a singing voice, the system synthesizes a speaking voice while retaining the timbre of the singing voice. The system first analyzes the singing voice to extract the F0 contour, the duration of each phoneme of the lyrics, and the power. These features are then converted to target values that are obtained by feeding the lyrics into a traditional text-to-speech (TTS) system. The system finally generates a speaking voice that preserves the timbre of the singing voice but has speech-like features. Experimental results show that SpeakBySinging can convert singing voices into speaking voices whose timbre is almost the same as the original singing voices.
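
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Phoneme` and `TargetPhoneme` containers, the per-frame spectral-envelope representation of timbre, and the nearest-neighbour time stretch are all assumptions made for illustration. The sketch only shows the core idea of keeping the timbre frames of the singing voice while taking phoneme duration, F0, and power from TTS-derived targets.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phoneme:
    """Frame-level features of one phoneme (hypothetical representation)."""
    label: str
    envelope: List[List[float]]  # per-frame spectral envelope, standing in for timbre
    f0: List[float]              # per-frame fundamental frequency in Hz
    power: List[float]           # per-frame power


@dataclass
class TargetPhoneme:
    """Speech-like targets for one phoneme, assumed to come from a TTS system."""
    label: str
    f0: List[float]     # target F0 contour; its length defines the target duration
    power: List[float]  # target power contour, same length as f0


def resample_indices(n_src: int, n_dst: int) -> List[int]:
    """Map n_dst output frames onto n_src source frames (nearest neighbour)."""
    if n_src <= 0 or n_dst <= 0:
        return []
    if n_dst == 1:
        return [0]
    return [round(i * (n_src - 1) / (n_dst - 1)) for i in range(n_dst)]


def convert_phoneme(sung: Phoneme, target: TargetPhoneme) -> Phoneme:
    """Keep the sung envelope; take duration, F0, and power from the target."""
    if len(target.f0) != len(target.power):
        raise ValueError("target F0 and power must have the same length")
    idx = resample_indices(len(sung.envelope), len(target.f0))
    return Phoneme(
        label=sung.label,
        envelope=[sung.envelope[i] for i in idx],  # timbre, time-stretched
        f0=list(target.f0),                        # speech-like F0
        power=list(target.power),                  # speech-like power
    )


def convert(sung: List[Phoneme], target: List[TargetPhoneme]) -> List[Phoneme]:
    """Convert a phoneme-aligned singing voice to speech-like features."""
    if [p.label for p in sung] != [t.label for t in target]:
        raise ValueError("sung and target phoneme sequences must match")
    return [convert_phoneme(s, t) for s, t in zip(sung, target)]


if __name__ == "__main__":
    sung = [Phoneme("a", [[1.0], [2.0], [3.0], [4.0]], [220.0] * 4, [0.9] * 4)]
    target = [TargetPhoneme("a", [120.0, 110.0], [0.5, 0.4])]
    spoken = convert(sung, target)
    print(len(spoken[0].envelope), spoken[0].f0)
```

In this sketch a four-frame sung vowel is shortened to the two-frame target duration, and its F0 and power are replaced by the target contours. A real system would additionally need phoneme alignment of the lyrics and a vocoder to resynthesize audio from these features, neither of which is shown here.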

Original language: English
Title of host publication: 13th International Conference on Digital Audio Effects, DAFx 2010 Proceedings
Publication status: Published - 2010
Externally published: Yes
Event: 13th International Conference on Digital Audio Effects, DAFx 2010 - Graz
Duration: 2010 Sep 6 to 2010 Sep 10

Other

Other: 13th International Conference on Digital Audio Effects, DAFx 2010
City: Graz
Period: 10/9/6 to 10/9/10

Fingerprint

Speech recognition
Acoustics
Control systems

ASJC Scopus subject areas

  • Signal Processing

Cite this

Aso, S., Saitou, T., Goto, M., Itoyama, K., Takahashi, T., Komatani, K., ... Okuno, H. G. (2010). SpeakBySinging: Converting singing voices to speaking voices while retaining voice timbre. In 13th International Conference on Digital Audio Effects, DAFx 2010 Proceedings

@inproceedings{32a39f649e3b46a7994217a51180a79a,
title = "SpeakBySinging: Converting singing voices to speaking voices while retaining voice timbre",
abstract = "This paper describes a singing-to-speaking synthesis system called {"}SpeakBySinging{"} that can synthesize a speaking voice from an input singing voice and the song lyrics. The system controls three acoustic features that determine the difference between speaking and singing voices: the fundamental frequency (F0), phoneme duration, and power (volume). By changing these features of a singing voice, the system synthesizes a speaking voice while retaining the timbre of the singing voice. The system first analyzes the singing voice to extract the F0 contour, the duration of each phoneme of the lyrics, and the power. These features are then converted to target values that are obtained by feeding the lyrics into a traditional text-to-speech (TTS) system. The system finally generates a speaking voice that preserves the timbre of the singing voice but has speech-like features. Experimental results show that SpeakBySinging can convert singing voices into speaking voices whose timbre is almost the same as the original singing voices.",
author = "Shimpei Aso and Takeshi Saitou and Masataka Goto and Katsutoshi Itoyama and Toru Takahashi and Kazunori Komatani and Tetsuya Ogata and Okuno, {Hiroshi G.}",
year = "2010",
language = "English",
isbn = "9783200019409",
booktitle = "13th International Conference on Digital Audio Effects, DAFx 2010 Proceedings",

}

TY - GEN

T1 - SpeakBySinging

T2 - Converting singing voices to speaking voices while retaining voice timbre

AU - Aso, Shimpei

AU - Saitou, Takeshi

AU - Goto, Masataka

AU - Itoyama, Katsutoshi

AU - Takahashi, Toru

AU - Komatani, Kazunori

AU - Ogata, Tetsuya

AU - Okuno, Hiroshi G.

PY - 2010

Y1 - 2010

AB - This paper describes a singing-to-speaking synthesis system called "SpeakBySinging" that can synthesize a speaking voice from an input singing voice and the song lyrics. The system controls three acoustic features that determine the difference between speaking and singing voices: the fundamental frequency (F0), phoneme duration, and power (volume). By changing these features of a singing voice, the system synthesizes a speaking voice while retaining the timbre of the singing voice. The system first analyzes the singing voice to extract the F0 contour, the duration of each phoneme of the lyrics, and the power. These features are then converted to target values that are obtained by feeding the lyrics into a traditional text-to-speech (TTS) system. The system finally generates a speaking voice that preserves the timbre of the singing voice but has speech-like features. Experimental results show that SpeakBySinging can convert singing voices into speaking voices whose timbre is almost the same as the original singing voices.

UR - http://www.scopus.com/inward/record.url?scp=84872734987&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84872734987&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84872734987

SN - 9783200019409

BT - 13th International Conference on Digital Audio Effects, DAFx 2010 Proceedings

ER -