High Quality Synthetic Speech Generation Using Synchronized Oscillators

Kenji Hashimoto, Takemi Mochida, Yasuaki Sato, Tetsunori Kobayashi, Katsuhiko Shirai

    Research output: Article

    3 citations (Scopus)

    Abstract

    To produce high-quality synthetic sound in a text-to-speech system, a high-quality method for synthesizing speech signals is indispensable. This paper proposes a new speech analysis-synthesis method for text-to-speech systems. Voiced speech signals, which have a line spectrum structure at intervals of pitch in the linear frequency domain, can be represented approximately by a superposition of sinusoidal waves. In our system, analysis and synthesis both exploit this harmonic structure of voiced speech. In the analysis phase, an exact harmonic structure model at intervals of pitch is fitted to the fine structure of the short-time power spectrum, and the fundamental frequency fo is chosen to minimize the error of the log-power spectrum at each peak position. The value of this minimized error is also used to determine the rate of periodicity of the speech signal. The log-power spectrum envelope is then represented by a cosine series interpolating the data sampled at every pitch period. In the synthesis phase, numerical solutions of non-linear differential equations that generate sinusoidal waves are used. For voiced sounds, these equations behave as a group of mutually synchronized oscillators, and the sinusoidal waves are superposed to reconstruct the line spectrum structure. For voiceless sounds, the same non-linear differential equations work as passive filters with input noise sources. The system has the following characteristics. (1) Voiced and voiceless sounds can be treated in the same framework. (2) Since the phase and power of each sinusoidal wave can be easily controlled, periodic waveforms in voiced sounds can, if necessary, be precisely reproduced in the time domain. (3) The fundamental frequency fo and phoneme duration can be easily changed without much degradation of the original sound quality.
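The voiced-sound synthesis idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual differential equations, so the code below substitutes a Hopf-type oscillator, a standard non-linear system whose limit cycle is a sinusoid. The function names, the `rate` parameter, the sampling rate, and the RK4 integrator are all assumptions made for illustration. The sketch also includes no coupling between oscillators: the harmonics stay locked only because their frequencies are set to exact multiples of the fundamental, which is weaker than the mutual synchronization the paper describes.

```python
import numpy as np

def hopf_rhs(z, omega, amps, rate):
    # Complex form of a Hopf-type oscillator bank (an assumed stand-in model,
    # not the equations from the paper):
    #   dz/dt = [rate * (1 - |z|^2 / amp^2) + i * omega] * z
    # The cubic term pulls |z| toward amp; the imaginary term rotates z at omega.
    return (rate * (1.0 - np.abs(z) ** 2 / amps ** 2) + 1j * omega) * z

def synthesize_voiced(f0, amps, fs=16000, duration=0.2, rate=200.0):
    """Superpose one oscillator per harmonic k * f0 (k = 1, 2, ...).

    amps holds the target amplitude of each harmonic (all must be nonzero).
    Returns the summed waveform and the complex state of every oscillator.
    """
    amps = np.asarray(amps, dtype=float)
    harmonics = np.arange(1, len(amps) + 1)
    omega = 2.0 * np.pi * f0 * harmonics      # angular frequency per harmonic
    dt = 1.0 / fs
    n = int(round(duration * fs))
    z = 0.1 * amps.astype(complex)            # start at 10% of target amplitude
    states = np.empty((n, len(amps)), dtype=complex)
    for i in range(n):
        states[i] = z
        # Classical fourth-order Runge-Kutta step.
        k1 = hopf_rhs(z, omega, amps, rate)
        k2 = hopf_rhs(z + 0.5 * dt * k1, omega, amps, rate)
        k3 = hopf_rhs(z + 0.5 * dt * k2, omega, amps, rate)
        k4 = hopf_rhs(z + dt * k3, omega, amps, rate)
        z = z + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    signal = states.real.sum(axis=1)          # superposition of the sinusoids
    return signal, states

# Three harmonics of a 100 Hz fundamental with decaying amplitudes.
signal, states = synthesize_voiced(100.0, [1.0, 0.5, 0.25])
print(np.abs(states[-1]))  # per-harmonic amplitudes after the transient
```

Because each oscillator's amplitude and frequency are separate parameters, calling `synthesize_voiced` with a different `f0` but the same `amps` changes the pitch while leaving the harmonic amplitudes untouched. This is one way to read characteristic (3) of the abstract, although a real system would resample the spectral envelope at the new harmonic positions.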

    Original language: English
    Pages (from-to): 1949-1956
    Number of pages: 8
    Journal: IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences
    Volume: E76-A
    Issue number: 11
    Publication status: Published - Nov 1993

    Fingerprint

    Acoustic waves
    Power spectrum
    Text-to-speech
    Fundamental Frequency
    Speech Signal
    Synthesis
    Nonlinear Differential Equations
    Differential equations
    Harmonic
    Speech Analysis
    Passive filters
    Interval
    Line
    Fine Structure
    Model structures
    Systems Analysis
    Waveform
    Periodicity

    ASJC Scopus subject areas

    • Hardware and Architecture
    • Information Systems
    • Electrical and Electronic Engineering

    Cite this

    High Quality Synthetic Speech Generation Using Synchronized Oscillators. / Hashimoto, Kenji; Mochida, Takemi; Sato, Yasuaki; Kobayashi, Tetsunori; Shirai, Katsuhiko.

    In: IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, Vol. E76-A, No. 11, 11.1993, p. 1949-1956.

    @article{b8e6269126cc4e33b039f00df5c5fdbe,
    title = "High Quality Synthetic Speech Generation Using Synchronized Oscillators",
    author = "Kenji Hashimoto and Takemi Mochida and Yasuaki Sato and Tetsunori Kobayashi and Katsuhiko Shirai",
    year = "1993",
    month = "11",
    language = "English",
    volume = "E76-A",
    pages = "1949--1956",
    journal = "IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences",
    issn = "0916-8508",
    publisher = "Maruzen Co., Ltd/Maruzen Kabushikikaisha",
    number = "11",

    }

    TY - JOUR

    T1 - High Quality Synthetic Speech Generation Using Synchronized Oscillators

    AU - Hashimoto, Kenji

    AU - Mochida, Takemi

    AU - Sato, Yasuaki

    AU - Kobayashi, Tetsunori

    AU - Shirai, Katsuhiko

    PY - 1993/11

    Y1 - 1993/11

    UR - http://www.scopus.com/inward/record.url?scp=0027698916&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=0027698916&partnerID=8YFLogxK

    M3 - Article

    AN - SCOPUS:0027698916

    VL - E76-A

    SP - 1949

    EP - 1956

    JO - IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences

    JF - IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences

    SN - 0916-8508

    IS - 11

    ER -