Dictation of multiparty conversation using statistical turn taking model and speaker model

Noriyuki Murai, Tetsunori Kobayashi

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    6 Citations (Scopus)

    Abstract

    A new speech decoder dealing with multiparty conversation is proposed. Multiparty conversation denotes a situation in which multiple speakers talk to each other. Almost all conventional speech recognition systems assume that the input consists of a single speaker's voice. However, some applications, such as dialogue dictation and voice interfaces for multiple users, have to deal with speech from a mixture of speakers. In such a situation, the system has to recognize not only the word sequence of the input speech but also the speaker of each part of it. Therefore, we propose a decoder that utilizes not only an acoustic model and a language model, the resources of a conventional single-user speech decoder, but also a statistical turn-taking model and speaker models. This framework realizes simultaneous maximum-likelihood estimation of the spoken word sequence and the speaker sequence. Experimental results on TV sports news show that the proposed method reduces the word error rate by 7.7% and the speaker error rate by 97.8% compared to the conventional method.
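    The framework described in the abstract amounts to a joint search over word and speaker hypotheses. As a minimal sketch of that criterion (the notation and the first-order Markov factorization of the turn-taking model are our assumptions, not taken from the paper):

    \hat{W}, \hat{S} = \arg\max_{W,S} \; P(X \mid W, S)\, P(W)\, P(S), \qquad P(S) \approx \prod_{t} P(s_t \mid s_{t-1})

    where X is the observed speech, W the word sequence, and S the speaker sequence; P(X | W, S) is scored by the acoustic and speaker models, P(W) by the language model, and P(S) by the statistical turn-taking model, sketched here as a Markov chain over speaker turns.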

    Original language: English
    Title of host publication: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
    Publisher: IEEE
    Pages: 1575-1578
    Number of pages: 4
    Volume: 3
    Publication status: Published - 2000
    Event: 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing - Istanbul, Turkey
    Duration: 2000 Jun 5 - 2000 Jun 9


    ASJC Scopus subject areas

    • Signal Processing
    • Electrical and Electronic Engineering
    • Acoustics and Ultrasonics

    Cite this

    Murai, N., & Kobayashi, T. (2000). Dictation of multiparty conversation using statistical turn taking model and speaker model. In ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings (Vol. 3, pp. 1575-1578). IEEE.
