Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge

Gregory Sell, David Snyder, Alan McCree, Daniel Garcia-Romero, Jesús Villalba, Matthew Maciejewski, Vimal Manohar, Najim Dehak, Daniel Povey, Shinji Watanabe, Sanjeev Khudanpur

Research output: Contribution to journal › Conference article

22 Citations (Scopus)

Abstract

We describe in this paper the experiences of the Johns Hopkins University team during the inaugural DIHARD diarization evaluation. This new task provided microphone recordings in a variety of difficult conditions and challenged researchers to fully consider all speaker activity, without the currently typical practices of unscored collars or ignored overlapping speaker segments. This paper explores several key aspects of currently state-of-the-art diarization methods, such as training data selection, signal bandwidth for feature extraction, representations of speech segments (i-vector versus x-vector), and domain-adaptive processing. In the end, our best system clustered x-vector embeddings trained on wideband microphone data followed by Variational-Bayesian refinement, and a speech activity detector specifically trained for this task with in-domain data was found to be the best performing. After presenting these decisions and their final result, we discuss lessons learned and remaining challenges within the lens of this new approach to diarization performance measurement.
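
To illustrate the scoring change the abstract refers to, the following is a minimal, frame-based sketch of diarization error rate (DER) in Python. It is not the official DIHARD scoring tool and is not taken from the paper: the function names (`frame_labels`, `der`), the 10 ms frame step, the `collar` and `skip_overlap` parameters, the brute-force speaker mapping, and the toy segments are all assumptions made for illustration. Calling `der` with `collar=0.0` and `skip_overlap=False` scores all speaker activity; a nonzero collar or `skip_overlap=True` mimics the more lenient practices.

```python
from itertools import permutations


def frame_labels(segments, n_frames, step=0.01):
    """Return, for every frame, the set of speakers active in that frame."""
    frames = [set() for _ in range(n_frames)]
    for start, end, spk in segments:
        first = round(start / step)
        last = min(round(end / step), n_frames)
        for i in range(first, last):
            frames[i].add(spk)
    return frames


def der(ref, hyp, duration, collar=0.0, skip_overlap=False, step=0.01):
    """Frame-based DER for (start, end, speaker) segments given in seconds.

    Hypothetical helper: collar is the half-width (seconds) left unscored
    around every reference boundary; skip_overlap drops frames in which the
    reference has more than one active speaker.
    """
    n = round(duration / step)
    ref_f = frame_labels(ref, n, step)
    hyp_f = frame_labels(hyp, n, step)

    # Decide which frames are scored.
    scored = [True] * n
    c = round(collar / step)
    for start, end, _ in ref:
        for boundary in (round(start / step), round(end / step)):
            for i in range(max(0, boundary - c), min(n, boundary + c)):
                scored[i] = False
    if skip_overlap:
        for i in range(n):
            if len(ref_f[i]) > 1:
                scored[i] = False

    total = sum(len(ref_f[i]) for i in range(n) if scored[i])
    if total == 0:
        raise ValueError("no scored reference speech")

    ref_spk = sorted({s for _, _, s in ref})
    hyp_spk = sorted({s for _, _, s in hyp})
    # Pad with None so surplus hypothesis speakers can stay unmapped.
    padded = ref_spk + [None] * max(0, len(hyp_spk) - len(ref_spk))

    # Brute-force search over one-to-one speaker mappings (assumption:
    # only practical for a handful of speakers).
    best = None
    for perm in permutations(padded, len(hyp_spk)):
        mapping = dict(zip(hyp_spk, perm))
        err = 0
        for i in range(n):
            if not scored[i]:
                continue
            mapped = {mapping[h] for h in hyp_f[i]} - {None}
            correct = len(ref_f[i] & mapped)
            # Covers missed speech, false alarm, and speaker confusion.
            err += max(len(ref_f[i]), len(hyp_f[i])) - correct
        best = err if best is None else min(best, err)
    return best / total


# Toy data (invented): speakers A and B overlap between 4 s and 5 s,
# but the hypothesis assigns only one speaker to every frame.
REF = [(0.0, 5.0, "A"), (4.0, 10.0, "B")]
HYP = [(0.0, 5.0, "x"), (5.0, 10.0, "y")]

print("strict DER:", der(REF, HYP, 10.0))
print("overlap ignored:", der(REF, HYP, 10.0, skip_overlap=True))
```

In this toy case the hypothesis misses one of two overlapping speakers for 1 s out of 11 s of reference speaker time, so strict scoring gives a DER of 1/11 (about 9.1%), while ignoring overlapped regions gives 0. The same system therefore looks perfect under the lenient convention and imperfect under the strict one.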

Original language: English
Pages (from-to): 2808-2812
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2018-September
DOI: 10.21437/Interspeech.2018-1893
Publication status: Published - 2018 Jan 1
Externally published: Yes
Event: 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India
Duration: 2018 Sep 2 to 2018 Sep 6

Keywords

  • Speaker diarization

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modelling and Simulation

Cite this

Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge. / Sell, Gregory; Snyder, David; McCree, Alan; Garcia-Romero, Daniel; Villalba, Jesús; Maciejewski, Matthew; Manohar, Vimal; Dehak, Najim; Povey, Daniel; Watanabe, Shinji; Khudanpur, Sanjeev.

In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2018-September, 01.01.2018, p. 2808-2812.

@article{88bc6b84536a45eb8f0a71f5a199a1d4,
title = "Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge",
keywords = "Speaker diarization",
author = "Gregory Sell and David Snyder and Alan McCree and Daniel Garcia-Romero and Jes{\'u}s Villalba and Matthew Maciejewski and Vimal Manohar and Najim Dehak and Daniel Povey and Shinji Watanabe and Sanjeev Khudanpur",
year = "2018",
month = "1",
day = "1",
doi = "10.21437/Interspeech.2018-1893",
language = "English",
volume = "2018-September",
pages = "2808--2812",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",

}

TY - JOUR
T1 - Diarization is hard
T2 - Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge
AU - Sell, Gregory
AU - Snyder, David
AU - McCree, Alan
AU - Garcia-Romero, Daniel
AU - Villalba, Jesús
AU - Maciejewski, Matthew
AU - Manohar, Vimal
AU - Dehak, Najim
AU - Povey, Daniel
AU - Watanabe, Shinji
AU - Khudanpur, Sanjeev
PY - 2018/1/1
Y1 - 2018/1/1
KW - Speaker diarization
UR - http://www.scopus.com/inward/record.url?scp=85055003941&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85055003941&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2018-1893
DO - 10.21437/Interspeech.2018-1893
M3 - Conference article
AN - SCOPUS:85055003941
VL - 2018-September
SP - 2808
EP - 2812
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SN - 2308-457X
ER -