TY - GEN
T1 - Do We Need Sound for Sound Source Localization?
AU - Oya, Takashi
AU - Iwase, Shohei
AU - Natsume, Ryota
AU - Itazuri, Takahiro
AU - Yamaguchi, Shugo
AU - Morishima, Shigeo
N1 - Funding Information:
Acknowledgements. This research was supported by the JST ACCEL (JPMJAC1602), the JST-Mirai Program (JPMJMI19B2), and JSPS KAKENHI (JP17H06101, JP19H01129, and JP19H04137).
Publisher Copyright:
© 2021, Springer Nature Switzerland AG.
PY - 2021
Y1 - 2021
N2 - Sound source localization typically uses both visual and aural information, yet it remains unclear how much each modality contributes to the result, i.e., do we need both image and sound for sound source localization? To address this question, we develop an unsupervised learning system that solves sound source localization by decomposing the task into two steps: (i) “potential sound source localization”, a step that localizes possible sound sources using only visual information, and (ii) “object selection”, a step that identifies which objects are actually sounding using aural information. Our overall system achieves state-of-the-art performance in sound source localization, and, more importantly, we find that despite being restricted to visual information, step (i) achieves similar performance. From this observation and further experiments, we show that visual information is dominant in “sound” source localization when evaluated with the currently adopted benchmark dataset. Moreover, we show that the majority of sound-producing objects in this dataset can be identified using visual information alone, and thus that the dataset is inadequate for evaluating a system’s capability to leverage aural information. As an alternative, we present an evaluation protocol that enforces both visual and aural information to be leveraged, and we verify this property through several experiments.
KW - Cross-modal learning
KW - Sound source localization
KW - Unsupervised learning
UR - http://www.scopus.com/inward/record.url?scp=85103278103&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85103278103&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-69544-6_8
DO - 10.1007/978-3-030-69544-6_8
M3 - Conference contribution
AN - SCOPUS:85103278103
SN - 9783030695439
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 119
EP - 136
BT - Computer Vision – ACCV 2020 – 15th Asian Conference on Computer Vision, 2020, Revised Selected Papers
A2 - Ishikawa, Hiroshi
A2 - Liu, Cheng-Lin
A2 - Pajdla, Tomas
A2 - Shi, Jianbo
PB - Springer Science and Business Media Deutschland GmbH
T2 - 15th Asian Conference on Computer Vision, ACCV 2020
Y2 - 30 November 2020 through 4 December 2020
ER -