Audio-guided video interpolation via human pose features

Takayuki Nakatsuka, Masatoshi Hamanaka, Shigeo Morishima

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper describes a method that generates in-between frames between two videos of a musical instrument being played. While image generation has achieved successful results in recent years, there is ample scope for improvement in video generation. The keys to improving the quality of video generation are high resolution and temporal coherence. We addressed these requirements by using not only visual information but also aural information. The critical point of our method is the use of two-dimensional pose features to generate high-resolution in-between frames from the input audio. We constructed a deep neural network with a recurrent structure that infers pose features from the input audio, and an encoder-decoder network that pads and generates video frames from those pose features. Our method moreover adopts a fusion approach that generates, pads, and retrieves video frames to improve the output video. Pose features play an essential role both in end-to-end training, owing to their differentiable property, and in combining the generating, padding, and retrieving approaches. We conducted a user study and confirmed that the proposed method is effective in generating interpolated videos.
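The abstract describes a two-stage pipeline whose first stage is a recurrent network (a GRU, per the keywords) that infers two-dimensional pose features from input audio. The paper's actual architecture and dimensions are not given here, so the sketch below is only an illustration of that stage under assumed settings: 128-dimensional audio feature frames, a 64-dimensional hidden state, and an 18-keypoint skeleton are all hypothetical choices, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUPoseRegressor:
    """Sketch of the first stage: audio feature frames -> 2-D pose keypoints.

    All layer sizes and names are hypothetical; this is not the authors' model.
    """

    def __init__(self, audio_dim, hidden_dim, num_joints):
        d, h = audio_dim, hidden_dim
        # GRU gate weights: input-to-hidden (W*) and hidden-to-hidden (U*)
        self.Wz, self.Wr, self.Wc = (rng.standard_normal((h, d)) * 0.1 for _ in range(3))
        self.Uz, self.Ur, self.Uc = (rng.standard_normal((h, h)) * 0.1 for _ in range(3))
        # Linear readout from hidden state to an (x, y) pair per joint
        self.Wo = rng.standard_normal((2 * num_joints, h)) * 0.1
        self.num_joints = num_joints
        self.hidden_dim = h

    def forward(self, audio_seq):
        """audio_seq: (T, audio_dim) -> poses: (T, num_joints, 2)."""
        h = np.zeros(self.hidden_dim)
        poses = []
        for x in audio_seq:
            z = sigmoid(self.Wz @ x + self.Uz @ h)        # update gate
            r = sigmoid(self.Wr @ x + self.Ur @ h)        # reset gate
            c = np.tanh(self.Wc @ x + self.Uc @ (r * h))  # candidate state
            h = (1.0 - z) * h + z * c                     # standard GRU update
            poses.append((self.Wo @ h).reshape(self.num_joints, 2))
        return np.stack(poses)

# Hypothetical setup: 30 audio frames of 128-dim features (e.g. mel-spectrogram
# columns) mapped to an 18-keypoint, OpenPose-style 2-D skeleton per frame.
model = GRUPoseRegressor(audio_dim=128, hidden_dim=64, num_joints=18)
audio = rng.standard_normal((30, 128))
poses = model.forward(audio)
print(poses.shape)  # (30, 18, 2): one 2-D pose per in-between frame
```

In the paper's fusion approach, pose sequences like this would then condition an encoder-decoder that generates, pads, and retrieves the actual video frames; the differentiable pose representation is what lets the whole pipeline train end to end.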

Original language: English
Title of host publication: VISAPP
Editors: Giovanni Maria Farinella, Petia Radeva, Jose Braz
Publisher: SciTePress
Pages: 27-35
Number of pages: 9
ISBN (Electronic): 9789897584022
Publication status: Published - 2020
Event: 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2020 - Valletta, Malta
Duration: 2020 Feb 27 - 2020 Feb 29

Publication series

Name: VISIGRAPP 2020 - Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications
Volume: 5

Conference

Conference: 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2020
Country: Malta
City: Valletta
Period: 20/2/27 - 20/2/29

Keywords

  • Gated Recurrent Unit
  • Generative Adversarial Network
  • Pose Estimation
  • Signal Processing
  • Video Interpolation

ASJC Scopus subject areas

  • Computer Graphics and Computer-Aided Design
  • Computer Science Applications
  • Computer Vision and Pattern Recognition


Cite this

    Nakatsuka, T., Hamanaka, M., & Morishima, S. (2020). Audio-guided video interpolation via human pose features. In G. M. Farinella, P. Radeva, & J. Braz (Eds.), VISAPP (pp. 27-35). (VISIGRAPP 2020 - Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications; Vol. 5). SciTePress.