MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering

Junjie Wang, Yatai Ji, Jiaqi Sun, Yujiu Yang, Tetsuya Sakai*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In Visual Question Answering (VQA), existing bilinear methods focus on the interaction between images and questions. As a result, the answers are either spliced into the questions or used only as labels for classification. On the other hand, trilinear models such as the CTI model of Do et al. (2019) efficiently utilize the inter-modality information among answers, questions, and images but ignore intra-modality information. Inspired by these observations, we propose a new trilinear interaction framework called MIRTT (Learning Multimodal Interaction Representations from Trilinear Transformers), which incorporates attention mechanisms to capture both inter-modality and intra-modality relationships. Moreover, we design a two-stage workflow in which a bilinear model reduces the free-form, open-ended VQA problem to a multiple-choice VQA problem. Furthermore, to obtain accurate and generic multimodal representations, we pretrain MIRTT with masked language prediction. Our method achieves state-of-the-art performance on the Visual7W Telling task and the VQA-1.0 Multiple Choice task, and outperforms bilinear baselines on the VQA-2.0, TDIUC, and GQA datasets.
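The two-stage workflow mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the top-k candidate selection, the value of k, and the toy scoring functions are all assumptions made for illustration. The abstract states only that a bilinear model turns the open-ended problem into a multiple-choice one, which a trilinear model over image, question, and answer can then solve.

```python
from typing import Callable, Dict, List


def top_k_candidates(scores: Dict[str, float], k: int) -> List[str]:
    """Stage 1 (assumed form): keep the k highest-scoring answers.

    `scores` stands in for the output of a bilinear image-question model
    over a fixed answer vocabulary. Ties are broken alphabetically.
    """
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [answer for answer, _ in ranked[:k]]


def two_stage_answer(
    image: object,
    question: str,
    bilinear_scores: Callable[[object, str], Dict[str, float]],
    trilinear_score: Callable[[object, str, str], float],
    k: int = 4,
) -> str:
    """Reduce open-ended VQA to multiple choice, then pick one candidate.

    Stage 2 scores each (image, question, candidate answer) triple with a
    hypothetical trilinear scorer and returns the best candidate.
    """
    candidates = top_k_candidates(bilinear_scores(image, question), k)
    return max(candidates, key=lambda answer: trilinear_score(image, question, answer))


# Toy stand-ins for the two models (hypothetical values, not real results).
def toy_bilinear_scores(image: object, question: str) -> Dict[str, float]:
    return {"red": 0.5, "blue": 0.3, "green": 0.1, "two": 0.05, "yes": 0.05}


def toy_trilinear_score(image: object, question: str, answer: str) -> float:
    return {"blue": 0.9, "red": 0.6, "green": 0.2}.get(answer, 0.0)


if __name__ == "__main__":
    print(two_stage_answer("img", "What color is the car?",
                           toy_bilinear_scores, toy_trilinear_score, k=2))
```

In this toy setup the second stage can overturn the first stage's top answer, but only among the candidates the first stage kept, so the choice of k bounds what the trilinear model can recover.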

Original language: English
Title of host publication: Findings of the Association for Computational Linguistics, Findings of ACL
Subtitle of host publication: EMNLP 2021
Editors: Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-Tau Yih
Publisher: Association for Computational Linguistics (ACL)
Pages: 2280-2292
Number of pages: 13
ISBN (Electronic): 9781955917100
Publication status: Published - 2021
Event: 2021 Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021 - Punta Cana, Dominican Republic
Duration: 2021 Nov 7 to 2021 Nov 11

Publication series

Name: Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021

Conference

Conference: 2021 Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021
Country/Territory: Dominican Republic
City: Punta Cana
Period: 21/11/7 to 21/11/11

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
