MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering

Junjie Wang, Yatai Ji, Jiaqi Sun, Yujiu Yang, Tetsuya Sakai*

*Corresponding author for this work

Research output: Conference contribution

Abstract

In Visual Question Answering (VQA), existing bilinear methods focus on the interaction between images and questions. As a result, the answers are either spliced into the questions or utilized as labels only for classification. On the other hand, trilinear models such as the CTI model of Do et al. (2019) efficiently utilize the inter-modality information between answers, questions, and images, while ignoring intra-modality information. Inspired by these observations, we propose a new trilinear interaction framework called MIRTT (Learning Multimodal Interaction Representations from Trilinear Transformers), incorporating attention mechanisms for capturing inter-modality and intra-modality relationships. Moreover, we design a two-stage workflow where a bilinear model reduces the free-form, open-ended VQA problem into a multiple-choice VQA problem. Furthermore, to obtain accurate and generic multimodal representations, we pretrain MIRTT with masked language prediction. Our method achieves state-of-the-art performance on the Visual7W Telling task and the VQA-1.0 Multiple Choice task, and outperforms bilinear baselines on the VQA-2.0, TDIUC and GQA datasets.
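The two-stage workflow mentioned in the abstract can be pictured with a minimal sketch: a first-stage model scores every answer in an open vocabulary, the top-k answers become multiple-choice candidates, and a second-stage scorer picks among them. Everything below is an illustrative assumption rather than the authors' implementation: the function names, the value of k, and the toy score tables are hypothetical stand-ins for a bilinear model and for MIRTT.

```python
from typing import Callable, Dict, List, Tuple


def top_k_candidates(stage1_scores: Dict[str, float], k: int) -> List[str]:
    """Stage 1 (hypothetical): keep the k highest-scoring answers."""
    ranked = sorted(stage1_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [answer for answer, _ in ranked[:k]]


def two_stage_answer(
    stage1_scores: Dict[str, float],
    stage2_scorer: Callable[[str], float],
    k: int = 4,
) -> Tuple[str, List[str]]:
    """Reduce open-ended answering to a k-way choice, then rescore the candidates."""
    candidates = top_k_candidates(stage1_scores, k)
    # Stage 2 (hypothetical): a second scorer ranks only the shortlisted answers.
    best = max(candidates, key=stage2_scorer)
    return best, candidates


# Toy numbers standing in for real model outputs (assumed, not from the paper).
stage1 = {"red": 0.40, "blue": 0.35, "green": 0.15, "two": 0.10}
stage2 = {"red": 0.2, "blue": 0.9, "green": 0.1}

answer, shortlist = two_stage_answer(stage1, lambda a: stage2[a], k=3)
print(answer, shortlist)
```

In this toy run the second stage overrides the first stage's top choice, which is the kind of correction a reranking step is meant to allow. The abstract does not state how many candidates are kept or how the two models are trained.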

Original language: English
Title of host publication: Findings of the Association for Computational Linguistics, Findings of ACL
Subtitle of host publication: EMNLP 2021
Editors: Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-Tau Yih
Publisher: Association for Computational Linguistics (ACL)
Pages: 2280-2292
Number of pages: 13
ISBN (Electronic): 9781955917100
Publication status: Published - 2021
Event: 2021 Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021 - Punta Cana, Dominican Republic
Duration: 7 Nov 2021 to 11 Nov 2021

Publication series

Name: Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021

Conference

Conference: 2021 Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021
Country/Territory: Dominican Republic
City: Punta Cana
Period: 7 Nov 2021 to 11 Nov 2021

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language
