Contrastive Vision-Language Pre-training with Limited Resources

Quan Cui, Boyan Zhou, Yu Guo, Weidong Yin, Hao Wu*, Osamu Yoshie, Yubo Chen

*Corresponding author for this work

Research output: Conference contribution

Abstract

Pioneering dual-encoder pre-training works (e.g., CLIP and ALIGN) have revealed the potential of aligning multi-modal representations with contrastive learning. However, these works require a tremendous amount of data and computational resources (e.g., billion-level web data and hundreds of GPUs), which prevents researchers with limited resources from reproducing them and exploring further. To this end, we propose a stack of novel methods, which significantly cut down the heavy resource dependency and allow us to conduct dual-encoder multi-modal representation alignment with limited resources. In addition, we provide a reproducible baseline with competitive results, namely ZeroVL, using only 14M samples from publicly accessible academic datasets and 8 V100 GPUs. We also collect 100M web data for pre-training and achieve results comparable or superior to those of state-of-the-art methods, further demonstrating the effectiveness of our methods on large-scale data. We hope that this work will provide useful data points and experience for future research in contrastive vision-language pre-training. Code is available at https://github.com/zerovl/ZeroVL.

Original language: English
Title of host publication: Computer Vision – ECCV 2022 - 17th European Conference, 2022, Proceedings
Editors: Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, Tal Hassner
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 236-253
Number of pages: 18
ISBN (Print): 9783031200588
DOI
Publication status: Published - 2022
Event: 17th European Conference on Computer Vision, ECCV 2022 - Tel Aviv, Israel
Duration: 23 Oct 2022 to 27 Oct 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13696 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 17th European Conference on Computer Vision, ECCV 2022
Country/Territory: Israel
City: Tel Aviv
Period: 22/10/23 to 22/10/27

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
