Multi-modal joint embedding for fashion product retrieval

A. Rubio, Longlong Yu, Edgar Simo Serra, F. Moreno-Noguer

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

Finding a product in the fashion world can be a daunting task. Every day, e-commerce sites are updated with thousands of images and their associated metadata (textual information), deepening the problem, which becomes akin to finding a needle in a haystack. In this paper, we leverage both the images and the textual metadata and propose a joint multi-modal embedding that maps both text and images into a common latent space. Distances in the latent space correspond to similarity between products, allowing us to perform retrieval directly in this latent space, which is both efficient and accurate. We train this embedding on large-scale real-world e-commerce data by both minimizing the distance between related products and using auxiliary classification networks that encourage the embedding to have semantic meaning. We compare against existing approaches and show significant improvements in retrieval tasks on a large-scale e-commerce dataset. We also provide an analysis of the different metadata.
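
To make the approach concrete, below is a minimal sketch (PyTorch-style) of a two-branch joint embedding of the kind the abstract describes. All names, layer sizes, and the specific choice of a contrastive loss with auxiliary classifiers are illustrative assumptions, not the authors' actual implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class JointEmbedding(nn.Module):
        """Projects precomputed image and text features into a shared latent space."""
        def __init__(self, img_dim=2048, txt_dim=300, embed_dim=128, num_classes=50):
            super().__init__()
            # One projection head per modality into the common latent space.
            self.img_proj = nn.Sequential(nn.Linear(img_dim, 512), nn.ReLU(),
                                          nn.Linear(512, embed_dim))
            self.txt_proj = nn.Sequential(nn.Linear(txt_dim, 512), nn.ReLU(),
                                          nn.Linear(512, embed_dim))
            # Auxiliary classifiers (e.g. over product categories) that push the
            # latent space toward semantically meaningful directions.
            self.img_cls = nn.Linear(embed_dim, num_classes)
            self.txt_cls = nn.Linear(embed_dim, num_classes)

        def forward(self, img_feat, txt_feat):
            z_img = F.normalize(self.img_proj(img_feat), dim=-1)
            z_txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
            return z_img, z_txt, self.img_cls(z_img), self.txt_cls(z_txt)

    def loss_fn(z_img, z_txt, logits_img, logits_txt, labels, match, margin=0.2):
        # Contrastive term: pull embeddings of related image/text pairs together
        # (match == 1) and push unrelated pairs at least `margin` apart.
        dist = (z_img - z_txt).pow(2).sum(dim=-1)
        contrastive = torch.where(match.bool(), dist,
                                  F.relu(margin - dist.sqrt()).pow(2)).mean()
        # Auxiliary classification terms give the embedding semantic meaning.
        aux = F.cross_entropy(logits_img, labels) + F.cross_entropy(logits_txt, labels)
        return contrastive + aux

Retrieval then reduces to nearest-neighbour search: embed a query (an image or a piece of text) and rank catalogue products by their distance to it in the latent space.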

Original language: English
Title of host publication: 2017 IEEE International Conference on Image Processing, ICIP 2017 - Proceedings
Publisher: IEEE Computer Society
Pages: 400-404
Number of pages: 5
Volume: 2017-September
ISBN (Electronic): 9781509021758
DOI: 10.1109/ICIP.2017.8296311
Publication status: Published - 2018 Feb 20
Event: 24th IEEE International Conference on Image Processing, ICIP 2017 - Beijing, China
Duration: 2017 Sep 17 → 2017 Sep 20

Other

Other: 24th IEEE International Conference on Image Processing, ICIP 2017
Country: China
City: Beijing
Period: 17/9/17 → 17/9/20


Keywords

  • Multi-modal embedding
  • Neural networks
  • Retrieval

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition
  • Signal Processing

Cite this

Rubio, A., Yu, L., Simo Serra, E., & Moreno-Noguer, F. (2018). Multi-modal joint embedding for fashion product retrieval. In 2017 IEEE International Conference on Image Processing, ICIP 2017 - Proceedings (Vol. 2017-September, pp. 400-404). IEEE Computer Society. https://doi.org/10.1109/ICIP.2017.8296311

