Adaptive Cross-Modal Embeddings for Image-Text Alignment

Authors: Jônatas Wehrmann, Camila Kolling, Rodrigo C. Barros

AAAI 2020, pp. 12313-12320

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results on two large-scale Image-Text alignment datasets show that ADAPT-models outperform all the baseline approaches by large margins.
Researcher Affiliation Academia Jônatas Wehrmann, Camila Kolling, Rodrigo C. Barros. Machine Intelligence and Robotics Research Group, School of Technology, Pontifícia Universidade Católica do Rio Grande do Sul, Av. Ipiranga, 6681, 90619-900, Porto Alegre, RS, Brazil. Email: {jonatas.wehrmann, camila.kolling}@edu.pucrs.br, rodrigo.barros@pucrs.br
Pseudocode No The paper contains mathematical formulations and diagrams but no explicit pseudocode or algorithm blocks.
Open Source Code Yes Code is available at https://github.com/jwehrmann/retrieval.pytorch.
Open Datasets Yes We train and evaluate our models in two large-scale multimodal datasets, namely MS COCO (Lin et al. 2014) and Flickr30k (Plummer et al. 2015).
Dataset Splits Yes MS COCO [...] comprises 113,287 images for training, 5,000 images for validation, and 5,000 images for testing. Flickr30k comprises roughly 28,000 images for training and 1,000 images each for validation and testing. We used the same splits as those used by state-of-the-art approaches.
Hardware Specification Yes We ran all the time experiments on a server equipped with GTX 1080Ti GPU, 128GB RAM and Intel Core i9.
Software Dependencies No The paper mentions software components like GRU networks, Faster R-CNN, ResNet152, but does not specify version numbers for any software libraries or frameworks (e.g., PyTorch, TensorFlow, Python version).
Experiment Setup Yes To counteract this issue, we make use of the loss introduced in (Wehrmann et al. 2019), which gives more importance to the hard-contrastive instances according to the number of gradient descent steps performed. Such a loss function is as follows: J = τ(ϵ) J_m + (1 − τ(ϵ)) J_s (Eq. 10), where τ = (1 − ηϵ) (Eq. 11). [...] We use a fixed value of k = 36 [for image regions].
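The epoch-adaptive loss quoted above can be sketched in plain Python. This is a minimal illustration based only on the quoted equations, not the authors' implementation: the hinge margin, the decay rate η, and the assumption that J_m is the max-violation (hardest-negative) hinge term while J_s is the sum over all negatives are illustrative guesses.

```python
# Sketch of the blended contrastive loss J = tau(eps)*J_m + (1 - tau(eps))*J_s,
# with tau = 1 - eta*eps (eps = epoch index). All hyperparameter values here
# (eta, margin) are illustrative assumptions, not the paper's settings.

def tau(epoch, eta=0.05):
    """Blend weight from Eq. 11, clamped to stay non-negative."""
    return max(0.0, 1.0 - eta * epoch)

def contrastive_losses(sim, margin=0.2):
    """Hinge losses over an image-caption similarity matrix sim[i][j],
    where diagonal entries sim[i][i] are the matching (positive) pairs.

    Returns (J_m, J_s): the hardest-negative loss and the sum loss.
    """
    n = len(sim)
    j_max = 0.0
    j_sum = 0.0
    for i in range(n):
        row = [
            max(0.0, margin + sim[i][j] - sim[i][i])  # violation of negative j
            for j in range(n) if j != i
        ]
        j_max += max(row)   # J_m: only the hardest negative per anchor
        j_sum += sum(row)   # J_s: every violating negative contributes
    return j_max, j_sum

def adaptive_loss(sim, epoch, eta=0.05, margin=0.2):
    """Eq. 10: blend the two hinge losses according to the epoch."""
    t = tau(epoch, eta)
    j_max, j_sum = contrastive_losses(sim, margin)
    return t * j_max + (1.0 - t) * j_sum
```

At epoch 0 the blend weight is τ = 1, so only the hardest-negative term contributes; as training progresses, the sum term gains weight.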