Adaptive Cross-Modal Embeddings for Image-Text Alignment

Authors: Jônatas Wehrmann, Camila Kolling, Rodrigo C. Barros

AAAI 2020, pp. 12313-12320

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results on two large-scale Image-Text alignment datasets show that ADAPT-models outperform all the baseline approaches by large margins.
Researcher Affiliation Academia Jônatas Wehrmann, Camila Kolling, Rodrigo C. Barros. Machine Intelligence and Robotics Research Group, School of Technology, Pontifícia Universidade Católica do Rio Grande do Sul, Av. Ipiranga, 6681, 90619-900, Porto Alegre, RS, Brazil. Email: {jonatas.wehrmann, camila.kolling}@edu.pucrs.br, rodrigo.barros@pucrs.br
Pseudocode No The paper contains mathematical formulations and diagrams but no explicit pseudocode or algorithm blocks.
Open Source Code Yes Code is available at https://github.com/jwehrmann/retrieval.pytorch.
Open Datasets Yes We train and evaluate our models in two large-scale multimodal datasets, namely MS COCO (Lin et al. 2014) and Flickr30k (Plummer et al. 2015).
Dataset Splits Yes MS COCO [...] comprises 113,287 images for training, 5,000 images for validation, and 5,000 images for testing. Flickr30k comprises roughly 28,000 images for training and 1,000 images each for validation and testing. We used the same splits as those used by state-of-the-art approaches.
Hardware Specification Yes We ran all the time experiments on a server equipped with GTX 1080Ti GPU, 128GB RAM and Intel Core i9.
Software Dependencies No The paper mentions software components like GRU networks, Faster R-CNN, ResNet152, but does not specify version numbers for any software libraries or frameworks (e.g., PyTorch, TensorFlow, Python version).
Experiment Setup Yes To counteract this issue, we make use of the loss introduced in (Wehrmann et al. 2019), which gives more importance to the hard-contrastive instances according to the number of gradient descent steps performed. Such a loss function is as follows: J = τ(ϵ) J_m + (1 − τ(ϵ)) J_s (Eq. 10), where τ = (1 − ηϵ) (Eq. 11). [...] We use a fixed value of k = 36 [for image regions].
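The epoch-adaptive loss quoted above can be sketched in plain Python. This is a minimal illustration based only on the quoted equations, not the authors' implementation: the hinge margin, the decay rate η, and the assumption that J_m is the max-violation (hardest-negative) hinge term while J_s is the sum over all negatives are illustrative guesses.

```python
# Sketch of the blended contrastive loss J = tau(eps)*J_m + (1 - tau(eps))*J_s,
# with tau = 1 - eta*eps (eps = epoch index). All hyperparameter values here
# (eta, margin) are illustrative assumptions, not the paper's settings.

def tau(epoch, eta=0.05):
    """Blend weight from Eq. 11, clamped to stay non-negative."""
    return max(0.0, 1.0 - eta * epoch)

def contrastive_losses(sim, margin=0.2):
    """Hinge losses over an image-caption similarity matrix sim[i][j],
    where diagonal entries sim[i][i] are the matching (positive) pairs.

    Returns (J_m, J_s): the hardest-negative loss and the sum loss.
    """
    n = len(sim)
    j_max = 0.0
    j_sum = 0.0
    for i in range(n):
        row = [
            max(0.0, margin + sim[i][j] - sim[i][i])  # violation of negative j
            for j in range(n) if j != i
        ]
        j_max += max(row)   # J_m: only the hardest negative per anchor
        j_sum += sum(row)   # J_s: every violating negative contributes
    return j_max, j_sum

def adaptive_loss(sim, epoch, eta=0.05, margin=0.2):
    """Eq. 10: blend the two hinge losses according to the epoch."""
    t = tau(epoch, eta)
    j_max, j_sum = contrastive_losses(sim, margin)
    return t * j_max + (1.0 - t) * j_sum
```

At epoch 0 the blend weight is τ = 1, so only the hardest-negative term contributes; as training progresses, the sum term gains weight.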