Adaptive Cross-Modal Embeddings for Image-Text Alignment
Authors: Jônatas Wehrmann, Camila Kolling, Rodrigo C. Barros
AAAI 2020, pp. 12313-12320
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on two large-scale Image-Text alignment datasets show that ADAPT-models outperform all the baseline approaches by large margins. |
| Researcher Affiliation | Academia | Jônatas Wehrmann, Camila Kolling, Rodrigo C. Barros. Machine Intelligence and Robotics Research Group, School of Technology, Pontifícia Universidade Católica do Rio Grande do Sul, Av. Ipiranga, 6681, 90619-900, Porto Alegre, RS, Brazil. Email: {jonatas.wehrmann, camila.kolling}@edu.pucrs.br, rodrigo.barros@pucrs.br |
| Pseudocode | No | The paper contains mathematical formulations and diagrams but no explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/jwehrmann/retrieval.pytorch. |
| Open Datasets | Yes | We train and evaluate our models in two large-scale multimodal datasets, namely MS COCO (Lin et al. 2014) and Flickr30k (Plummer et al. 2015). |
| Dataset Splits | Yes | MS COCO [...] comprises 113,287 images for training, 5,000 images for validation, and 5,000 images for testing. Flickr30k comprises roughly 28,000 images for training and 1,000 each for validation and testing. We used the same splits as those used by state-of-the-art approaches. |
| Hardware Specification | Yes | We ran all the time experiments on a server equipped with GTX 1080Ti GPU, 128GB RAM and Intel Core i9. |
| Software Dependencies | No | The paper mentions software components like GRU networks, Faster R-CNN, ResNet152, but does not specify version numbers for any software libraries or frameworks (e.g., PyTorch, TensorFlow, Python version). |
| Experiment Setup | Yes | To counteract this issue, we make use of the loss introduced in (Wehrmann et al. 2019), which gives more importance to the hard-contrastive instances according to the number of gradient descent steps performed. The loss is $J = \tau(\epsilon)\, J_m + (1 - \tau(\epsilon))\, J_s$ (Eq. 10), with the schedule $\tau(\epsilon) = 1 - \eta^{\epsilon}$ (Eq. 11). We use a fixed value of k = 36 [for image regions]. |
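The adaptive loss quoted above combines a sum-based triplet loss $J_s$ with a hard-negative triplet loss $J_m$, shifting weight toward the hard negatives as training progresses. The following is a minimal NumPy sketch of that combination, not the authors' PyTorch implementation: the margin of 0.2, the decay rate `eta = 0.8`, and the reading of Eq. 11 as `tau = 1 - eta**epoch` are all assumptions made for illustration.

```python
import numpy as np

def triplet_losses(scores, margin=0.2):
    """Compute the sum-based (J_s) and hard-negative (J_m) triplet losses
    from an NxN image-caption similarity matrix whose diagonal holds the
    similarities of the matching (positive) pairs."""
    n = scores.shape[0]
    pos = np.diag(scores)
    mask = ~np.eye(n, dtype=bool)  # zero out the positive pairs
    # Hinge cost over caption negatives (rows) and image negatives (columns).
    cost_c = np.maximum(0.0, margin + scores - pos[:, None]) * mask
    cost_i = np.maximum(0.0, margin + scores - pos[None, :]) * mask
    j_s = cost_c.sum() + cost_i.sum()                          # all negatives
    j_m = cost_c.max(axis=1).sum() + cost_i.max(axis=0).sum()  # hardest only
    return j_s, j_m

def adaptive_loss(scores, epoch, eta=0.8, margin=0.2):
    """Adaptive combination from Eqs. 10-11: the weight tau moves from 0
    toward 1 with the epoch count, so training starts on the sum loss and
    gradually emphasizes the hard-contrastive loss. The schedule
    tau = 1 - eta**epoch is an assumed reading of Eq. 11."""
    j_s, j_m = triplet_losses(scores, margin)
    tau = 1.0 - eta ** epoch
    return tau * j_m + (1.0 - tau) * j_s
```

At `epoch = 0` the combination reduces to the pure sum loss; for large epoch counts it approaches the pure hard-negative loss, matching the description that hard-contrastive instances gain importance with the number of gradient descent steps.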