Learn from Your Neighbor: Learning Multi-modal Mappings from Sparse Annotations

Authors: Ashwin Kalyan, Stefan Lee, Anitha Kannan, Dhruv Batra

ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first explore the properties of our objective in a controlled toy setting and evaluate the performance w.r.t. the true data distribution. For completeness, we then show results on the multi-label-with-missing-labels task on standard multi-label attribute datasets and finally discuss the performance of our method on two visually-grounded language generation tasks: captioning and question generation.
Researcher Affiliation | Collaboration | ¹Georgia Tech, ²Curai, ³Facebook AI Research. Correspondence to: Ashwin Kalyan <ashwinkv@gatech.edu>.
Pseudocode | No | The paper describes its approach using text and mathematical equations, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository.
Open Datasets | Yes | We report results on standard image-captioning datasets Flickr-8k (Hodosh et al., 2013), Flickr-30k (Young et al., 2014) and COCO (Lin et al., 2014a).
Dataset Splits | Yes | We use standard splits (Karpathy & Fei-Fei, 2015) of size 1000 to report results on the first two and a test split of size 5000 for the COCO dataset. ... In our experiments, we use a dataset of size 2048; 512 for training and the rest for evaluation.
Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions software components like 'Adam' and 'Resnet-152', but does not provide specific version numbers for these or any underlying frameworks (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | We use a two-layered neural network with 32 neurons in each layer and train it via SGD with a learning rate of 4e-5 and a momentum of 0.9. ... For both tasks, we train a model similar to Lu et al. (2017) that uses activations from an ImageNet pre-trained Resnet-152 (He et al., 2016) architecture as image representations. For both captioning and VQG, the learnt LSTM model has one layer and 1024-dimensional hidden states, and is optimized using Adam (Kingma & Ba, 2015) with a learning rate of 1e-4. ... The similarities K_ij used to weigh the supervision from neighboring data-points are computed in a learnt space obtained by projecting the image representations through a 2-layered MLP with 512 hidden units in each layer. As discussed in sec. 2.3, the learning rate for this transformation is 10× smaller than that of the LSTM parameters. (See the sketch below this table.)
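
The setup quoted above maps fairly directly onto standard deep-learning code. Below is a minimal sketch, assuming PyTorch; the layer widths, optimizers, and learning rates come from the quoted passage, while the input/output dimensions, vocabulary size, the dot-product form of K_ij, and all identifiers are illustrative assumptions rather than the authors' implementation.

# Hedged sketch of the quoted experiment setup (PyTorch assumed).
# Names and placeholder dimensions are ours, not from the paper's code.
import torch
import torch.nn as nn

# --- Toy setting: two-layer network, 32 units per layer, SGD(lr=4e-5, momentum=0.9) ---
toy_net = nn.Sequential(
    nn.Linear(2, 32), nn.ReLU(),    # input dim 2 is a placeholder for the toy data
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 1),               # output dim 1 is likewise an assumption
)
toy_opt = torch.optim.SGD(toy_net.parameters(), lr=4e-5, momentum=0.9)

# --- Captioning / VQG setting ---
# ResNet-152 image features (2048-d) feed a 1-layer, 1024-d LSTM decoder.
feat_dim, hidden_dim, vocab_size = 2048, 1024, 10000   # vocab size is a placeholder

decoder_lstm = nn.LSTM(input_size=hidden_dim, hidden_size=hidden_dim,
                       num_layers=1, batch_first=True)
word_embed = nn.Embedding(vocab_size, hidden_dim)
output_proj = nn.Linear(hidden_dim, vocab_size)
img_to_hidden = nn.Linear(feat_dim, hidden_dim)

# Similarity network: 2-layer MLP with 512 hidden units; K_ij is computed between
# projected image representations (the dot product below is our assumption).
sim_mlp = nn.Sequential(
    nn.Linear(feat_dim, 512), nn.ReLU(),
    nn.Linear(512, 512),
)

def pairwise_similarity(img_feats):
    """K_ij over a batch of image features, computed in the learnt space."""
    z = sim_mlp(img_feats)                 # (B, 512)
    z = nn.functional.normalize(z, dim=-1)
    return z @ z.t()                       # (B, B) similarity matrix

# Adam with lr 1e-4 for the LSTM-side parameters; the similarity MLP uses a
# learning rate 10x smaller, expressed here via per-parameter-group options.
lstm_params = (list(decoder_lstm.parameters()) + list(word_embed.parameters())
               + list(output_proj.parameters()) + list(img_to_hidden.parameters()))
optimizer = torch.optim.Adam([
    {"params": lstm_params, "lr": 1e-4},
    {"params": sim_mlp.parameters(), "lr": 1e-5},
])

Per-parameter-group options in torch.optim.Adam are one straightforward way to realize the "10× smaller" learning rate for the similarity transformation without maintaining a second optimizer; the paper itself does not specify how this was implemented.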