Learn from Your Neighbor: Learning Multi-modal Mappings from Sparse Annotations
Authors: Ashwin Kalyan, Stefan Lee, Anitha Kannan, Dhruv Batra
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first explore the properties of our objective in a controlled toy setting and evaluate the performance w.r.t. the true data distribution. For completeness, we then show results on the multi-label with missing labels task on standard multi-label attribute datasets and finally, discuss the performance of our method on two visually-grounded language generation tasks: captioning and question generation. |
| Researcher Affiliation | Collaboration | 1Georgia Tech 2Curai 3Facebook AI Research. Correspondence to: Ashwin Kalyan <ashwinkv@gatech.edu>. |
| Pseudocode | No | The paper describes its approach using text and mathematical equations, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | We report results on standard image-captioning datasets Flickr-8k (Hodosh et al., 2013), Flickr-30k (Young et al., 2014) and COCO (Lin et al., 2014a). |
| Dataset Splits | Yes | We use standard splits (Karpathy & Fei-Fei, 2015) of size 1000 to report results on the first two and a test split of size 5000 for the COCO dataset. ... In our experiments, we use a dataset of size 2048; 512 for training and the rest for evaluation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions software components like 'Adam' and 'Resnet-152', but does not provide specific version numbers for these or any underlying frameworks (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | We use a two-layered neural network with 32 neurons in each layer and train it via SGD with a learning rate of 4e-5 and a momentum of 0.9. ... For both tasks, we train a model similar to Lu et al. (2017) that uses activations from an ImageNet pre-trained Resnet-152 (He et al., 2016) architecture as image representations. For both captioning and VQG, the learnt LSTM model has one layer, 1024-dimensional hidden states and is optimized using Adam (Kingma & Ba, 2015) with a learning rate of 1e-4. ... The similarities Kij used to weigh the supervision from neighboring data-points are computed in a learnt space obtained by projecting the image representations through a 2-layered MLP with 512 hidden units in each layer. As discussed in sec. 2.3, the learning rate for this transformation is 10x smaller compared to the LSTM parameters. (A minimal code sketch of this setup follows the table.) |
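
The "Experiment Setup" row above fixes most hyperparameters but not the framework (see the "Software Dependencies" row). The sketch below assumes PyTorch; the module names (`ToyMLP`, `SimilarityMLP`, `CaptionLSTM`) and the feature/vocabulary dimensions are hypothetical placeholders, not the authors' code. It only illustrates the reported configuration: a two-layer, 32-unit network trained with SGD (lr 4e-5, momentum 0.9) for the toy setting, and for captioning/VQG a one-layer 1024-d LSTM over ResNet-152 features optimized with Adam (lr 1e-4), with the 2-layer, 512-unit similarity MLP trained at a learning rate 10x smaller via a separate parameter group.

```python
# Hedged sketch of the reported experiment setup; PyTorch is an assumption,
# as the paper does not name its framework.
import torch
import torch.nn as nn

# Toy setting: two-layer network, 32 neurons per layer, SGD (lr=4e-5, momentum=0.9).
class ToyMLP(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

toy_model = ToyMLP(in_dim=2, out_dim=2)  # input/output sizes are placeholders
toy_opt = torch.optim.SGD(toy_model.parameters(), lr=4e-5, momentum=0.9)

# Captioning / VQG setting: ResNet-152 image features feed a one-layer LSTM
# with 1024-d hidden states; a 2-layer MLP (512 hidden units) projects the
# image features into the learnt space where the similarities K_ij are computed.
feat_dim, vocab_size, embed_dim = 2048, 10000, 512  # assumed sizes

similarity_mlp = nn.Sequential(       # "SimilarityMLP": 2 layers, 512 units each
    nn.Linear(feat_dim, 512), nn.ReLU(),
    nn.Linear(512, 512),
)

class CaptionLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.img_proj = nn.Linear(feat_dim, embed_dim)
        self.lstm = nn.LSTM(embed_dim, 1024, num_layers=1, batch_first=True)
        self.out = nn.Linear(1024, vocab_size)

    def forward(self, img_feats, tokens):
        # Prepend the projected image feature as the first step of the sequence.
        img = self.img_proj(img_feats).unsqueeze(1)
        seq = torch.cat([img, self.embed(tokens)], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)

caption_model = CaptionLSTM()

# Adam with lr=1e-4 for the LSTM parameters; the similarity transformation is
# trained with a learning rate 10x smaller, expressed here as a parameter group.
optimizer = torch.optim.Adam([
    {"params": caption_model.parameters(), "lr": 1e-4},
    {"params": similarity_mlp.parameters(), "lr": 1e-5},
])
```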