Deep Fragment Embeddings for Bidirectional Image Sentence Mapping

Authors: Andrej Karpathy, Armand Joulin, Li Fei-Fei

NeurIPS 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experimental evaluation shows that reasoning on both the global level of images and sentences and the finer level of their respective fragments improves performance on image-sentence retrieval tasks. Additionally, our model provides interpretable predictions for the image-sentence retrieval task since the inferred inter-modal alignment of fragments is explicit." and, from Section 4 (Experiments): "Datasets. We evaluate our image-sentence retrieval performance on Pascal1K [2], Flickr8K [3] and Flickr30K [4] datasets."
Researcher Affiliation | Academia | "Andrej Karpathy, Armand Joulin, Li Fei-Fei. Department of Computer Science, Stanford University, Stanford, CA 94305, USA. {karpathy,ajoulin,feifeili}@cs.stanford.edu"
Pseudocode | No | The paper describes the model and optimization process in text and mathematical equations but does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | "We make our code publicly available."
Open Datasets | Yes | "We evaluate our image-sentence retrieval performance on Pascal1K [2], Flickr8K [3] and Flickr30K [4] datasets."
Dataset Splits | Yes | "For Pascal1K we follow Socher et al. [22] and use 800 images for training, 100 for validation and 100 for testing. For Flickr datasets we use 1,000 images for validation, 1,000 for testing and the rest for training (consistent with [3])." (See the split sketch after the table.)
Hardware Specification | Yes | "On our machine with a Tesla K40 GPU, the RCNN processes one image in approximately 25 seconds."
Software Dependencies | No | The paper mentions software such as Caffe [41] and the Stanford CoreNLP parser but does not provide version numbers for these or other key software components, which reproducibility requires.
Experiment Setup | Yes | "We use Stochastic Gradient Descent (SGD) with mini-batches of 100, momentum of 0.9 and make 20 epochs through the training data. The learning rate is cross-validated and annealed by a fraction of 0.1 for the last two epochs." and "In practice, we found that it was helpful to add a smoothing term n, since short sentences can otherwise have an advantage (we found that n = 5 works well and that this setting is not very sensitive)." (See the training-loop sketch after the table.)
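
To make the quoted dataset splits concrete, here is a minimal Python sketch of how they could be materialized. This is not the authors' code: the `split_pascal1k` and `split_flickr` helper names, the `image_ids` argument, the seed, and the random assignment are all assumptions, since the paper does not state how images were assigned to splits.

```python
import random

def split_pascal1k(image_ids, seed=0):
    """Pascal1K protocol quoted above (following Socher et al. [22]):
    800 train / 100 val / 100 test out of 1,000 images."""
    assert len(image_ids) == 1000
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # assumption: random assignment
    return ids[:800], ids[800:900], ids[900:]

def split_flickr(image_ids, seed=0):
    """Flickr8K/Flickr30K protocol quoted above (consistent with [3]):
    1,000 val / 1,000 test / remainder train."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # assumption: random assignment
    val, test, train = ids[:1000], ids[1000:2000], ids[2000:]
    return train, val, test
```

With Flickr8K's 8,000 images, this protocol leaves 6,000 images for training.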
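As a worked example of the quoted experiment setup, the sketch below wires the stated hyperparameters (mini-batches of 100, momentum 0.9, 20 epochs, learning rate annealed by a factor of 0.1 for the last two epochs) into a PyTorch training loop. The model, loss, data, and `base_lr` value are hypothetical placeholders, not the paper's fragment-embedding objective or its cross-validated learning rate.

```python
import torch

# Hypothetical stand-ins: the paper's fragment-embedding model and its
# ranking objective are not reproduced here.
model = torch.nn.Linear(4096, 1000)   # placeholder for the embedding model
loss_fn = torch.nn.MSELoss()          # placeholder for the ranking loss

base_lr = 0.01  # assumption: the paper cross-validates this value
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)

SMOOTHING_N = 5  # the paper's smoothing term n; it belongs inside the
                 # ranking objective for short sentences, not shown here

# Synthetic stand-in for a loader yielding mini-batches of 100 pairs.
loader = [(torch.randn(100, 4096), torch.randn(100, 1000)) for _ in range(5)]

num_epochs = 20
for epoch in range(num_epochs):
    # Anneal the learning rate by a factor of 0.1 for the last two epochs.
    lr = base_lr * (0.1 if epoch >= num_epochs - 2 else 1.0)
    for group in optimizer.param_groups:
        group["lr"] = lr
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```

The two-epoch annealing schedule is implemented by rewriting the optimizer's learning rate at the top of each epoch, which keeps the sketch free of scheduler-specific assumptions.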