Learning Unsupervised Visual Grounding Through Semantic Self-Supervision

Authors: Syed Ashar Javed, Shreyas Saxena, Vineet Gandhi

IJCAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We present thorough quantitative and qualitative experiments to demonstrate the efficacy of our approach and show a 5.6% improvement over the current state of the art on Visual Genome dataset, a 5.8% improvement on the Refer It Game dataset and comparable to state-of-art performance on the Flickr30k dataset."
Researcher Affiliation | Academia | "1 The Robotics Institute, Carnegie Mellon University; 2 CVIT, Kohli Center of Intelligent Systems (KCIS), IIIT Hyderabad; sajaved@andrew.cmu.edu, shreyas.saxena2@gmail.com, vgandhi@iiit.ac.in"
Pseudocode | No | The paper describes mathematical formulations and the model architecture but does not include any distinct pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any links to, or explicit statements about releasing, the source code for the described methodology.
Open Datasets | Yes | "We test our method on the Visual Genome [Krishna et al., 2017], the Refer It Game [Kazemzadeh et al., 2014] and the Flickr30k Entities [Plummer et al., 2015] datasets"
Dataset Splits | No | The paper mentions using the "validation set of MS-COCO" as part of the test set, but does not specify a separate validation split used for model training or hyperparameter tuning, which limits the reproducibility of training.
Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as GPU models, CPU types, or memory.
Software Dependencies | No | The paper mentions using a "VGG16" model and a "Google 1 Billion trained language model", but does not provide version numbers for these or for other software dependencies such as Python libraries or frameworks.
Experiment Setup | Yes | "In the encoder, the values of p, q, r, s from Equation 2 are taken as 512, 128, 32, 1 respectively. The concept vocabulary used for the softmax based loss is taken from the most frequently occurring nouns. ... around 95% of the phrases are accounted for by the top 2000 concepts, which is used as the softmax size. ... For the discussion in this section, we use the shorthand IC (independent concept only), CC (common concept only) and ICC (independent and common concept) for the three loss types from Table 3. We train our model with the IC and CC loss separately, keeping everything else in the pipeline fixed. For all three settings, we vary the concept batch size k and observe some interesting trends."
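To make the reported setup concrete, below is a minimal PyTorch-style sketch assembled only from the numbers quoted in the Experiment Setup row (p, q, r, s = 512, 128, 32, 1 and a 2000-way concept softmax). Since the authors release no code, all module names, the feature dimensions, and the exact wiring of Equation 2 here are our assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the encoder dimensions and concept-softmax size
# reported in the Experiment Setup row. Only the numbers (512, 128, 32, 1
# for p, q, r, s and a 2000-way concept vocabulary) come from the paper;
# everything else (names, wiring, feature sizes) is assumed for illustration.
import torch
import torch.nn as nn

P, Q, R, S = 512, 128, 32, 1   # projection sizes reported for Equation 2
NUM_CONCEPTS = 2000            # top-2000 nouns cover ~95% of phrases (per the paper)

class AttentionEncoderSketch(nn.Module):
    def __init__(self, visual_dim=512, text_dim=512):
        super().__init__()
        # Stacked projections ending in a scalar attention score per region;
        # the paper's exact formulation in Equation 2 may differ.
        self.score = nn.Sequential(
            nn.Linear(visual_dim + text_dim, P), nn.ReLU(),
            nn.Linear(P, Q), nn.ReLU(),
            nn.Linear(Q, R), nn.ReLU(),
            nn.Linear(R, S),
        )
        # 2000-way classifier over the concept vocabulary for the softmax loss.
        self.concept_head = nn.Linear(visual_dim, NUM_CONCEPTS)

    def forward(self, region_feats, phrase_feat):
        # region_feats: (batch, num_regions, visual_dim); phrase_feat: (batch, text_dim)
        phrase = phrase_feat.unsqueeze(1).expand(-1, region_feats.size(1), -1)
        scores = self.score(torch.cat([region_feats, phrase], dim=-1)).squeeze(-1)
        attn = scores.softmax(dim=1)                        # attention over regions
        attended = (attn.unsqueeze(-1) * region_feats).sum(dim=1)
        concept_logits = self.concept_head(attended)        # fed to the concept softmax loss
        return attn, concept_logits
```

This sketch only fixes the hyperparameters the paper states explicitly; anything it leaves unspecified (optimizer, learning rate, feature extractor outputs, the concept batch size k) would still have to be guessed by a reproducer, which is consistent with the "No" results above for dataset splits, hardware, and software dependencies.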