Learning Unsupervised Visual Grounding Through Semantic Self-Supervision
Authors: Syed Ashar Javed, Shreyas Saxena, Vineet Gandhi
IJCAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present thorough quantitative and qualitative experiments to demonstrate the efficacy of our approach and show a 5.6% improvement over the current state of the art on Visual Genome dataset, a 5.8% improvement on the Refer It Game dataset and comparable to state-of-art performance on the Flickr30k dataset. |
| Researcher Affiliation | Academia | 1The Robotics Institute, Carnegie Mellon University 2CVIT, Kohli Center of Intelligent Systems (KCIS), IIIT Hyderabad sajaved@andrew.cmu.edu, shreyas.saxena2@gmail.com, vgandhi@iiit.ac.in |
| Pseudocode | No | The paper describes mathematical formulations and model architecture but does not include any distinct pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any specific links or explicit statements about releasing the source code for the methodology described. |
| Open Datasets | Yes | We test our method on the Visual Genome [Krishna et al., 2017], the Refer It Game [Kazemzadeh et al., 2014] and the Flickr30k Entities [Plummer et al., 2015] datasets |
| Dataset Splits | No | The paper mentions using the "validation set of MS-COCO" as part of its test set, but does not specify a separate validation split used for model training or hyperparameter tuning, which limits reproducibility of the training procedure. |
| Hardware Specification | No | The paper does not specify any particular hardware components such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using a "VGG16" model and a "Google 1 Billion trained language model", but does not provide specific version numbers for these or other software dependencies, such as Python libraries or frameworks. |
| Experiment Setup | Yes | In the encoder, the values of p, q, r, s from Equation 2 are taken as 512, 128, 32, 1 respectively. The concept vocabulary used for the softmax based loss is taken from the most frequently occurring nouns. ... around 95% of the phrases are accounted for by the top 2000 concepts, which is used as the softmax size. ... For the discussion in this section, we use the shorthand IC (independent concept only), CC (common concept only) and ICC (independent and common concept) for the three loss types from Table 3. We train our model with the IC and CC loss separately, keeping everything else in the pipeline fixed. For all three settings, we vary the concept batch size k and observe some interesting trends. |
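To make the quoted experiment-setup numbers concrete, below is a minimal sketch, assuming a PyTorch-style implementation, of a scoring head whose hidden sizes follow the p, q, r, s = 512, 128, 32, 1 values and a 2000-way softmax over the most frequent noun "concepts". The module and variable names (`RegionPhraseScorer`, `score_mlp`, `concept_head`) are hypothetical; the paper does not release code, so this is illustrative only and not the authors' implementation.

```python
# Minimal sketch (not the authors' code): an attention-style scoring head with the
# 512 -> 128 -> 32 -> 1 hidden sizes quoted above, plus a 2000-way classifier over
# the top concept vocabulary. All names are hypothetical.
import torch
import torch.nn as nn

class RegionPhraseScorer(nn.Module):
    def __init__(self, feat_dim=512, vocab_size=2000):
        super().__init__()
        # 512 -> 128 -> 32 -> 1 projection producing one relevance score per region
        self.score_mlp = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )
        # Classifier over the top-2000 concept vocabulary (softmax applied in the loss)
        self.concept_head = nn.Linear(feat_dim, vocab_size)

    def forward(self, region_feats):
        # region_feats: (num_regions, feat_dim) visual features for one image
        scores = self.score_mlp(region_feats).squeeze(-1)        # (num_regions,)
        attn = torch.softmax(scores, dim=0)                      # attention over regions
        pooled = (attn.unsqueeze(-1) * region_feats).sum(dim=0)  # attended feature
        concept_logits = self.concept_head(pooled)               # (vocab_size,)
        return attn, concept_logits

# Usage example: score 36 candidate regions and compute a softmax-based concept loss
model = RegionPhraseScorer()
attn, logits = model(torch.randn(36, 512))
loss = nn.functional.cross_entropy(logits.unsqueeze(0), torch.tensor([7]))
```

The 2000-class softmax size reflects the quoted observation that the top 2000 concepts cover roughly 95% of the phrases; the IC/CC/ICC loss variants and the concept batch size k described in the paper are not modeled in this sketch.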