Image Captioning with Compositional Neural Module Networks
Authors: Junjiao Tian, Jean Oh
IJCAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In a set of experiments on the MSCOCO dataset, the proposed model outperforms a state-of-the-art model across multiple evaluation metrics and, more importantly, presents visually interpretable results. Furthermore, the breakdown of subcategory F-scores of the SPICE metric and human evaluation on Amazon Mechanical Turk show that our compositional module networks effectively generate accurate and detailed captions. |
| Researcher Affiliation | Academia | Junjiao Tian and Jean Oh, Carnegie Mellon University, {junjiaot, hyaejino}@andrew.cmu.edu |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found. |
| Open Source Code | No | No explicit statement or link providing access to the open-source code for the methodology described in this paper was found. |
| Open Datasets | Yes | We use MSCOCO [Lin et al., 2014] for evaluation. MSCOCO contains 82,783 training and 40,504 validation images; for each image, there are 5 human-annotated sentences. We use the widely-used Karpathy Split [Fang et al., 2015] to incorporate a portion of the validation images into the training set. In total, we use 123,287 images for training and leave 5K for testing. As standard practice, we convert all the words in the training set to lowercase and discard words that occur fewer than 5 times or that do not intersect with the GloVe embedding. The result is a vocabulary of 9,947 unique words. For the Visual Genome dataset [Krishna et al., 2017], we reserve 5K images for validation, 5K for testing, and 98K images as training data. (A hedged sketch of this preprocessing step follows the table.) |
| Dataset Splits | Yes | We use MSCOCO [Lin et al., 2014] for evaluation. MSCOCO contains 82,783 training and 40,504 validation images; for each image, there are 5 human-annotated sentences. We use the widely-used Karpathy Split [Fang et al., 2015] to incorporate a portion of the validation images into the training set. In total, we use 123,287 images for training and leave 5K for testing. ... For the Visual Genome dataset [Krishna et al., 2017], we reserve 5K images for validation, 5K for testing, and 98K images as training data. |
| Hardware Specification | No | No specific hardware (e.g., GPU model, CPU type, specific cloud instances) used for running the experiments was mentioned in the paper. |
| Software Dependencies | No | The paper mentions 'pretrained GloVe embedding [Pennington et al., 2014]' and 'Adam optimizer [Kingma and Ba, 2014]', but does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | We set the number of hidden state units in all LSTMs to 512 and the size of the input word embedding to 300. We use a pretrained GloVe embedding [Pennington et al., 2014] and do not finetune the embedding during training. The pretrained embedding is from a public website and consists of 6B tokens in total. In training, we set the initial learning rate to 5e-4 and anneal it to 2.5e-4 by the end of training, starting from the 20th epoch, using a fixed batch size of 128. We use the Adam optimizer [Kingma and Ba, 2014] with β1 set to 0.8. We train the Stacked Noisy-Or Object Detector jointly for 5 epochs and then stop. Training completes in 50K iterations. To ensure a fair comparison, we re-train the Top-Down model using the same hyperparameters as the proposed model. We report results with greedy decoding to reduce the effect of hyperparameter search for different models. We use the top 36 features in each image as inputs to the captioning models and do not finetune the image features during training. (A hedged training-configuration sketch follows the table.) |
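
The vocabulary construction quoted in the Open Datasets row (lowercasing the training captions, discarding words with fewer than 5 occurrences, and discarding words absent from the GloVe vocabulary) can be sketched as follows. This is a minimal Python illustration under assumed inputs, not the authors' released code; the function name, whitespace tokenization, and toy data are invented for the example.

```python
from collections import Counter

def build_vocab(captions, glove_vocab, min_count=5):
    """Build the caption vocabulary following the paper's description:
    lowercase all training captions, drop words occurring fewer than
    `min_count` times, and drop words not covered by the GloVe embedding.

    captions    : iterable of caption strings from the Karpathy training split
    glove_vocab : set of tokens present in the pretrained GloVe embedding
    """
    counts = Counter()
    for caption in captions:
        counts.update(caption.lower().split())  # simple whitespace tokenization (assumed)

    return sorted(
        word for word, count in counts.items()
        if count >= min_count and word in glove_vocab
    )

# Toy usage; the real inputs would be the MSCOCO training captions and the
# GloVe 6B-token vocabulary, which the paper reports yields 9,947 unique words.
toy_captions = ["A man riding a horse", "A man riding a bike", "A horse in a field"]
toy_glove = {"a", "man", "riding", "horse", "bike", "in", "field"}
print(build_vocab(toy_captions, toy_glove, min_count=2))
```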
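
The hyperparameters quoted under Experiment Setup can be collected into a small training-configuration sketch. This assumes a PyTorch-style setup, which the paper does not confirm; the shape of the annealing schedule, the Adam β2 value, and the helper names are assumptions, while the numeric values (512 hidden units, 300-d embedding, batch size 128, learning rate 5e-4 annealed to 2.5e-4 from the 20th epoch, β1 = 0.8) come from the paper.

```python
import torch

# Hyperparameters reported in the paper.
HIDDEN_SIZE = 512        # LSTM hidden state units
EMBED_SIZE = 300         # GloVe word embedding size (not finetuned)
BATCH_SIZE = 128
LR_INITIAL = 5e-4
LR_FINAL = 2.5e-4        # annealed to this value starting from the 20th epoch
ANNEAL_START_EPOCH = 20
BETA1 = 0.8              # Adam beta1; beta2 is not stated, PyTorch default assumed

def make_optimizer(model):
    """Adam with the reported beta1; `model` stands in for the captioning network."""
    return torch.optim.Adam(model.parameters(), lr=LR_INITIAL, betas=(BETA1, 0.999))

def learning_rate(epoch, total_epochs):
    """Anneal from LR_INITIAL to LR_FINAL after ANNEAL_START_EPOCH.

    The exact annealing schedule is not specified in the paper; a linear
    decay is assumed here purely for illustration.
    """
    if epoch < ANNEAL_START_EPOCH:
        return LR_INITIAL
    progress = (epoch - ANNEAL_START_EPOCH) / max(1, total_epochs - ANNEAL_START_EPOCH)
    return LR_INITIAL + progress * (LR_FINAL - LR_INITIAL)

def set_lr(optimizer, lr):
    """Apply the annealed learning rate to every parameter group."""
    for group in optimizer.param_groups:
        group["lr"] = lr
```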