Leveraging Human Attention in Novel Object Captioning
Authors: Xianyu Chen, Ming Jiang, Qi Zhao
IJCAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments conducted on the nocaps and Held-Out COCO datasets demonstrate that our method considerably outperforms the state-of-the-art novel object captioners. Our source code is available at https://github.com/chenxy99/ANOC. From Section 4 (Experiments): We report experimental details and results to demonstrate the effectiveness of the proposed method. We first present datasets, evaluation metrics, and implementation details. We then present quantitative results in comparison with the state of the art, along with extensive ablation studies for different model components. Finally, we present qualitative examples. |
| Researcher Affiliation | Academia | Xianyu Chen, Ming Jiang, Qi Zhao. Department of Computer Science and Engineering, University of Minnesota. {chen6582, mjiang}@umn.edu, qzhao@cs.umn.edu |
| Pseudocode | No | The paper describes the model architecture and procedures in text and with a diagram (Figure 2), but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our source code is available at https://github.com/chenxy99/ANOC. |
| Open Datasets | Yes | We train our model using the MS COCO training set and evaluate it on the nocaps validation and test sets. The Held-Out COCO dataset [Hendricks et al., 2016] is a subset of MS COCO [Lin et al., 2014] where the following eight object categories are excluded from the training set: bottle, bus, couch, microwave, pizza, racket, suitcase and zebra. |
| Dataset Splits | Yes | The nocaps dataset consists of 15,100 images from the Open Images [Kuznetsova et al., 2018] validation and test sets. It is split into a validation set of 4,500 images and a test set of 10,600 images. We randomly split the COCO validation set and use half of it for validation and the other half for testing, each with 20,252 images. (A brief split sketch follows this table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using specific models and tools like "Faster R-CNN [Ren et al., 2017]" and "Saliency Attentive Model (SAM) [Cornia et al., 2018b]", but it does not provide version numbers for any of the software components or libraries. |
| Experiment Setup | Yes | We set the hyperparameters β = 1 and γ = 0.45 based on a grid search, which consistently leads to the optimal performance across different settings with the best CIDEr scores. We implement the CBS following the nocaps baseline in [Agrawal et al., 2019]: We set beam size k = 5 and initialize the FSM with f = 24 states. We incorporate up to Nmin = 3 selected objects as constraints, including two- or three-word phrases. We select the highest log-probability caption that satisfies at least Nd = 2 constraints. On both datasets, we train the image captioner for 70,000 iterations with a batch size of 150 [Agrawal et al., 2019] samples and then fine-tune it for 210,000 iterations with a 0.00005 learning rate and a batch size of 1 using the proposed C-SCST. In the C-SCST, we use the CIDEr-D score as the reward function, since it agrees well with human judgement. (A brief configuration and C-SCST loss sketch follows this table.) |
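
The 50/50 split of the COCO validation set referenced in the Dataset Splits row is simple to reproduce in spirit, although the paper does not state the seed or exact procedure. The sketch below assumes a seeded shuffle over image ids; the function name, seed, and starting order are illustrative and not taken from the released ANOC code.

```python
import random

def split_coco_val(image_ids, seed=0):
    """Shuffle COCO validation image ids with a fixed seed and split them
    into two equal halves (one for validation, one for testing), matching
    the paper's description of 20,252 images per half. The seed and the
    sorted starting order are assumptions, not the authors' procedure."""
    ids = sorted(image_ids)            # deterministic starting order
    random.Random(seed).shuffle(ids)   # seeded in-place shuffle
    half = len(ids) // 2
    return ids[:half], ids[half:]      # (validation half, testing half)
```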
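
The Experiment Setup row lists the CBS and C-SCST hyperparameters and names CIDEr-D as the reward. The following is a minimal sketch, assuming a generic self-critical (SCST-style) loss with a greedy-decoding baseline; the constrained decoding inside C-SCST, the tensor names, and the config keys are assumptions rather than the authors' implementation.

```python
import torch

# Hyperparameters as reported in the quoted setup; key names are illustrative.
ANOC_CONFIG = {
    "beta": 1.0,                 # grid-searched (role defined in the paper)
    "gamma": 0.45,               # grid-searched (role defined in the paper)
    "beam_size": 5,              # CBS beam size k
    "fsm_states": 24,            # finite-state machine states f
    "max_constraints": 3,        # up to Nmin selected objects as constraints
    "min_satisfied": 2,          # keep captions satisfying at least Nd constraints
    "xe_iterations": 70_000,     # cross-entropy training, batch size 150
    "scst_iterations": 210_000,  # C-SCST fine-tuning, batch size 1, lr 5e-5
}

def scst_loss(sample_logprobs: torch.Tensor,
              sample_cider: torch.Tensor,
              greedy_cider: torch.Tensor) -> torch.Tensor:
    """Self-critical policy-gradient loss with a CIDEr-D reward and a
    greedy-decoding baseline (generic SCST; token masking and the
    constrained decoding of C-SCST are omitted).

    sample_logprobs: (batch, seq_len) token log-probs of sampled captions
    sample_cider:    (batch,) CIDEr-D scores of the sampled captions
    greedy_cider:    (batch,) CIDEr-D scores of the greedy captions
    """
    advantage = (sample_cider - greedy_cider).unsqueeze(1)  # reward minus baseline
    return -(advantage * sample_logprobs).mean()            # REINFORCE objective
```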