Leveraging Human Attention in Novel Object Captioning
Authors: Xianyu Chen, Ming Jiang, Qi Zhao
IJCAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments conducted on the nocaps and Held-Out COCO datasets demonstrate that our method considerably outperforms the state-of-the-art novel object captioners. Our source code is available at https://github.com/chenxy99/ANOC. From Section 4 (Experiments): We report experimental details and results to demonstrate the effectiveness of the proposed method. We first present datasets, evaluation metrics, and implementation details. We then present quantitative results in comparison with the state of the art, along with extensive ablation studies for different model components. Finally, we present qualitative examples. |
| Researcher Affiliation | Academia | Xianyu Chen, Ming Jiang, Qi Zhao. Department of Computer Science and Engineering, University of Minnesota. {chen6582, mjiang}@umn.edu, qzhao@cs.umn.edu |
| Pseudocode | No | The paper describes the model architecture and procedures in text and with a diagram (Figure 2), but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our source code is available at https://github.com/chenxy99/ANOC. |
| Open Datasets | Yes | We train our model using the MS COCO training set and evaluate it on the nocaps validation and test sets. The Held-Out COCO dataset [Hendricks et al., 2016] is a subset of MS COCO [Lin et al., 2014] where the following eight object categories are excluded from the training set: bottle, bus, couch, microwave, pizza, racket, suitcase and zebra. |
| Dataset Splits | Yes | The nocaps dataset consists of 15,100 images from the Open Images [Kuznetsova et al., 2018] validation and test sets. It is split into a validation set of 4,500 images and a test set of 10,600 images. We randomly split the COCO validation set and use half of it for validation and the other half for testing, each with 20,252 images. (A brief split sketch follows this table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using specific models and tools like "Faster R-CNN [Ren et al., 2017]" and "Saliency Attentive Model (SAM) [Cornia et al., 2018b]", but it does not provide version numbers for any of the software components or libraries. |
| Experiment Setup | Yes | We set the hyperparameters β = 1 and γ = 0.45 based on a grid search, which consistently leads to the optimal performance across different settings with the best CIDEr scores. We implement the CBS following the nocaps baseline in [Agrawal et al., 2019]: We set beam size k = 5 and initialize the FSM with f = 24 states. We incorporate up to Nmin = 3 selected objects as constraints, including two- or three-word phrases. We select the highest log-probability caption that satisfies at least Nd = 2 constraints. On both datasets, we train the image captioner for 70,000 iterations with a batch size of 150 [Agrawal et al., 2019] samples and then fine-tune it for 210,000 iterations with a 0.00005 learning rate and a batch size of 1 using the proposed C-SCST. In the C-SCST, we use the CIDEr-D score as the reward function, since it agrees well with human judgement. (A brief configuration and C-SCST loss sketch follows this table.) |
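
The 50/50 split of the COCO validation set referenced in the Dataset Splits row is simple to reproduce in spirit, although the paper does not state the seed or exact procedure. The sketch below assumes a seeded shuffle over image ids; the function name, seed, and starting order are illustrative and not taken from the released ANOC code.

```python
import random

def split_coco_val(image_ids, seed=0):
    """Shuffle COCO validation image ids with a fixed seed and split them
    into two equal halves (one for validation, one for testing), matching
    the paper's description of 20,252 images per half. The seed and the
    sorted starting order are assumptions, not the authors' procedure."""
    ids = sorted(image_ids)            # deterministic starting order
    random.Random(seed).shuffle(ids)   # seeded in-place shuffle
    half = len(ids) // 2
    return ids[:half], ids[half:]      # (validation half, testing half)
```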
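
The Experiment Setup row lists the CBS and C-SCST hyperparameters and names CIDEr-D as the reward. The following is a minimal sketch, assuming a generic self-critical (SCST-style) loss with a greedy-decoding baseline; the constrained decoding inside C-SCST, the tensor names, and the config keys are assumptions rather than the authors' implementation.

```python
import torch

# Hyperparameters as reported in the quoted setup; key names are illustrative.
ANOC_CONFIG = {
    "beta": 1.0,                 # grid-searched (role defined in the paper)
    "gamma": 0.45,               # grid-searched (role defined in the paper)
    "beam_size": 5,              # CBS beam size k
    "fsm_states": 24,            # finite-state machine states f
    "max_constraints": 3,        # up to Nmin selected objects as constraints
    "min_satisfied": 2,          # keep captions satisfying at least Nd constraints
    "xe_iterations": 70_000,     # cross-entropy training, batch size 150
    "scst_iterations": 210_000,  # C-SCST fine-tuning, batch size 1, lr 5e-5
}

def scst_loss(sample_logprobs: torch.Tensor,
              sample_cider: torch.Tensor,
              greedy_cider: torch.Tensor) -> torch.Tensor:
    """Self-critical policy-gradient loss with a CIDEr-D reward and a
    greedy-decoding baseline (generic SCST; token masking and the
    constrained decoding of C-SCST are omitted).

    sample_logprobs: (batch, seq_len) token log-probs of sampled captions
    sample_cider:    (batch,) CIDEr-D scores of the sampled captions
    greedy_cider:    (batch,) CIDEr-D scores of the greedy captions
    """
    advantage = (sample_cider - greedy_cider).unsqueeze(1)  # reward minus baseline
    return -(advantage * sample_logprobs).mean()            # REINFORCE objective
```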