A Neural Compositional Paradigm for Image Captioning
Authors: Bo Dai, Sanja Fidler, Dahua Lin
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | All experiments are conducted on MS-COCO [5] and Flickr30k [6]. ... We compare the quality of the generated captions on the offline test set of MS-COCO and the test set of Flickr30k, in terms of SPICE (SP) [34], CIDEr (CD) [35], BLEU-4 (B4) [36], ROUGE (RG) [37], and METEOR (MT) [38]. As shown in Table 1, among all methods, CompCap with predicted noun-phrases obtains the best results under the SPICE metric... An ablation study is also conducted on components of the proposed compositional paradigm, as shown in the last three rows of Table 1. *(A hedged BLEU-4 scoring sketch appears after the table.)* |
| Researcher Affiliation | Collaboration | 1 CUHK-SenseTime Joint Lab, The Chinese University of Hong Kong; 2 University of Toronto; 3 Vector Institute; 4 NVIDIA |
| Pseudocode | No | The paper describes procedures and includes figures (e.g., Figures 2 and 3) illustrating structures and formulas, but it does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper does not provide any explicit statement or link regarding the public availability of its source code. |
| Open Datasets | Yes | All experiments are conducted on MS-COCO [5] and Flickr30k [6]. |
| Dataset Splits | Yes | We follow the splits in [31] for both datasets. ... When testing, for all methods we select parameters that obtain best performance on the validation set to generate captions. |
| Hardware Specification | No | The paper mentions using 'ResNet-152 [17] pretrained on ImageNet [33] to extract image features,' but it does not specify any details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using an 'NLP toolkit [32]' and 'ResNet-152 [17]' but does not provide specific version numbers for software components or libraries. |
| Experiment Setup | Yes | Specifically, we use ResNet-152 [17] pretrained on ImageNet [33] to extract image features, where activations of the last convolutional and fully-connected layer are used respectively as the regional and global feature vectors. During training, we fix ResNet-152 without finetuning, and set the learning rate to be 0.0001 for all methods. ... Beam-search of size 3 is used for baselines. As for CompCap, we empirically select n = 7 noun-phrases with top scores to represent the input image... Beam-search of size 3 is used for pair selection, while no beam-search is used for connecting phrase selection. *(Hedged sketches of this feature extraction and of beam search follow the table.)* |
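
The Experiment Setup row quotes a feature-extraction pipeline, but the paper releases no code (see the Open Source Code row). Below is a minimal sketch of that pipeline, assuming PyTorch/torchvision; the layer boundaries, preprocessing, and tensor shapes reflect our reading of the quoted description, not the authors' implementation.

```python
# Hedged sketch: frozen ResNet-152 feature extraction as described in the
# Experiment Setup row. Assumes PyTorch/torchvision; not the authors' code.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ResNet-152 pretrained on ImageNet, fixed (no finetuning).
resnet = models.resnet152(pretrained=True).eval()
for p in resnet.parameters():
    p.requires_grad = False

# Standard ImageNet preprocessing (an assumption; the paper does not specify).
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_features(image: Image.Image):
    """Return (regional, global) features per the quoted description:
    last convolutional activations as regional features, fully-connected
    activations as the global feature vector."""
    x = preprocess(image).unsqueeze(0)            # (1, 3, 224, 224)
    with torch.no_grad():
        x = resnet.maxpool(resnet.relu(resnet.bn1(resnet.conv1(x))))
        x = resnet.layer3(resnet.layer2(resnet.layer1(x)))
        regional = resnet.layer4(x)               # (1, 2048, 7, 7) regional grid
        pooled = torch.flatten(resnet.avgpool(regional), 1)  # (1, 2048)
        global_feat = resnet.fc(pooled)           # (1, 1000) FC activations
    return regional, global_feat
```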
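The same row mentions beam search of size 3 for the baselines and for CompCap's pair selection. Below is a framework-agnostic sketch of that decoding loop; `step_scores` is a hypothetical stand-in for a decoder that returns per-token log-probabilities given a partial caption, since the paper does not specify its decoder interface.

```python
# Hedged sketch: generic beam search with beam size 3. `step_scores` is a
# hypothetical decoder callback, not part of the paper's released code.
from typing import Callable, List, Tuple

def beam_search(step_scores: Callable[[List[int]], List[Tuple[int, float]]],
                eos_token: int, beam_size: int = 3,
                max_len: int = 20) -> List[int]:
    # Each beam entry is (accumulated log-probability, token sequence).
    beams: List[Tuple[float, List[int]]] = [(0.0, [])]
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq and seq[-1] == eos_token:
                candidates.append((logp, seq))  # finished beam carries over
                continue
            for token, token_logp in step_scores(seq):
                candidates.append((logp + token_logp, seq + [token]))
        # Keep only the `beam_size` highest-scoring partial captions.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(seq and seq[-1] == eos_token for _, seq in beams):
            break
    return beams[0][1]  # best completed sequence
```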
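The Research Type row lists SPICE, CIDEr, BLEU-4, ROUGE, and METEOR as the evaluation metrics; in practice these are typically computed with the standard COCO caption evaluation toolkit. As a self-contained illustration of one of them, the sketch below computes corpus-level BLEU-4 with NLTK on made-up captions; it is not the paper's evaluation script.

```python
# Hedged sketch: corpus-level BLEU-4 via NLTK. The captions are illustrative
# placeholders, not data from the paper.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of tokenized reference captions per image, plus one hypothesis.
references = [
    [["a", "man", "riding", "a", "horse"],
     ["a", "person", "rides", "a", "brown", "horse"]],
]
hypotheses = [["a", "man", "rides", "a", "horse"]]

# BLEU-4: uniform weights over 1- to 4-gram precisions, with smoothing so
# captions lacking any matching 4-gram do not score exactly zero.
bleu4 = corpus_bleu(references, hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {bleu4:.4f}")
```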