A Neural Compositional Paradigm for Image Captioning

Authors: Bo Dai, Sanja Fidler, Dahua Lin

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | All experiments are conducted on MS-COCO [5] and Flickr30k [6]. ... We compare the quality of the generated captions on the offline test set of MS-COCO and the test set of Flickr30k, in terms of SPICE (SP) [34], CIDEr (CD) [35], BLEU-4 (B4) [36], ROUGE (RG) [37], and METEOR (MT) [38]. As shown in Table 1, among all methods, CompCap with predicted noun-phrases obtains the best results under the SPICE metric... An ablation study is also conducted on components of the proposed compositional paradigm, as shown in the last three rows of Table 1. (A metric-computation sketch follows the table.)
Researcher Affiliation | Collaboration | 1 CUHK-SenseTime Joint Lab, The Chinese University of Hong Kong; 2 University of Toronto; 3 Vector Institute; 4 NVIDIA
Pseudocode | No | The paper describes procedures and includes figures (e.g., Figures 2 and 3) illustrating structures and formulas, but it does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper does not provide any explicit statement or link regarding the public availability of its source code.
Open Datasets | Yes | All experiments are conducted on MS-COCO [5] and Flickr30k [6].
Dataset Splits | Yes | We follow the splits in [31] for both datasets. ... When testing, for all methods we select parameters that obtain best performance on the validation set to generate captions. (A split-loading sketch follows the table.)
Hardware Specification | No | The paper mentions using 'ResNet-152 [17] pretrained on ImageNet [33] to extract image features,' but it does not specify any details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using an 'NLP toolkit [32]' and 'ResNet-152 [17]' but does not provide specific version numbers for software components or libraries.
Experiment Setup | Yes | Specifically, we use ResNet-152 [17] pretrained on ImageNet [33] to extract image features, where activations of the last convolutional and fully-connected layer are used respectively as the regional and global feature vectors. During training, we fix ResNet-152 without finetuning, and set the learning rate to be 0.0001 for all methods. ... Beam-search of size 3 is used for baselines. As for CompCap, we empirically select n = 7 noun-phrases with top scores to represent the input image... Beam-search of size 3 is used for pair selection, while no beam-search is used for connecting phrase selection. (Feature-extraction and beam-search sketches follow the table.)
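
The metrics quoted in the Research Type row (SPICE, CIDEr, BLEU-4, ROUGE, METEOR) form the standard MS-COCO caption-evaluation suite. Below is a minimal sketch of how they could be computed with the `pycocoevalcap` package; the package choice, the `evaluate_captions` helper, and the dict-based input format are assumptions for illustration, since the paper does not name its evaluation toolkit.

```python
# Sketch only: assumes the pycocoevalcap package; inputs are dicts mapping
# an image id to lists of pre-tokenized caption strings.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.spice.spice import Spice

def evaluate_captions(gts, res):
    """gts: {image_id: [reference captions]}, res: {image_id: [one generated caption]}."""
    scorers = [
        (Spice(), "SPICE"),
        (Cider(), "CIDEr"),
        (Bleu(4), "BLEU"),   # returns BLEU-1..4; the paper reports BLEU-4
        (Rouge(), "ROUGE-L"),
        (Meteor(), "METEOR"),
    ]
    results = {}
    for scorer, name in scorers:
        score, _ = scorer.compute_score(gts, res)
        if name == "BLEU":
            results["BLEU-4"] = score[3]  # pick the 4-gram score from the list
        else:
            results[name] = score
    return results
```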
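
For the Dataset Splits row, "the splits in [31]" are commonly taken to be the Karpathy splits distributed as `dataset_coco.json` and `dataset_flickr30k.json`. The loader below is a sketch under that assumption; the file name, field names, and the common practice of merging 'restval' into 'train' follow those files, not anything stated in this paper.

```python
# Sketch only: assumes the Karpathy-split JSON layout
# ({"images": [{"filename", "split", "sentences": [{"raw", ...}]}]}).
import json
from collections import defaultdict

def load_splits(path):
    """Group images by split; 'restval' is conventionally merged into 'train'."""
    with open(path) as f:
        data = json.load(f)
    splits = defaultdict(list)
    for img in data["images"]:
        split = "train" if img["split"] == "restval" else img["split"]
        splits[split].append({
            "filename": img["filename"],
            "captions": [s["raw"] for s in img["sentences"]],
        })
    return splits

splits = load_splits("dataset_coco.json")
print({k: len(v) for k, v in splits.items()})  # train/val/test image counts
```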
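
The Experiment Setup row describes the feature pipeline precisely enough to sketch. The code below assumes a PyTorch/torchvision implementation (the paper does not state its framework): regional features come from the last convolutional block, the global feature from the fully-connected layer, and the backbone is frozen, matching "we fix ResNet-152 without finetuning".

```python
# Sketch only: framework choice (PyTorch/torchvision) is an assumption.
import torch
import torchvision.models as models

resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.eval()
for p in resnet.parameters():
    p.requires_grad = False  # backbone is fixed, no finetuning

def extract_features(images):
    """images: (B, 3, 224, 224) ImageNet-normalized batch."""
    x = resnet.conv1(images)
    x = resnet.bn1(x); x = resnet.relu(x); x = resnet.maxpool(x)
    x = resnet.layer1(x); x = resnet.layer2(x)
    x = resnet.layer3(x); x = resnet.layer4(x)   # (B, 2048, 7, 7)
    regional = x.flatten(2).transpose(1, 2)      # (B, 49, 2048) regional vectors
    pooled = resnet.avgpool(x).flatten(1)        # (B, 2048)
    global_feat = resnet.fc(pooled)              # fully-connected activations
    return regional, global_feat
```

Under this setup, the quoted learning rate of 0.0001 would apply only to the captioning model's parameters, since the frozen backbone contributes no gradients.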
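
"Beam-search of size 3" refers to the standard decoding procedure sketched below. The `step_fn` interface and all names here are placeholders rather than the paper's implementation, which applies beam search to noun-phrase pair selection rather than plain token decoding.

```python
# Generic beam-search sketch (size 3, as in the quoted setup).
def beam_search(step_fn, start_token, end_token, beam_size=3, max_len=20):
    """step_fn(seq) -> list of (token, log_prob) candidates for extending seq."""
    beams = [([start_token], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, logp in step_fn(seq):
                candidates.append((seq + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates:
            if seq[-1] == end_token:
                finished.append((seq, score))  # hypothesis is complete
            else:
                beams.append((seq, score))     # keep expanding
            if len(beams) == beam_size:
                break
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])[0]
```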