A Neural Compositional Paradigm for Image Captioning

Authors: Bo Dai, Sanja Fidler, Dahua Lin

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | All experiments are conducted on MS-COCO [5] and Flickr30k [6]. ... We compare the quality of the generated captions on the offline test set of MS-COCO and the test set of Flickr30k, in terms of SPICE (SP) [34], CIDEr (CD) [35], BLEU-4 (B4) [36], ROUGE (RG) [37], and METEOR (MT) [38]. As shown in Table 1, among all methods, CompCap with predicted noun-phrases obtains the best results under the SPICE metric... An ablation study is also conducted on components of the proposed compositional paradigm, as shown in the last three rows of Table 1. (A metric-computation sketch follows the table.)
Researcher Affiliation | Collaboration | 1 CUHK-SenseTime Joint Lab, The Chinese University of Hong Kong; 2 University of Toronto; 3 Vector Institute; 4 NVIDIA
Pseudocode | No | The paper describes procedures and includes figures (e.g., Figures 2 and 3) illustrating structures and formulas, but it does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | No | The paper does not provide any explicit statement or link regarding the public availability of its source code.
Open Datasets | Yes | All experiments are conducted on MS-COCO [5] and Flickr30k [6].
Dataset Splits | Yes | We follow the splits in [31] for both datasets. ... When testing, for all methods we select parameters that obtain best performance on the validation set to generate captions. (A split-loading sketch follows the table.)
Hardware Specification | No | The paper mentions using 'ResNet-152 [17] pretrained on ImageNet [33] to extract image features,' but it does not specify any details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using an 'NLP toolkit [32]' and 'ResNet-152 [17]' but does not provide specific version numbers for software components or libraries.
Experiment Setup | Yes | Specifically, we use ResNet-152 [17] pretrained on ImageNet [33] to extract image features, where activations of the last convolutional and fully-connected layer are used respectively as the regional and global feature vectors. During training, we fix ResNet-152 without finetuning, and set the learning rate to be 0.0001 for all methods. ... Beam-search of size 3 is used for baselines. As for CompCap, we empirically select n = 7 noun-phrases with top scores to represent the input image... Beam-search of size 3 is used for pair selection, while no beam-search is used for connecting phrase selection. (Feature-extraction and beam-search sketches follow the table.)
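
The metrics quoted in the Research Type row (SPICE, CIDEr, BLEU-4, ROUGE, METEOR) form the standard MS-COCO caption-evaluation suite. Below is a minimal sketch of how they could be computed with the `pycocoevalcap` package; the package choice, the `evaluate_captions` helper, and the dict-based input format are assumptions for illustration, since the paper does not name its evaluation toolkit.

```python
# Sketch only: assumes the pycocoevalcap package; inputs are dicts mapping
# an image id to lists of pre-tokenized caption strings.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.spice.spice import Spice

def evaluate_captions(gts, res):
    """gts: {image_id: [reference captions]}, res: {image_id: [one generated caption]}."""
    scorers = [
        (Spice(), "SPICE"),
        (Cider(), "CIDEr"),
        (Bleu(4), "BLEU"),   # returns BLEU-1..4; the paper reports BLEU-4
        (Rouge(), "ROUGE-L"),
        (Meteor(), "METEOR"),
    ]
    results = {}
    for scorer, name in scorers:
        score, _ = scorer.compute_score(gts, res)
        if name == "BLEU":
            results["BLEU-4"] = score[3]  # pick the 4-gram score from the list
        else:
            results[name] = score
    return results
```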
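
For the Dataset Splits row, "the splits in [31]" are commonly taken to be the Karpathy splits distributed as `dataset_coco.json` and `dataset_flickr30k.json`. The loader below is a sketch under that assumption; the file name, field names, and the common practice of merging 'restval' into 'train' follow those files, not anything stated in this paper.

```python
# Sketch only: assumes the Karpathy-split JSON layout
# ({"images": [{"filename", "split", "sentences": [{"raw", ...}]}]}).
import json
from collections import defaultdict

def load_splits(path):
    """Group images by split; 'restval' is conventionally merged into 'train'."""
    with open(path) as f:
        data = json.load(f)
    splits = defaultdict(list)
    for img in data["images"]:
        split = "train" if img["split"] == "restval" else img["split"]
        splits[split].append({
            "filename": img["filename"],
            "captions": [s["raw"] for s in img["sentences"]],
        })
    return splits

splits = load_splits("dataset_coco.json")
print({k: len(v) for k, v in splits.items()})  # train/val/test image counts
```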
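
The Experiment Setup row describes the feature pipeline precisely enough to sketch. The code below assumes a PyTorch/torchvision implementation (the paper does not state its framework): regional features come from the last convolutional block, the global feature from the fully-connected layer, and the backbone is frozen, matching "we fix ResNet-152 without finetuning".

```python
# Sketch only: framework choice (PyTorch/torchvision) is an assumption.
import torch
import torchvision.models as models

resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.eval()
for p in resnet.parameters():
    p.requires_grad = False  # backbone is fixed, no finetuning

def extract_features(images):
    """images: (B, 3, 224, 224) ImageNet-normalized batch."""
    x = resnet.conv1(images)
    x = resnet.bn1(x); x = resnet.relu(x); x = resnet.maxpool(x)
    x = resnet.layer1(x); x = resnet.layer2(x)
    x = resnet.layer3(x); x = resnet.layer4(x)   # (B, 2048, 7, 7)
    regional = x.flatten(2).transpose(1, 2)      # (B, 49, 2048) regional vectors
    pooled = resnet.avgpool(x).flatten(1)        # (B, 2048)
    global_feat = resnet.fc(pooled)              # fully-connected activations
    return regional, global_feat
```

Under this setup, the quoted learning rate of 0.0001 would apply only to the captioning model's parameters, since the frozen backbone contributes no gradients.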
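
"Beam-search of size 3" refers to the standard decoding procedure sketched below. The `step_fn` interface and all names here are placeholders rather than the paper's implementation, which applies beam search to noun-phrase pair selection rather than plain token decoding.

```python
# Generic beam-search sketch (size 3, as in the quoted setup).
def beam_search(step_fn, start_token, end_token, beam_size=3, max_len=20):
    """step_fn(seq) -> list of (token, log_prob) candidates for extending seq."""
    beams = [([start_token], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, logp in step_fn(seq):
                candidates.append((seq + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates:
            if seq[-1] == end_token:
                finished.append((seq, score))  # hypothesis is complete
            else:
                beams.append((seq, score))     # keep expanding
            if len(beams) == beam_size:
                break
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])[0]
```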