Cycle-Consistency Learning for Captioning and Grounding
Authors: Ning Wang, Jiajun Deng, Mingbo Jia
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that our fully supervised grounding model achieves state-of-the-art performance, and the semi-weakly supervised one also exhibits competitive performance compared to the fully supervised counterparts. |
| Researcher Affiliation | Collaboration | Ning Wang¹, Jiajun Deng², Mingbo Jia¹ (¹Huawei Inc.; ²University of Adelaide, Australian Institute for Machine Learning) |
| Pseudocode | No | The paper describes its methods but does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | No | The paper does not include an unambiguous statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | In the pre-training stage, we collect the image-text pairs from Visual Genome (Krishna et al. 2017), COCO (Lin et al. 2014), SBU (Ordonez, Kulkarni, and Berg 2011), Conceptual 3M (Sharma et al. 2018), and a filtered version of LAION (115M images) (Schuhmann et al. 2021). RefCOCO (Yu et al. 2016), RefCOCO+ (Yu et al. 2016), and RefCOCOg (Mao et al. 2016). |
| Dataset Splits | Yes | Following the official setting, RefCOCO and RefCOCO+ are split into the train set, validation set, testA set, and testB set. RefCOCOg includes the train set, validation set, and test set. |
| Hardware Specification | Yes | In the pre-training stage, the model is trained on 32 V100 GPUs for 20 epochs using a batch size of 2880. |
| Software Dependencies | No | The paper mentions software like ViT-B/16, BERT-base, and the AdamW optimizer, but does not specify their version numbers (e.g., 'PyTorch 1.9' or 'TensorFlow 2.x'). |
| Experiment Setup | Yes | In the pre-training stage, the model is trained on 32 V100 GPUs for 20 epochs using a batch size of 2880. We use the AdamW optimizer (Loshchilov and Hutter 2017) with a weight decay of 0.05. The learning rate is warmed up to 3×10⁻⁴ and decayed linearly with a rate of 0.85. We take random image crops of resolution 224×224 during pre-training. In the fine-tuning stage, we train the model using a small learning rate of 1×10⁻⁵ and linearly decay it. For fair comparisons, following (Deng et al. 2021; Li et al. 2022), the input image resolutions are set to 640×640 and 384×384 when evaluating grounding and captioning tasks, respectively. The captioning model adopts the beam search strategy (beam size = 3) in all experiments. The proposed cycle-consistency model is fine-tuned for 20 epochs. |
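
For readers checking the reported hyperparameters, the sketch below assembles the optimizer and schedule values quoted in the Experiment Setup row. It is a minimal sketch assuming PyTorch; the function name, the warm-up length, and the reading of "a rate of 0.85" as a per-epoch multiplicative decay factor are assumptions not stated in the paper.

```python
# Minimal sketch (assumed PyTorch) of the reported setup: AdamW with weight
# decay 0.05, warm-up to a peak learning rate of 3e-4 (pre-training) or 1e-5
# (fine-tuning), then decay with a factor of 0.85.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def build_optimizer_and_scheduler(model: torch.nn.Module,
                                  peak_lr: float = 3e-4,   # 1e-5 for fine-tuning
                                  weight_decay: float = 0.05,
                                  warmup_epochs: int = 1,  # assumed; not stated in the paper
                                  decay_rate: float = 0.85):
    optimizer = AdamW(model.parameters(), lr=peak_lr, weight_decay=weight_decay)

    def lr_lambda(epoch: int) -> float:
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs        # linear warm-up to peak_lr
        return decay_rate ** (epoch - warmup_epochs)  # assumed per-epoch multiplicative decay

    return optimizer, LambdaLR(optimizer, lr_lambda)
```

In such a setup, `scheduler.step()` would be called once per epoch over the reported 20 epochs; the 32-GPU batch size of 2880, the 224×224 pre-training crops, and the beam search (beam size 3) at captioning inference are separate details not covered by this sketch.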