Cycle-Consistency Learning for Captioning and Grounding

Authors: Ning Wang, Jiajun Deng, Mingbo Jia

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that our fully supervised grounding model achieves state-of-the-art performance, and the semi-weakly supervised one also exhibits competitive performance compared to the fully supervised counterparts.
Researcher Affiliation | Collaboration | Ning Wang (Huawei Inc.), Jiajun Deng (University of Adelaide, Australian Institute for Machine Learning), Mingbo Jia (Huawei Inc.)
Pseudocode | No | The paper describes its methods but does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | No | The paper does not include an unambiguous statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | In the pre-training stage, we collect the image-text pairs from Visual Genome (Krishna et al. 2017), COCO (Lin et al. 2014), SBU (Ordonez, Kulkarni, and Berg 2011), Conceptual 3M (Sharma et al. 2018), and a filtered version of LAION (115M images) (Schuhmann et al. 2021). RefCOCO (Yu et al. 2016), RefCOCO+ (Yu et al. 2016), and RefCOCOg (Mao et al. 2016).
Dataset Splits | Yes | Following the official setting, RefCOCO and RefCOCO+ are split into the train set, validation set, test A set, and test B set. RefCOCOg includes the train set, validation set, and test set.
Hardware Specification | Yes | In the pre-training stage, the model is trained on 32 V100 GPUs for 20 epochs using a batch size of 2880.
Software Dependencies | No | The paper names components such as ViT-B/16, BERT-base, and the AdamW optimizer, but does not specify software versions (e.g., 'PyTorch 1.9' or 'TensorFlow 2.x').
Experiment Setup | Yes | In the pre-training stage, the model is trained on 32 V100 GPUs for 20 epochs using a batch size of 2880. We use the AdamW optimizer (Loshchilov and Hutter 2017) with a weight decay of 0.05. The learning rate is warmed up to 3 × 10^-4 and decayed linearly with a rate of 0.85. We take random image crops of resolution 224 × 224 during pre-training. In the fine-tuning stage, we train the model using a small learning rate of 1 × 10^-5 and linearly decay it. For fair comparisons, following (Deng et al. 2021; Li et al. 2022), the input image resolutions are set to 640 × 640 and 384 × 384 when evaluating grounding and captioning tasks, respectively. The captioning model adopts the beam search strategy (beam size = 3) in all experiments. The proposed cycle-consistency model is fine-tuned for 20 epochs.
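
As a concrete reading of the quoted experiment setup, the sketch below shows how the reported optimization hyper-parameters could be wired up in PyTorch. The paper releases no code, so the framework choice, the warm-up length, and the interpretation of "decayed linearly with a rate of 0.85" as a per-epoch multiplicative factor are assumptions made purely for illustration.

import torch

# Reported pre-training hyper-parameters (from the quoted experiment setup).
EPOCHS = 20
BASE_LR = 3e-4          # pre-training peak learning rate
FINETUNE_LR = 1e-5      # fine-tuning learning rate
WEIGHT_DECAY = 0.05
DECAY_RATE = 0.85
WARMUP_EPOCHS = 2       # assumption: the warm-up length is not reported

def build_optimizer_and_scheduler(model, base_lr=BASE_LR):
    # AdamW with the reported weight decay.
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr,
                                  weight_decay=WEIGHT_DECAY)

    def lr_factor(epoch):
        # Linear warm-up toward the peak learning rate ...
        if epoch < WARMUP_EPOCHS:
            return (epoch + 1) / WARMUP_EPOCHS
        # ... then multiply by 0.85 each epoch (one possible reading of the
        # quoted "decayed linearly with a rate of 0.85").
        return DECAY_RATE ** (epoch - WARMUP_EPOCHS + 1)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
    return optimizer, scheduler

For the fine-tuning stage the same construction would be reused with base_lr=FINETUNE_LR, and evaluation would follow the quoted settings: 640 × 640 inputs for grounding, 384 × 384 inputs for captioning, and beam search with a beam size of 3 for caption generation.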