Cycle-Consistency Learning for Captioning and Grounding
Authors: Ning Wang, Jiajun Deng, Mingbo Jia
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that our fully supervised grounding model achieves state-of-the-art performance, and the semi-weakly supervised one also exhibits competitive performance compared to the fully supervised counterparts. |
| Researcher Affiliation | Collaboration | Ning Wang¹, Jiajun Deng², Mingbo Jia¹ (¹Huawei Inc.; ²University of Adelaide, Australian Institute for Machine Learning) |
| Pseudocode | No | The paper describes its methods but does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | No | The paper does not include an unambiguous statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | In the pre-training stage, we collect the image-text pairs from Visual Genome (Krishna et al. 2017), COCO (Lin et al. 2014), SBU (Ordonez, Kulkarni, and Berg 2011), Conceptual 3M (Sharma et al. 2018), and a filtered version of LAION (115M images) (Schuhmann et al. 2021). RefCOCO (Yu et al. 2016), RefCOCO+ (Yu et al. 2016), and RefCOCOg (Mao et al. 2016). |
| Dataset Splits | Yes | Following the official setting, RefCOCO and RefCOCO+ are split into the train set, validation set, testA set, and testB set. RefCOCOg includes the train set, validation set, and test set. |
| Hardware Specification | Yes | In the pre-training stage, the model is trained on 32 V100 GPUs for 20 epochs using a batch size of 2880. |
| Software Dependencies | No | The paper mentions software like ViT-B/16, BERT-base, and the AdamW optimizer, but does not specify their version numbers (e.g., 'PyTorch 1.9' or 'TensorFlow 2.x'). |
| Experiment Setup | Yes | In the pre-training stage, the model is trained on 32 V100 GPUs for 20 epochs using a batch size of 2880. We use the AdamW optimizer (Loshchilov and Hutter 2017) with a weight decay of 0.05. The learning rate is warmed up to 3×10⁻⁴ and decayed linearly with a rate of 0.85. We take random image crops of resolution 224×224 during pre-training. In the fine-tuning stage, we train the model using a small learning rate of 1×10⁻⁵ and linearly decay it. For fair comparisons, following (Deng et al. 2021; Li et al. 2022), the input image resolutions are set to 640×640 and 384×384 when evaluating grounding and captioning tasks, respectively. The captioning model adopts the beam search strategy (beam size = 3) in all experiments. The proposed cycle-consistency model is fine-tuned for 20 epochs. |
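
For readers checking the reported hyperparameters, the sketch below assembles the optimizer and schedule values quoted in the Experiment Setup row. It is a minimal sketch assuming PyTorch; the function name, the warm-up length, and the reading of "a rate of 0.85" as a per-epoch multiplicative decay factor are assumptions not stated in the paper.

```python
# Minimal sketch (assumed PyTorch) of the reported setup: AdamW with weight
# decay 0.05, warm-up to a peak learning rate of 3e-4 (pre-training) or 1e-5
# (fine-tuning), then decay with a factor of 0.85.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def build_optimizer_and_scheduler(model: torch.nn.Module,
                                  peak_lr: float = 3e-4,   # 1e-5 for fine-tuning
                                  weight_decay: float = 0.05,
                                  warmup_epochs: int = 1,  # assumed; not stated in the paper
                                  decay_rate: float = 0.85):
    optimizer = AdamW(model.parameters(), lr=peak_lr, weight_decay=weight_decay)

    def lr_lambda(epoch: int) -> float:
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs        # linear warm-up to peak_lr
        return decay_rate ** (epoch - warmup_epochs)  # assumed per-epoch multiplicative decay

    return optimizer, LambdaLR(optimizer, lr_lambda)
```

In such a setup, `scheduler.step()` would be called once per epoch over the reported 20 epochs; the 32-GPU batch size of 2880, the 224×224 pre-training crops, and the beam search (beam size 3) at captioning inference are separate details not covered by this sketch.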