Referring Expression Comprehension Using Language Adaptive Inference

Authors: Wei Su, Peihan Miao, Huanzhang Dou, Yongjian Fu, Xi Li

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on RefCOCO, RefCOCO+, RefCOCOg, and ReferIt show that the proposed method achieves faster inference speed and higher accuracy than state-of-the-art approaches.
Researcher Affiliation | Academia | 1) College of Computer Science & Technology, Zhejiang University; 2) School of Software Technology, Zhejiang University; 3) Shanghai Institute for Advanced Study, Zhejiang University; 4) Shanghai AI Laboratory. {weisuzju, peihan.miao, hzdou, yjfu, xilizju}@zju.edu.cn
Pseudocode | No | The paper describes the architecture and processes of the LADS framework, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about the availability of its source code or a link to a code repository.
Open Datasets | Yes | RefCOCO (Yu et al. 2016), RefCOCO+ (Yu et al. 2016), RefCOCOg (Mao et al. 2016), ReferIt (Kazemzadeh et al. 2014), and a large-scale pre-training dataset.
Dataset Splits | Yes | RefCOCO and RefCOCO+, which are officially split into train, val, testA, and testB sets, have 19,994 images with 142,210 referring expressions and 19,992 images with 141,564 referring expressions, respectively. RefCOCOg (Nagaraja, Morariu, and Davis 2016) has 25,799 images with 95,010 referring expressions and is officially split into train, val, and test sets. ReferIt has 20,000 images collected from SAIAPR TC-12 (Escalante et al. 2010) and is split into train and test sets.
Hardware Specification | Yes | All models are trained on the NVIDIA A100 GPU with CUDA 11.4.
Software Dependencies | Yes | All models are trained on the NVIDIA A100 GPU with CUDA 11.4. For the linguistic backbone, we use the first six layers of BERT (Devlin et al. 2018) provided by Hugging Face (Wolf et al. 2020). (A hedged loading sketch is given after the table.)
Experiment Setup | Yes | The input images are resized to 512 × 512, and the maximum expression length is 40. All models are optimized end-to-end with the AdamW (Loshchilov and Hutter 2020) optimizer using a weight decay of 1e-4. The initial learning rate of the visual and linguistic backbones is 1e-5, and the initial learning rate of the remaining modules is 1e-4. We train for 120 epochs with a batch size of 256, where the learning rate is reduced by a factor of 10 after 90 epochs. In large-scale pre-training and fine-tuning, we train for 40 and 20 epochs with batch sizes of 512 and 256, where the learning rate is reduced by a factor of 10 after 30 and 10 epochs, respectively.
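
The Software Dependencies row says the linguistic backbone is the first six layers of BERT from Hugging Face. A minimal sketch of how such a truncated encoder could be loaded with the transformers library is below; the checkpoint name bert-base-uncased and the example expression are assumptions, since the paper only cites BERT (Devlin et al. 2018) and Hugging Face (Wolf et al. 2020).

```python
# Sketch: keep only the first six transformer layers of BERT for the
# linguistic backbone. "bert-base-uncased" is an assumed checkpoint;
# the paper does not name a specific one.
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# Truncate the encoder to its first six layers (ModuleList slicing).
bert.encoder.layer = bert.encoder.layer[:6]
bert.config.num_hidden_layers = 6

# Expressions are padded/truncated to the 40-token maximum from the setup.
tokens = tokenizer(
    "the man in the red shirt on the left",  # hypothetical expression
    padding="max_length",
    truncation=True,
    max_length=40,
    return_tensors="pt",
)
text_features = bert(**tokens).last_hidden_state  # shape (1, 40, 768)
```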
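
The optimization recipe in the Experiment Setup row maps onto a standard PyTorch AdamW configuration with per-module learning rates and a step schedule. The sketch below uses hypothetical placeholder modules (visual_backbone, linguistic_backbone, head) in place of the actual LADS components; only the hyperparameters (learning rates, weight decay, epochs, schedule milestones) come from the paper's description.

```python
# Sketch of the optimization recipe from the Experiment Setup row.
# The three modules are placeholders, not the paper's architecture.
import torch
from torch import nn

visual_backbone = nn.Linear(512, 256)      # placeholder for the visual backbone
linguistic_backbone = nn.Linear(768, 256)  # placeholder for the 6-layer BERT
head = nn.Linear(256, 4)                   # placeholder for the remaining modules

optimizer = torch.optim.AdamW(
    [
        {"params": visual_backbone.parameters(), "lr": 1e-5},
        {"params": linguistic_backbone.parameters(), "lr": 1e-5},
        {"params": head.parameters(), "lr": 1e-4},
    ],
    weight_decay=1e-4,
)

# 120 epochs with the learning rate divided by 10 after epoch 90
# (pre-training / fine-tuning use 40 / 20 epochs with the drop at 30 / 10).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[90], gamma=0.1)

for epoch in range(120):
    # ... one training epoch at batch size 256 over 512x512 inputs ...
    optimizer.step()   # placeholder for the per-iteration update loop
    scheduler.step()
```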