Referring Transformer: A One-step Approach to Multi-task Visual Grounding

Authors: Muchen Li, Leonid Sigal

NeurIPS 2021

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Extensive experiments and ablations illustrate that our model benefits greatly from contextualized information and multi-task training." |
| Researcher Affiliation | Academia | Muchen Li (muchenli@cs.ubc.ca) and Leonid Sigal (lsigal@cs.ubc.ca); Department of Computer Science, University of British Columbia; Vector Institute for AI; CIFAR AI Chair; NSERC CRC Chair |
| Pseudocode | No | The paper describes its model architecture and components in detail within the text and diagrams, but it does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code for the described methodology, nor does it include links to a code repository. |
| Open Datasets | Yes | RefCOCO/RefCOCO+/RefCOCOg [55], Flickr30k Entities [41], ReferIt [24], and the Visual Genome dataset [25] |
| Dataset Splits | Yes | "On RefCOCO and RefCOCO+ we follow the split used in [55] and report scores on the validation, testA and testB splits. On RefCOCOg, we use the RefCOCO-umd splits proposed in [40]. ... We use splits from [41, 42]. ... We follow the setup in [3] for splitting the train, validation and test sets, resulting in 54k, 6k and 6k referring expressions respectively." |
| Hardware Specification | Yes | "All experiments are conducted using 4 Nvidia 2080Ti GPUs with batch size 32." |
| Software Dependencies | No | The paper mentions using AdamW, BERT, Hugging Face checkpoints, ResNet, and YOLOv3, but does not provide specific version numbers for these software components or for any other libraries/frameworks. |
| Experiment Setup | Yes | "The initial learning rate is set to 1e-4 while the learning rate of the image backbone and context encoder is set to 1e-5. We initialized weights in the transformer encoder and decoder with Xavier initialization [14]. For data augmentation, we scale images such that the longest side is 640 pixels. ... On the Flickr30k dataset, we set the maximum length of the context sentence to 90 and the maximum number of referring phrases to 16. ... We set the maximum length of the context sentence on these two datasets to 40. ... We train the model on the pretraining dataset for 6 epochs ... with batch size 32." |
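For readers who want to mirror the reported setup, below is a minimal PyTorch sketch that encodes the hyperparameters quoted in the Experiment Setup row. The paper releases no code, so `ReferringModelStub` and its module names (`backbone`, `context_encoder`, `transformer`) are hypothetical stand-ins; only the two learning rates, the Xavier initialization of the transformer, and the longest-side-640 resize come from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReferringModelStub(nn.Module):
    """Hypothetical container mirroring the components the paper names."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 64, 3)          # stand-in for the ResNet image backbone
        self.context_encoder = nn.Linear(768, 256)   # stand-in for the BERT context encoder
        self.transformer = nn.Transformer(d_model=256, batch_first=True)


def scale_longest_side(image: torch.Tensor, longest: int = 640) -> torch.Tensor:
    """Resize a CHW image so its longest side is `longest` pixels (the stated augmentation)."""
    _, h, w = image.shape
    scale = longest / max(h, w)
    size = (round(h * scale), round(w * scale))
    return F.interpolate(image.unsqueeze(0), size=size, mode="bilinear",
                         align_corners=False).squeeze(0)


model = ReferringModelStub()

# Xavier initialization for the transformer encoder/decoder weights, as stated.
for p in model.transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

# AdamW with lr 1e-4 overall, but 1e-5 for the image backbone and context encoder.
slow_params = list(model.backbone.parameters()) + list(model.context_encoder.parameters())
slow_ids = {id(p) for p in slow_params}
fast_params = [p for p in model.parameters() if id(p) not in slow_ids]
optimizer = torch.optim.AdamW([
    {"params": fast_params, "lr": 1e-4},
    {"params": slow_params, "lr": 1e-5},
])
```

Per the table, training would then run for 6 epochs on the pretraining dataset with batch size 32 across 4 GPUs; those loop details are not shown here.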