OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling
Authors: Linhui Xiao, Xiaoshan Yang, Fang Peng, Yaowei Wang, Changsheng Xu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our method is validated in the REC, RES, and PG tasks with five widely used datasets, namely three REC/RES datasets (RefCOCO/+/g [101, 62]), as well as two PG datasets (ReferItGame [34] and Flickr30k Entities [68]). |
| Researcher Affiliation | Academia | Linhui Xiao1,2,3, Xiaoshan Yang1,2,3, Fang Peng1,2,3, Yaowei Wang2,4, Changsheng Xu1,2,3 1MAIS, Institute of Automation, Chinese Academy of Sciences 2Pengcheng Laboratory 3School of Artificial Intelligence, University of Chinese Academy of Sciences 4Harbin Institute of Technology (Shenzhen) |
| Pseudocode | Yes | Algorithm 1 Referring-aware Dynamic Masking |
| Open Source Code | Yes | Our code and models are available at https://github.com/linhuixiao/OneRef. |
| Open Datasets | Yes | Our method is validated in the REC, RES, and PG tasks with five widely used datasets, namely three REC/RES datasets (RefCOCO/+/g [101, 62]), as well as two PG datasets (ReferItGame [34] and Flickr30k Entities [68]). |
| Dataset Splits | Yes | Table 1: Comparison with latest SoTA methods on the five datasets for REC/PG tasks with single-dataset fine-tuning setting. We highlight best result of base model in red and bold for large model. Columns: Methods, Venue, Visual Backbone, Language Backbone, RefCOCO (val / testA / testB), RefCOCO+ (val / testA / testB), RefCOCOg (val / test), ReferIt (test), Flickr (test). |
| Hardware Specification | Yes | For MRefM pre-training, the base model took 15 hours on 32 NVIDIA A100 GPUs, while the large model took 50 hours on the same number of GPUs. As for REC/RES transfer fine-tuning training, it took an average of 3 hours for the base model and 8 hours for the large model to process one dataset on 8 A100 GPUs. |
| Software Dependencies | No | The framework and experiments in our study were conducted using PyTorch (no version specified). For NLP parsing, the paper mentions using spaCy, but without a version number. |
| Experiment Setup | Yes | The batch sizes for pre-training the base model and large model are (32, 8), while they are (32, 8) and (16, 6) for transferring to the REC and RES tasks, respectively. Our model is optimized end-to-end by using the AdamW optimizer and a cosine learning scheduler with an initial learning rate of 0.5 × 10⁻⁴ for 110 epochs during the pre-training stage. During the REC/RES transfer stage, the learning rate is 0.3 × 10⁻⁴ with 20 epochs. |
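
The Experiment Setup row quotes the optimizer, learning-rate schedule, and epoch counts used in pre-training. Below is a minimal PyTorch sketch of that configuration (AdamW, cosine schedule, initial LR 0.5 × 10⁻⁴, 110 epochs); the `model` placeholder and the training loop skeleton are hypothetical stand-ins, not the authors' released implementation (which lives at https://github.com/linhuixiao/OneRef).

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Hypothetical placeholder for the grounding model; the real OneRef
# architecture is defined in the authors' repository.
model = nn.Linear(768, 4)

# Pre-training hyperparameters quoted in the table: initial LR 0.5e-4,
# cosine learning-rate schedule, 110 epochs (batch size 32 base / 8 large).
num_epochs = 110
optimizer = AdamW(model.parameters(), lr=0.5e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)

for epoch in range(num_epochs):
    # ... forward/backward passes over the pre-training batches would go here ...
    optimizer.step()   # placeholder step; real code steps once per batch
    scheduler.step()   # decay the learning rate along the cosine curve per epoch
```

For the REC/RES transfer stage described in the same row, the equivalent sketch would simply swap in `lr=0.3e-4` and `num_epochs = 20`.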