Interfacing Foundation Models' Embeddings

Authors: Xueyan Zou, Linjie Li, Jianfeng Wang, Jianwei Yang, Mingyu Ding, Junyi Wei, Zhengyuan Yang, Feng Li, Hao Zhang, Shilong Liu, Arul Aravinthan, Yong Jae Lee, Lijuan Wang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 experiments. Table 2: Benchmark on generalizable multi-modal understanding tasks with one model architecture and joint training for all. Datasets. We use COCO (25) as our main training and evaluation dataset, which spans diverse annotation types.
Researcher Affiliation | Collaboration | Xueyan Zou, Linjie Li, Jianfeng Wang, Jianwei Yang, Mingyu Ding, Junyi Wei, Zhengyuan Yang, Feng Li, Hao Zhang, Shilong Liu, Arul Aravinthan, Yong Jae Lee, Lijuan Wang (UW-Madison, Microsoft, UC Berkeley, HKUST, and Tsinghua University).
Pseudocode | Yes | Table 1: Pseudo-code for the Data Engine. We show the pipeline used to create FIND-Bench, from data preparation and text prompting with GPT-4 to visual prompting with SEEM and the integrated result. (A hedged sketch of this pipeline follows the table.)
Open Source Code | Yes | https://github.com/UX-Decoder/FIND, https://github.com/UX-Decoder/vlcore, and 'Our code is public available.' (from the NeurIPS Paper Checklist)
Open Datasets | Yes | Datasets. We use COCO (25) as our main training and evaluation dataset, which spans diverse annotation types. We make use of the annotations from COCO-panoptic, Ref-COCO (45; 28; 29), COCO-Karpathy (18), and the new datasets generated with the data engine in FIND-Bench.
Dataset Splits | No | The paper states that COCO is used for training and evaluation, and it mentions the COCO validation set in Figure 1, but it does not explicitly detail the training, validation, and test splits with percentages or sample counts in the main text. The authors indicate these details are available in the public code.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory) used for running the experiments; it only mentions model sizes and backbones.
Software Dependencies | No | The paper does not explicitly list software dependencies with specific version numbers (e.g., PyTorch, TensorFlow, or CUDA versions).
Experiment Setup | Yes | Settings. We benchmark our method on three different model sizes: Tiny (FocalNet), Base (DaViT-d3), and Large (DaViT-d3)... The vision backbone is fixed and reuses the X-Decoder pre-trained weights unless specified as SAM. The language backbone is a fixed LLaMA-7B, unless specified as UniCL. During training, we train the FIND-Interface jointly on all the tasks unless specified... In Table 2, models are trained at either 384x384 resolution with batch size 384 or 1024x1024 resolution with batch size 192 for all tasks. Other tables show results with a 640x640 training resolution and a batch size of 192. (These settings are collected in the configuration sketch below.)
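The Table 1 pseudo-code itself is not reproduced on this page. As a reading aid, here is a minimal, hypothetical Python sketch of the data-engine pipeline described in that caption; the helper names (prompt_gpt4, prompt_seem, build_find_bench) and the BenchEntry structure are this page's own placeholders rather than functions from the FIND codebase, and the two prompting stages are stubbed out.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class BenchEntry:
    """One hypothetical FIND-Bench record: an image, an entity-grounded caption,
    and per-entity masks. Field names are placeholders, not the paper's schema."""
    image_id: int
    caption: str
    entity_masks: Dict[str, list] = field(default_factory=dict)

def prompt_gpt4(annotations: dict) -> str:
    """Stub for the text-prompting stage (GPT-4 in the paper): turn COCO
    annotations into an entity-grounded caption. Here it just joins captions."""
    return " ".join(c.get("caption", "") for c in annotations.get("captions", []))

def prompt_seem(image_path: str, entities: List[str]) -> Dict[str, list]:
    """Stub for the visual-prompting stage (SEEM in the paper): associate each
    caption entity with segmentation masks. Here it returns empty mask lists."""
    return {entity: [] for entity in entities}

def build_find_bench(samples: List[dict]) -> List[BenchEntry]:
    """Integrate the stages: data preparation -> text prompting ->
    visual prompting -> merged benchmark entry."""
    bench = []
    for sample in samples:
        caption = prompt_gpt4(sample["annotations"])
        masks = prompt_seem(sample["image_path"], sample.get("entities", []))
        bench.append(BenchEntry(sample["image_id"], caption, masks))
    return bench
```

For example, build_find_bench([{"image_id": 1, "image_path": "000001.jpg", "annotations": {"captions": []}, "entities": []}]) returns a single BenchEntry with an empty caption and no masks; the real engine would fill these fields via GPT-4 and SEEM.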
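The quoted training settings are scattered across several sentences, so the following collects them into a single illustrative Python dictionary. This is a reading aid only: the key names are this page's own shorthand and do not correspond to configuration keys in the FIND repository.

```python
# Training settings quoted in the Experiment Setup row, collected in one place.
# Key names are illustrative shorthand, not the FIND repository's config schema.
FIND_SETTINGS = {
    "model_sizes": {
        "tiny": "FocalNet",
        "base": "DaViT-d3",
        "large": "DaViT-d3",
    },
    "vision_backbone": "X-Decoder pre-trained weights, kept fixed (SAM when specified)",
    "language_backbone": "LLaMA-7B, kept fixed (UniCL when specified)",
    "training_regime": "joint training on all tasks unless specified otherwise",
    "table_2": [
        {"resolution": (384, 384), "batch_size": 384},
        {"resolution": (1024, 1024), "batch_size": 192},
    ],
    "other_tables": {"resolution": (640, 640), "batch_size": 192},
}
```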