Interfacing Foundation Models' Embeddings
Authors: Xueyan Zou, Linjie Li, Jianfeng Wang, Jianwei Yang, Mingyu Ding, Junyi Wei, Zhengyuan Yang, Feng Li, Hao Zhang, Shilong Liu, Arul Aravinthan, Yong Jae Lee, Lijuan Wang
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 experiments. Table 2: "Benchmark on Generalizable multi-modal understanding tasks with one model architecture joint training for all." Datasets: "We use COCO (25) as our main training and evaluation dataset, which spans diverse annotation types." |
| Researcher Affiliation | Collaboration | Authors from UW-Madison, Microsoft, UC Berkeley, HKUST, and Tsinghua University: Xueyan Zou, Linjie Li, Jianfeng Wang, Jianwei Yang, Mingyu Ding, Junyi Wei, Zhengyuan Yang, Feng Li, Hao Zhang, Shilong Liu, Arul Aravinthan, Yong Jae Lee, Lijuan Wang |
| Pseudocode | Yes | Table 1: "Pseudo code for Data Engine." It shows the pipeline used to create FIND-Bench, from data preparation and text prompting with GPT-4 to visual prompting with SEEM and the final integrated result. (A hedged sketch of such a pipeline appears after the table.) |
| Open Source Code | Yes | https://github.com/UX-Decoder/FIND, https://github.com/UX-Decoder/vlcore and "Our code is publicly available." (from the NeurIPS Paper Checklist) |
| Open Datasets | Yes | Datasets. We use COCO (25) as our main training and evaluation dataset, which spans diverse annotation types. We make use of the annotations from COCO-panoptic, Ref-COCO (45; 28; 29), COCO-Karpathy (18), and the new datasets generated with the data engine in FIND-Bench. |
| Dataset Splits | No | The paper states using COCO for training and evaluation, and mentions the COCO validation set in Figure 1, but does not explicitly detail the training, validation, and test dataset splits with percentages or sample counts in the main text. The authors indicate these details are available in the public code. |
| Hardware Specification | No | The paper does not provide specific hardware specifications (e.g., GPU models, CPU types, memory) used for running the experiments. It only mentions model sizes and backbones. |
| Software Dependencies | No | The paper does not explicitly list software dependencies with specific version numbers (e.g., PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | Settings: "We benchmark our method on three different model sizes: Tiny (FocalNet), Base (DaViT-d3), and Large (DaViT-d3)... The vision backbone is fixed and reuses the X-Decoder pre-trained weights unless specified as SAM. The language backbone is a fixed LLaMA-7B, unless specified as UniCL. During training, we train the FIND-Interface jointly on all the tasks unless specified..." In Table 2, models are trained at either 384×384 with batch size 384 or 1024×1024 with batch size 192 for all tasks; the other tables report results with a 640×640 training resolution and a batch size of 192. (See the configuration sketch after the table.) |
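The Pseudocode row above refers to the paper's Table 1 data engine, which builds FIND-Bench by preparing COCO annotations, prompting GPT-4 for entity-grounded text, grounding phrases with SEEM, and merging the results. The sketch below is a minimal, hypothetical rendering of that pipeline: all function names (`prepare_coco_annotations`, `prompt_gpt4_for_text`, `prompt_seem_for_masks`) and the `BenchExample` structure are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the FIND-Bench data-engine pipeline described in the
# paper's Table 1 (data preparation -> GPT-4 text prompting -> SEEM visual
# prompting -> integrated result). Names and structures are assumptions.
from dataclasses import dataclass, field


@dataclass
class BenchExample:
    image_id: int
    caption: str = ""                                    # GPT-4-generated grounded caption (assumed)
    entity_phrases: list = field(default_factory=list)   # phrases referring to entities in the caption
    entity_masks: list = field(default_factory=list)     # SEEM segmentation masks per phrase (assumed)


def prepare_coco_annotations(image_id: int) -> dict:
    """Stub: collect COCO panoptic / caption annotations for one image."""
    return {"image_id": image_id, "categories": ["person", "dog"]}


def prompt_gpt4_for_text(annotations: dict):
    """Stub: ask GPT-4 to write a caption whose phrases match the annotated entities."""
    return "a person walking a dog", ["a person", "a dog"]


def prompt_seem_for_masks(image_id: int, phrases: list) -> list:
    """Stub: ground each phrase to a segmentation mask with SEEM."""
    return [f"mask_for_{p.replace(' ', '_')}" for p in phrases]


def build_example(image_id: int) -> BenchExample:
    """Integrate the three stages into one benchmark example."""
    annotations = prepare_coco_annotations(image_id)
    caption, phrases = prompt_gpt4_for_text(annotations)
    masks = prompt_seem_for_masks(image_id, phrases)
    return BenchExample(image_id, caption, phrases, masks)


if __name__ == "__main__":
    print(build_example(139))
```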
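The Experiment Setup row can be summarized as a configuration. The sketch below only restates the quoted settings (model sizes, frozen backbones, resolutions, batch sizes); the `FIND_SETTINGS` dictionary layout is an assumption and does not mirror the actual config files in the FIND repository.

```python
# Minimal configuration sketch of the training settings quoted in the
# Experiment Setup row. Key names are illustrative assumptions.
import json

FIND_SETTINGS = {
    "model_sizes": {
        "tiny": {"vision_backbone": "FocalNet"},
        "base": {"vision_backbone": "DaViT-d3"},
        "large": {"vision_backbone": "DaViT-d3"},
    },
    # Vision backbone is frozen and initialized from X-Decoder weights (or SAM when specified);
    # the language backbone is a frozen LLaMA-7B (or UniCL when specified).
    "vision_backbone_init": "X-Decoder",
    "language_backbone": "LLaMA-7B",
    "joint_training_all_tasks": True,
    # Table 2 uses one of two resolution / batch-size pairs; other tables use 640x640.
    "table2_runs": [
        {"resolution": (384, 384), "batch_size": 384},
        {"resolution": (1024, 1024), "batch_size": 192},
    ],
    "default_run": {"resolution": (640, 640), "batch_size": 192},
}

if __name__ == "__main__":
    print(json.dumps(FIND_SETTINGS, indent=2))
```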