TOA: Task-oriented Active VQA
Authors: Xiaoying Xing, Mingfu Liang, Ying Wu
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments show that our proposed method outperforms the baselines on open-ended knowledge-based VQA datasets and presents a clear reasoning procedure with better interpretability. |
| Researcher Affiliation | Academia | Xiaoying Xing, Mingfu Liang, Ying Wu; Northwestern University, Department of Electrical and Computer Engineering; {xiaoyingxing2026, mingfuliang2020}@u.northwestern.edu, yingwu@northwestern.edu |
| Pseudocode | Yes | The instruction for leveraging the available vision models is Python-like pseudocode that describes each vision function, inspired by the API design in ViperGPT [20]. In this way, the scheduler can better understand what the vision models do, which makes it easier for the scheduler to call the required vision functions. An example: `def filter(_object: str, _property: str) -> bool: '''Presupposes the existence of _object.''' return True if _object possesses _property else False` (a runnable sketch follows the table). |
| Open Source Code | No | The paper mentions using a third-party API (gpt-3.5-turbo) but does not provide any statement or link indicating that the code for its own method is open-sourced. |
| Open Datasets | Yes | We mainly evaluate our proposed method on the knowledge-based VQA dataset OK-VQA [1] and conduct several supplementary experiments on A-OKVQA [2]. OK-VQA is a commonly used knowledge-based VQA dataset that contains 14,055 image-question pairs associated with 14,031 images from the MSCOCO dataset [46]. |
| Dataset Splits | Yes | We conduct our experiments on the validation set. |
| Hardware Specification | No | The paper describes the software models and APIs used (e.g., gpt-3.5-turbo, Grounding DINO, BLIP2, CLIP) but does not provide any specific details about the hardware (CPU, GPU, cloud instances, etc.) on which the experiments were conducted. |
| Software Dependencies | Yes | In the experiments, we implement the task scheduler using gpt-3.5-turbo. For the vision executor, we implement the spatial functions using Grounding DINO [47], which is an open-set object detector. The attribute functions are implemented by the vision-language pre-training models BLIP2 [43] and X-VLM [48]. The representative in-context examples are selected using agglomerative clustering [49] with the cosine metric, and the feature embeddings are extracted by CLIP [41]. |
| Experiment Setup | Yes | We prompt the scheduler with 16 in-context examples in the experiment. The representative in-context examples are selected using agglomerative clustering [49] with the cosine metric, and the feature embeddings are extracted by CLIP [41]. In the inference stage, we follow the in-context example selection approach in [11]: for each input, we compute its cosine similarities with the available examples in the embedding space and select the top-k examples with the highest similarities (a sketch of this selection pipeline follows the table). |
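
The Pseudocode row quotes the paper's Python-like vision-function definition. Below is a minimal runnable sketch of that interface; only the signature and docstring come from the paper, while the function body and the `query_vlm` helper are assumptions standing in for the real vision executor (BLIP2 / X-VLM in the paper).

```python
# Hypothetical sketch of a vision function exposed to the task scheduler.
# The name `filter` mirrors the paper's pseudocode; the body is a placeholder.

def query_vlm(question: str) -> str:
    """Placeholder for a vision-language model call on the current image."""
    raise NotImplementedError("Wire this to the actual vision executor.")


def filter(_object: str, _property: str) -> bool:
    """Presupposes the existence of _object.

    Returns True if _object possesses _property, else False.
    """
    answer = query_vlm(f"Is the {_object} {_property}? Answer yes or no.")
    return answer.strip().lower().startswith("yes")
```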
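
The in-context example selection described in the Software Dependencies and Experiment Setup rows can be sketched as follows. This is a hypothetical implementation assuming precomputed CLIP-style embeddings stored as NumPy arrays; the function names and the centroid-based choice of cluster representatives are our own assumptions, since the paper only states agglomerative clustering with a cosine metric and top-k cosine-similarity retrieval at inference.

```python
# Hypothetical sketch: pick 16 representative in-context examples by clustering,
# then retrieve the top-k most similar examples per query at inference time.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics.pairwise import cosine_similarity


def select_representatives(embeddings: np.ndarray, n_examples: int = 16) -> np.ndarray:
    """Cluster candidate examples (cosine metric) and keep one representative per cluster."""
    labels = AgglomerativeClustering(
        n_clusters=n_examples, metric="cosine", linkage="average"
    ).fit_predict(embeddings)
    reps = []
    for c in range(n_examples):
        idx = np.where(labels == c)[0]
        centroid = embeddings[idx].mean(axis=0, keepdims=True)
        # Representative = cluster member closest to the cluster centroid (assumption).
        reps.append(idx[cosine_similarity(embeddings[idx], centroid).argmax()])
    return np.array(reps)


def top_k_examples(query_emb: np.ndarray, example_embs: np.ndarray, k: int) -> np.ndarray:
    """At inference, return indices of the k examples most similar to the query embedding."""
    sims = cosine_similarity(query_emb.reshape(1, -1), example_embs)[0]
    return np.argsort(-sims)[:k]
```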