TOA: Task-oriented Active VQA

Authors: Xiaoying Xing, Mingfu Liang, Ying Wu

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments show that our proposed method outperforms the baselines on open-ended knowledge-based VQA datasets and presents a clear reasoning procedure with better interpretability.
Researcher Affiliation | Academia | Xiaoying Xing, Mingfu Liang, Ying Wu; Department of Electrical and Computer Engineering, Northwestern University; {xiaoyingxing2026, mingfuliang2020}@u.northwestern.edu, yingwu@northwestern.edu
Pseudocode | Yes | The instruction for leveraging the available vision models is Python-like pseudocode that describes the vision function, inspired by the API design in ViperGPT [20]. In this way, the scheduler can better understand the function of the vision models, and it makes it easier for the scheduler to call the required vision functions. An example is shown as follows:

def filter(_object: str, _property: str) -> bool:
    '''Presupposes the existence of _object.'''
    return True if _object possesses the _property else False
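To make that API design concrete, here is a minimal runnable sketch (not the authors' code) of how such a Python-like vision function could be backed by an attribute-checking model. The callable vqa_model, the factory make_filter, and the yes/no question template are hypothetical stand-ins for the attribute functions the paper implements with vision-language models.

from typing import Callable

def make_filter(vqa_model: Callable[[str], str]) -> Callable[[str, str], bool]:
    """Build a filter(_object, _property) function in the style of the
    paper's pseudocode: it presupposes the existence of _object and
    checks whether the object possesses _property."""
    def filter(_object: str, _property: str) -> bool:
        # Phrase the attribute check as a yes/no question for the VQA backend.
        question = f"Is the {_object} {_property}?"
        answer = vqa_model(question)
        return answer.strip().lower().startswith("yes")
    return filter

# Usage with a stub backend that always answers "yes":
filter_fn = make_filter(lambda question: "yes")
assert filter_fn("car", "red") is True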
Open Source Code | No | The paper mentions using a third-party API (gpt-3.5-turbo-1) but does not provide any statement or link indicating that the code for the proposed method is open-sourced.
Open Datasets | Yes | We mainly evaluate our proposed method on the knowledge-based VQA dataset OK-VQA [1] and conduct several supplementary experiments on A-OKVQA [2]. OK-VQA is a commonly used knowledge-based VQA dataset that contains 14,055 image-question pairs associated with 14,031 images from the MSCOCO dataset [46].
Dataset Splits | Yes | We conduct our experiments on the validation set.
Hardware Specification | No | The paper describes the software models and APIs used (e.g., gpt-3.5-turbo-1, Grounding DINO, BLIP2, CLIP) but does not provide any specific details about the hardware (CPU, GPU, cloud instances, etc.) on which the experiments were conducted.
Software Dependencies | Yes | In the experiments, we implement the task scheduler using gpt-3.5-turbo-1. For the vision executor, we implement the spatial functions using Grounding DINO [47], which is an open-set object detector. The attribute functions are implemented by vision-language pre-training models BLIP2 [43] and X-VLM [48]. The representative in-context examples are selected using agglomerative clustering [49] with cosine metrics, and the feature embeddings are extracted by CLIP [41].
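The clustering step described above can be sketched as follows (assumptions: scikit-learn >= 1.2 for the metric argument, a precomputed (N, D) NumPy array of CLIP embeddings, and a centroid-based choice of representatives; the function name and that choice are illustrative, not taken from the paper):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def select_representatives(embeddings, n_examples=16):
    """Cluster candidate examples with a cosine metric and return one
    index per cluster (the member closest to its cluster mean)."""
    # L2-normalize so dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    clustering = AgglomerativeClustering(
        n_clusters=n_examples, metric="cosine", linkage="average"
    )
    labels = clustering.fit_predict(normed)

    representatives = []
    for c in range(n_examples):
        members = np.where(labels == c)[0]
        centroid = normed[members].mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        # Representative = member with highest cosine similarity to the centroid.
        representatives.append(int(members[np.argmax(normed[members] @ centroid)]))
    return representatives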
Experiment Setup | Yes | We prompt the scheduler with 16 in-context examples in the experiment. The representative in-context examples are selected using agglomerative clustering [49] with cosine metrics, and the feature embeddings are extracted by CLIP [41]. In the inference stage, we follow the in-context example selection approach in [11]. For each input data, we compute its cosine similarities with the available examples in the embedding space and select the top k examples with the highest similarities.
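The inference-time selection can be sketched as a top-k cosine-similarity lookup over the same CLIP embedding space (a hypothetical helper, not the authors' implementation; k corresponds to the number of in-context examples placed in the prompt):

import numpy as np

def top_k_examples(query_emb, example_embs, k):
    """Return indices of the k in-context examples whose embeddings are
    most cosine-similar to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    e = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    sims = e @ q                      # cosine similarities, shape (N,)
    return np.argsort(-sims)[:k]      # indices of the top-k most similar examples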