TOA: Task-oriented Active VQA

Authors: Xiaoying Xing, Mingfu Liang, Ying Wu

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments show that our proposed method outperforms the baselines on open-ended knowledge-based VQA datasets and presents a clear reasoning procedure with better interpretability.
Researcher Affiliation | Academia | Xiaoying Xing, Mingfu Liang, Ying Wu; Department of Electrical and Computer Engineering, Northwestern University; {xiaoyingxing2026, mingfuliang2020}@u.northwestern.edu, yingwu@northwestern.edu
Pseudocode | Yes | The instruction for leveraging the available vision models is Python-like pseudocode that describes the vision function, inspired by the API design in ViperGPT [20]. In this way, the scheduler can better understand the function of the vision models, and it makes it easier for the scheduler to call the required vision functions. An example is shown as follows:

def filter(_object: str, _property: str) -> bool:
    '''Presupposes the existence of _object.'''
    return True if _object possesses the _property else False
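To make that API design concrete, here is a minimal runnable sketch (not the authors' code) of how such a Python-like vision function could be backed by an attribute-checking model. The callable vqa_model, the factory make_filter, and the yes/no question template are hypothetical stand-ins for the attribute functions the paper implements with vision-language models.

from typing import Callable

def make_filter(vqa_model: Callable[[str], str]) -> Callable[[str, str], bool]:
    """Build a filter(_object, _property) function in the style of the
    paper's pseudocode: it presupposes the existence of _object and
    checks whether the object possesses _property."""
    def filter(_object: str, _property: str) -> bool:
        # Phrase the attribute check as a yes/no question for the VQA backend.
        question = f"Is the {_object} {_property}?"
        answer = vqa_model(question)
        return answer.strip().lower().startswith("yes")
    return filter

# Usage with a stub backend that always answers "yes":
filter_fn = make_filter(lambda question: "yes")
assert filter_fn("car", "red") is True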
Open Source Code | No | The paper mentions using a third-party API (gpt-3.5-turbo-1) but does not provide any statement or link indicating that the code for the proposed method is open-sourced.
Open Datasets | Yes | We mainly evaluate our proposed method on the knowledge-based VQA dataset OK-VQA [1] and conduct several supplementary experiments on A-OKVQA [2]. OK-VQA is a commonly used knowledge-based VQA dataset that contains 14,055 image-question pairs associated with 14,031 images from the MSCOCO dataset [46].
Dataset Splits | Yes | We conduct our experiments on the validation set.
Hardware Specification | No | The paper describes the software models and APIs used (e.g., gpt-3.5-turbo-1, Grounding DINO, BLIP2, CLIP) but does not provide any specific details about the hardware (CPU, GPU, cloud instances, etc.) on which the experiments were conducted.
Software Dependencies | Yes | In the experiments, we implement the task scheduler using gpt-3.5-turbo-1. For the vision executor, we implement the spatial functions using Grounding DINO [47], which is an open-set object detector. The attribute functions are implemented by vision-language pre-training models BLIP2 [43] and X-VLM [48]. The representative in-context examples are selected using agglomerative clustering [49] with cosine metrics, and the feature embeddings are extracted by CLIP [41].
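The clustering step described above can be sketched as follows (assumptions: scikit-learn >= 1.2 for the metric argument, a precomputed (N, D) NumPy array of CLIP embeddings, and a centroid-based choice of representatives; the function name and that choice are illustrative, not taken from the paper):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def select_representatives(embeddings, n_examples=16):
    """Cluster candidate examples with a cosine metric and return one
    index per cluster (the member closest to its cluster mean)."""
    # L2-normalize so dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    clustering = AgglomerativeClustering(
        n_clusters=n_examples, metric="cosine", linkage="average"
    )
    labels = clustering.fit_predict(normed)

    representatives = []
    for c in range(n_examples):
        members = np.where(labels == c)[0]
        centroid = normed[members].mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        # Representative = member with highest cosine similarity to the centroid.
        representatives.append(int(members[np.argmax(normed[members] @ centroid)]))
    return representatives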
Experiment Setup | Yes | We prompt the scheduler with 16 in-context examples in the experiment. The representative in-context examples are selected using agglomerative clustering [49] with cosine metrics, and the feature embeddings are extracted by CLIP [41]. In the inference stage, we follow the in-context example selection approach in [11]. For each input data, we compute its cosine similarities with the available examples in the embedding space and select the top k examples with the highest similarities.
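The inference-time selection can be sketched as a top-k cosine-similarity lookup over the same CLIP embedding space (a hypothetical helper, not the authors' implementation; k corresponds to the number of in-context examples placed in the prompt):

import numpy as np

def top_k_examples(query_emb, example_embs, k):
    """Return indices of the k in-context examples whose embeddings are
    most cosine-similar to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    e = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    sims = e @ q                      # cosine similarities, shape (N,)
    return np.argsort(-sims)[:k]      # indices of the top-k most similar examples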