Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios

Authors: Shantanu Jaiswal, Debaditya Roy, Basura Fernando, Cheston Tan

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate IPRM on STAR [79], AGQAv2 [20] and CLEVRER-Humans [51] for video reasoning tasks, and CLEVR-Humans [35], GQA [29] and CLEVR-CoGenT [34] for image reasoning tasks. For all tasks, we set IPRM's parallel operations (Nop) to 6, reasoning steps (T) to 9, reduction ratio (r) to 2 and window length (W) to 2 (informed by ablative analysis detailed in Sec. 3.3). We also performed quantitative ablations to study the individual impacts of parallel and iterative computation, besides qualitative analysis of IPRM's reasoning computation visualizations.
Researcher Affiliation | Collaboration | Shantanu Jaiswal (1,2), Debaditya Roy (2), Basura Fernando (2,3), Cheston Tan (2,3); (1) Carnegie Mellon University, (2) IHPC, A*STAR Singapore, (3) Centre for Frontier AI Research, A*STAR Singapore
Pseudocode | Yes | We provide Python-style pseudocode of IPRM in Figs. 12, 13 and 14.
Open Source Code | Yes | Source code at: https://github.com/shantanuj/IPRM_Iterative_and_Parallel_Reasoning_Mechanism
Open Datasets | Yes | We evaluate IPRM on STAR [79], AGQAv2 [20] and CLEVRER-Humans [51] for video reasoning tasks, and CLEVR-Humans [35], GQA [29] and CLEVR-CoGenT [34] for image reasoning tasks.
Dataset Splits | Yes | For CLEVR-Humans... Each ablation model is first pretrained for 10 epochs on the original CLEVR dataset... and then finetuned on CLEVR-Humans for 40 epochs with early stopping (learning rate of 1e-4 throughout). For CLEVR-CoGenT... we trained our model on condition A for 40 epochs (with early stopping) and used the best cond. A validation performance model to evaluate generalization performance on cond. B. For finetuning on cond. B, we finetuned the best cond. A model for 20 epochs and used the best cond. B validation performance model to also evaluate on cond. A.
Hardware Specification | Yes | All experiments are performed on a single NVIDIA A40 GPU with 46 GB memory and averaged over 3 trials with different random seeds wherever possible.
Software Dependencies | No | We implement IPRM in PyTorch [58] as a generic vision-language module... For CLIP [61], we utilize the official models from HuggingFace [77]. We use the same language encoder (DistilRoBERTa [48] from HuggingFace [78]) as in the existing state-of-the-art MDETR [36], and frozen ResNet101 backbone layer-3 spatial features (as in [28, 52, 35]).
Experiment Setup | Yes | For all experiments, we set the internal dimension of IPRM to 512 and use the same configuration of num. parallel operations (Nop) = 6, num. computation steps (T) = 9, reduction ratio (r) = 2 and window size (W) = 2. Unless otherwise specified, the learning rate is initialized to 1e-4 with the Adam [38] optimizer and a gradient clipping value of 8. The learning rate is reduced on validation-accuracy plateau with reduction factor 0.5, threshold 0.001 and patience 0.
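
The following is a minimal sketch of the training setup described in the Experiment Setup row above. The `IPRM` stub class, its constructor argument names, and the `training_step` helper are illustrative assumptions only; the hyperparameter values, Adam optimizer, gradient-clipping value and LR-plateau settings are taken from the report.

```python
import torch
from torch import nn

class IPRM(nn.Module):
    """Placeholder stub; the actual module is in the released repository."""
    def __init__(self, dim, n_ops, steps, reduction, window):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # stand-in parameters only

    def forward(self, vis_feats, lang_feats):
        return self.proj(vis_feats)

# Reported configuration shared across all tasks.
model = IPRM(dim=512, n_ops=6, steps=9, reduction=2, window=2)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Learning rate reduced when validation accuracy plateaus:
# reduction factor 0.5, threshold 0.001, patience 0.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, threshold=1e-3, patience=0
)

def training_step(vis_feats, lang_feats, targets, loss_fn=nn.CrossEntropyLoss()):
    optimizer.zero_grad()
    loss = loss_fn(model(vis_feats, lang_feats), targets)
    loss.backward()
    # "Gradient clipping value of 8": norm-based clipping is assumed here,
    # since the report does not say whether norm or value clipping is used.
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=8.0)
    optimizer.step()
    return loss.item()

# After each validation epoch: scheduler.step(val_accuracy)
```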
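
The Software Dependencies row names PyTorch, CLIP models from HuggingFace, a DistilRoBERTa language encoder, and a frozen ResNet101 whose layer-3 spatial features are used. The sketch below shows one plausible way to load these components; the specific checkpoint identifiers ("distilroberta-base", "openai/clip-vit-base-patch32", the ImageNet weights tag) are assumptions, as the report does not state them.

```python
import torch
import torchvision
from transformers import AutoModel, AutoTokenizer, CLIPModel, CLIPProcessor

# Language encoder: DistilRoBERTa from HuggingFace (as used by MDETR).
tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")        # assumed checkpoint
lang_encoder = AutoModel.from_pretrained("distilroberta-base")

# CLIP: official models hosted on HuggingFace.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")       # assumed variant
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Frozen ResNet101 backbone; spatial features taken up to layer 3.
resnet = torchvision.models.resnet101(weights="IMAGENET1K_V2")         # assumed weights tag
visual_backbone = torch.nn.Sequential(*list(resnet.children())[:-3])   # conv1 ... layer3
for param in visual_backbone.parameters():
    param.requires_grad = False  # backbone kept frozen
visual_backbone.eval()
```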
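
The Dataset Splits row above describes the CLEVR-Humans and CLEVR-CoGenT schedules; the dictionary below restates them in a hypothetical configuration layout. The structure and keys are assumptions for illustration; the epoch counts, learning rate and early stopping come from the report.

```python
# Hypothetical configuration layout; values taken from the Dataset Splits row.
TRAINING_SCHEDULES = {
    "CLEVR-Humans": [
        {"phase": "pretrain", "data": "CLEVR (original)", "epochs": 10, "lr": 1e-4},
        {"phase": "finetune", "data": "CLEVR-Humans", "epochs": 40, "lr": 1e-4,
         "early_stopping": True},
    ],
    "CLEVR-CoGenT": [
        {"phase": "train", "data": "condition A", "epochs": 40, "lr": 1e-4,
         "early_stopping": True,
         "eval": "best cond. A model evaluated on cond. B (generalization)"},
        {"phase": "finetune", "data": "condition B", "epochs": 20, "lr": 1e-4,
         "early_stopping": True, "init": "best cond. A model",
         "eval": "best cond. B model also evaluated on cond. A"},
    ],
}
```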