Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

ProTo: Program-Guided Transformer for Program-Guided Tasks

Authors: Zelin Zhao, Karan Samel, Binghong Chen, Le Song

NeurIPS 2021 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that ProTo significantly outperforms the previous state-of-the-art methods on the GQA visual reasoning and 2D Minecraft policy-learning datasets. Additionally, ProTo demonstrates better generalization to unseen, complex, and human-written programs. We evaluate ProTo on two tasks, program-guided visual reasoning and program-guided policy learning (corresponding to Figure 1 left and Figure 1 right).
Researcher Affiliation | Collaboration | Zelin Zhao (The Chinese University of Hong Kong); Karan Samel (Georgia Institute of Technology); Binghong Chen (Georgia Institute of Technology); Le Song (BioMap and MBZUAI)
Pseudocode | Yes | Algorithm 1: ProTo Execution
Open Source Code | No | We will release the code and pre-trained models after publishing.
Open Datasets | Yes | We conduct experiments on program-guided visual reasoning based on the public GQA dataset [47], consisting of 22 million questions over 140 thousand images. It is divided into training, validation, and testing splits.
Dataset Splits | Yes | The GQA dataset [47] consists of 22 million questions over 140 thousand images and is divided into training, validation, and testing splits. On the training split, we train a transformer-based seq2seq model [87] to parse a question into a program. For validation and testing, we use this trained seq2seq model to acquire a program from a question. (See the parsing sketch below.)
Hardware Specification | No | The paper does not explicitly state the specific hardware (GPU models, CPU types, or memory) used to run the experiments.
Software Dependencies | No | The optimizer is the BERT Adam optimizer [24] with a base learning rate of 1×10⁻⁴, which is decayed by a factor of 0.5 every epoch. To alleviate over-fitting, we adopt an L2 weight decay of 0.01.
Experiment Setup | Yes | We take N = 50 object features (provided by the GQA dataset) with dimension d = 2048. The optimizer is the BERT Adam optimizer [24] with a base learning rate of 1×10⁻⁴, decayed by a factor of 0.5 every epoch. To alleviate over-fitting, we adopt an L2 weight decay of 0.01. The model is trained for 20 epochs on the training split, and the best model evaluated on the validation split is submitted to the public evaluation server to get testing results. (See the training-setup sketch below.)
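The Dataset Splits row describes a two-stage pipeline: a seq2seq model is trained on the training split to translate questions into programs, and the trained parser then supplies programs at validation and test time. The sketch below illustrates only that interface, using a generic PyTorch encoder-decoder; the vocabulary sizes, special tokens, greedy decoding loop, and all names are illustrative assumptions, not the parser of [87].

```python
import torch
import torch.nn as nn

# Illustrative sizes and special tokens -- assumptions, not values from the paper.
QUESTION_VOCAB, PROGRAM_VOCAB, D = 1000, 200, 256
BOS, EOS = 1, 2


class Seq2SeqParser(nn.Module):
    """Generic transformer encoder-decoder mapping question tokens to program tokens."""

    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(QUESTION_VOCAB, D)
        self.tgt_emb = nn.Embedding(PROGRAM_VOCAB, D)
        self.transformer = nn.Transformer(d_model=D, batch_first=True)
        self.out = nn.Linear(D, PROGRAM_VOCAB)

    def forward(self, src, tgt):
        hidden = self.transformer(self.src_emb(src), self.tgt_emb(tgt))
        return self.out(hidden)  # (batch, tgt_len, PROGRAM_VOCAB) logits


@torch.no_grad()
def parse_question(model, question_ids, max_len=32):
    """Greedy decoding: question token ids -> program token ids."""
    tgt = torch.tensor([[BOS]])
    for _ in range(max_len):
        logits = model(question_ids, tgt)
        next_tok = logits[:, -1].argmax(-1, keepdim=True)
        tgt = torch.cat([tgt, next_tok], dim=1)
        if next_tok.item() == EOS:
            break
    return tgt[0, 1:]  # program fed to the executor at validation/test time


# Example: parse a (dummy) tokenized question into a program.
parser = Seq2SeqParser().eval()
program = parse_question(parser, torch.randint(3, QUESTION_VOCAB, (1, 12)))
```

The key design point recoverable from the paper is that ground-truth programs are only needed for the training split; downstream, ProTo consumes whatever program the parser emits.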
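The Experiment Setup row pins down the reported hyperparameters. Below is a minimal sketch of that configuration in PyTorch, using torch.optim.AdamW as a stand-in for the BERT Adam optimizer [24] (the two differ in warmup and bias-correction details). The model and data are placeholders; only the hyperparameters (lr 1×10⁻⁴, ×0.5 decay per epoch, weight decay 0.01, 20 epochs, N = 50, d = 2048) come from the paper.

```python
import torch

# Reported values: N = 50 object features of dimension d = 2048 (from GQA).
N_OBJECTS, FEATURE_DIM = 50, 2048
model = torch.nn.Linear(FEATURE_DIM, 1)  # stand-in for the ProTo model

# AdamW approximates BERT Adam [24]: base lr 1e-4, L2 weight decay 0.01.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
# "decayed by a factor of 0.5 every epoch"
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

for epoch in range(20):  # "trained for 20 epochs on the training split"
    batch = torch.randn(8, N_OBJECTS, FEATURE_DIM)  # dummy object features
    loss = model(batch).mean()  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # halve the learning rate after each epoch
    # In the paper, the best validation checkpoint is submitted to the
    # public evaluation server for test results.
```

StepLR with step_size=1 and gamma=0.5 implements the "decayed by a factor of 0.5 every epoch" rule literally; per-epoch validation-based checkpoint selection is described in the paper, but its implementation is not.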