Does Visual Pretraining Help End-to-End Reasoning?

Authors: Chen Sun, Calvin Luo, Xingyi Zhou, Anurag Arnab, Cordelia Schmid

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform evaluation on two visual reasoning benchmarks, CATER and ACRE. We observe that pretraining is essential to achieve compositional generalization for end-to-end visual reasoning. Our proposed framework outperforms traditional supervised pretraining, including image classification and explicit object detection, by large margins.
Researcher Affiliation | Collaboration | Chen Sun (Brown University, Google), Calvin Luo (Brown University), Xingyi Zhou (Google Research), Anurag Arnab (Google Research), Cordelia Schmid (Google Research)
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper states only that "Code and checkpoints will be released."
Open Datasets | Yes | We use the unlabeled videos from the training and validation splits of the CATER dataset for pretraining. ... For classification, we consider the same ViT-B visual encoder trained on ImageNet-21K [15] and JFT [52]. ... For object detection, we consider an in-domain object detection benchmark dataset called LA-CATER [49]. ... We validated the correctness of our object detector on the COCO benchmark... We explore generalization to a visually different reasoning benchmark, RAVEN [65]. ... We consider the Something-Else benchmark [43]
Dataset Splits | Yes | For CATER, we evaluate on the static split which has 3,065 training, 768 validation, and 1,645 test examples. ... For ACRE, we explore all three splits, all of which contain 24,000 training, 8,000 validation, and 8,000 test examples.
Hardware Specification | Yes | All experiments are performed on TPU with 32 cores.
Software Dependencies | No | The paper mentions using the Adam and AdamW optimizers but does not specify software versions for libraries, frameworks (e.g., TensorFlow, PyTorch), or programming languages.
Experiment Setup | Yes | We use the Adam optimizer for pretraining with a learning rate of 10^-3, and the AdamW optimizer for transfer learning with a learning rate of 5x10^-5. The pretraining checkpoints are trained from scratch for 1,000 epochs using a batch size of 256. For transfer learning, we finetune the pretrained checkpoints for 500 epochs using a batch size of 512.
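
To make the Experiment Setup row concrete, the sketch below mirrors the reported optimizer configuration: Adam at a learning rate of 10^-3 for 1,000 pretraining epochs with batch size 256, and AdamW at 5x10^-5 for 500 transfer-learning epochs with batch size 512. The paper does not name its framework (see the Software Dependencies row), so PyTorch is assumed here, and `build_optimizer` and the placeholder model are hypothetical illustrations, not the authors' code.

```python
# Hedged sketch of the reported hyperparameters; the paper's framework and
# training code are not released, so PyTorch is an assumption.
import torch

# Hyperparameters as reported in the Experiment Setup row.
PRETRAIN_CFG = {"lr": 1e-3, "epochs": 1000, "batch_size": 256}  # Adam, from scratch
FINETUNE_CFG = {"lr": 5e-5, "epochs": 500, "batch_size": 512}   # AdamW, transfer

def build_optimizer(model: torch.nn.Module, stage: str) -> torch.optim.Optimizer:
    """Return an optimizer matching the reported setup for the given stage."""
    if stage == "pretrain":
        # Self-supervised pretraining: Adam with learning rate 1e-3.
        return torch.optim.Adam(model.parameters(), lr=PRETRAIN_CFG["lr"])
    if stage == "finetune":
        # Transfer learning: AdamW with learning rate 5e-5.
        return torch.optim.AdamW(model.parameters(), lr=FINETUNE_CFG["lr"])
    raise ValueError(f"unknown stage: {stage!r}")

# Example usage with a placeholder module (not the paper's ViT-B encoder):
optimizer = build_optimizer(torch.nn.Linear(8, 2), "finetune")
```

AdamW decouples weight decay from the gradient update, which is the usual reason it is preferred for transformer finetuning; the paper does not report a weight-decay value, so none is set in this sketch.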