Does Visual Pretraining Help End-to-End Reasoning?
Authors: Chen Sun, Calvin Luo, Xingyi Zhou, Anurag Arnab, Cordelia Schmid
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform evaluation on two visual reasoning benchmarks, CATER and ACRE. We observe that pretraining is essential to achieve compositional generalization for end-to-end visual reasoning. Our proposed framework outperforms traditional supervised pretraining, including image classification and explicit object detection, by large margins. |
| Researcher Affiliation | Collaboration | Chen Sun (Brown University, Google); Calvin Luo (Brown University); Xingyi Zhou (Google Research); Anurag Arnab (Google Research); Cordelia Schmid (Google Research) |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states only that "Code and checkpoints will be released."; no repository link is provided. |
| Open Datasets | Yes | We use the unlabeled videos from the training and validation splits of the CATER dataset for pretraining. ... For classification, we consider the same ViT-B visual encoder trained on ImageNet-21K [15] and JFT [52]. ... For object detection, we consider an in-domain object detection benchmark dataset called LA-CATER [49]. ... We validated the correctness of our object detector on the COCO benchmark... We explore generalization to a visually different reasoning benchmark, RAVEN [65]. ... We consider the Something-Else benchmark [43] |
| Dataset Splits | Yes | For CATER, we evaluate on the static split which has 3,065 training, 768 validation, and 1,645 test examples. ... For ACRE, we explore all three splits, all of which contain 24,000 training, 8,000 validation, and 8,000 test examples. |
| Hardware Specification | Yes | All experiments are performed on TPU with 32 cores. |
| Software Dependencies | No | The paper mentions using Adam and AdamW optimizers but does not specify software versions for libraries, frameworks (e.g., TensorFlow, PyTorch), or programming languages. |
| Experiment Setup | Yes | We use the Adam optimizer for pretraining with a learning rate of 10^-3, and the AdamW optimizer for transfer learning with a learning rate of 5 × 10^-5. The pretraining checkpoints are trained from scratch for 1,000 epochs using a batch size of 256. For transfer learning, we finetune the pretrained checkpoints for 500 epochs using a batch size of 512. (A hedged configuration sketch follows the table.) |
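
The paper does not name its training framework (see Software Dependencies above), so the following is a minimal, hypothetical PyTorch sketch that simply records the reported optimizer settings. The `model` placeholder and all identifiers are illustrative assumptions; only the numeric hyperparameter values come from the paper.

```python
# Hypothetical sketch of the reported training hyperparameters.
# PyTorch is used for illustration only; the paper does not state its framework.
import torch
from torch import nn

# Placeholder module standing in for the paper's ViT-B visual encoder.
model = nn.Linear(768, 768)

# Pretraining: Adam, learning rate 10^-3, 1,000 epochs from scratch, batch size 256.
PRETRAIN_EPOCHS, PRETRAIN_BATCH_SIZE = 1_000, 256
pretrain_optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Transfer learning: AdamW, learning rate 5e-5, 500 epochs of finetuning, batch size 512.
FINETUNE_EPOCHS, FINETUNE_BATCH_SIZE = 500, 512
finetune_optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
```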