PTR: A Benchmark for Part-based Conceptual, Relational, and Physical Reasoning
Authors: Yining Hong, Li Yi, Josh Tenenbaum, Antonio Torralba, Chuang Gan
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We examine several state-of-the-art visual reasoning models on this dataset and observe that they still make many surprising mistakes in situations where humans can easily infer the correct answer. We analyze a suite of state-of-the-art visual reasoning models on the PTR dataset and find that they all struggle with it, especially in relational, analogical, and physical reasoning. |
| Researcher Affiliation | Collaboration | Yining Hong (UCLA); Li Yi (Stanford University); Joshua B. Tenenbaum (MIT BCS, CBMM, CSAIL); Antonio Torralba (MIT CSAIL); Chuang Gan (MIT-IBM Watson AI Lab) |
| Pseudocode | No | The paper does not include any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | PTR dataset and baseline models are publicly available. Project page: http://ptr.csail.mit.edu/ |
| Open Datasets | Yes | Therefore, to better serve for part-based conceptual, relational and physical reasoning, we introduce a new large-scale diagnostic visual reasoning dataset named PTR. PTR dataset and baseline models are publicly available. Project page: http://ptr.csail.mit.edu/ |
| Dataset Splits | Yes | PTR includes approximately 52k images for training, 9k for validation and 10k for testing. The images are rendered via Blender. ... PTR contains approximately 520k questions for training, 90k for validation and 100k for testing. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU or CPU models, or memory specifications. |
| Software Dependencies | No | The paper mentions software such as 'Blender' and 'Bullet' but does not provide specific version numbers for these or any other ancillary software components, which are required for reproducibility. |
| Experiment Setup | Yes | Implementation Details: We use an ImageNet-pretrained ResNet-101 to extract 14 × 14 × 1024 feature maps for MAC, MAC(P) and LCGN. For CNN-LSTM, we use the 2048-dimensional feature from the last pooling layer. The setup of MDETR is the same as the original paper with ResNet-101 as backbone. We first train only the task of part detection for 30 epochs, and then train the full PTR with question answering loss. For NS-VQA, we use Mask R-CNN [18] to generate segmentation proposals of objects and parts, respectively. The Mask R-CNN is trained on 20% of the training data annotated with ground-truth masks for 30,000 iterations. We do not include labels of categories and attributes when training segmentation. We extract the categories and attributes of objects and parts using attribute networks (ResNet-34). |
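
The experiment-setup row above quotes the paper's feature-extraction recipe (14 × 14 × 1024 ResNet-101 feature maps for MAC/MAC(P)/LCGN, and a 2048-dimensional pooled feature for CNN-LSTM). Below is a minimal sketch of how such features could be obtained with torchvision; the 224 × 224 input size, the cut after ResNet-101's `layer3` (conv4_x), and the use of torchvision itself are assumptions for illustration, as the paper does not state these details.

```python
# Hedged sketch: extracting 14x14x1024 feature maps from an ImageNet-pretrained
# ResNet-101, as described for the MAC / MAC(P) / LCGN baselines. Input size
# (224x224) and the choice of layer3 as the cut point are assumptions.
import torch
import torch.nn as nn
import torchvision.models as models

resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
resnet.eval()

# Keep everything up to and including layer3 (conv4_x); its output has 1024
# channels and spatial size 14x14 for a 224x224 input.
feature_extractor = nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1, resnet.layer2, resnet.layer3,
)

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)            # placeholder batch of one image
    feats = feature_extractor(image)               # -> (1, 1024, 14, 14)
    pooled = resnet.avgpool(resnet.layer4(feats))  # 2048-d pooled feature, as used for CNN-LSTM
    pooled = torch.flatten(pooled, 1)              # -> (1, 2048)

print(feats.shape, pooled.shape)
```
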