Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning
Authors: Zhenfang Chen, Jiayuan Mao, Jiajun Wu, Kwan-Yee Kenneth Wong, Joshua B. Tenenbaum, Chuang Gan
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate DCL's performance on CLEVRER, a video reasoning benchmark that includes descriptive, explanatory, predictive, and counterfactual reasoning with a uniform language interface. DCL achieves state-of-the-art performance on all question categories and requires no scene supervision such as object properties and collision events. To further examine the grounding accuracy and transferability of the acquired concepts, we introduce two new benchmarks for video-text retrieval and spatial-temporal grounding and localization on the CLEVRER videos, namely CLEVRER-Retrieval and CLEVRER-Grounding. |
| Researcher Affiliation | Collaboration | Zhenfang Chen, The University of Hong Kong; Jiayuan Mao, MIT CSAIL; Jiajun Wu, Stanford University; Kwan-Yee K. Wong, The University of Hong Kong; Joshua B. Tenenbaum, MIT BCS, CBMM, CSAIL; Chuang Gan, MIT-IBM Watson AI Lab |
| Pseudocode | No | The paper describes its methods and algorithms in paragraph text and mathematical formulations (e.g., in Appendix C and D), but it does not provide formal pseudocode blocks or algorithms labeled as such. |
| Open Source Code | No | The paper mentions a "Project page: http://dcl.csail.mit.edu" but does not explicitly state that source code for the described methodology is available at this link or in supplementary materials. |
| Open Datasets | Yes | We evaluate DCL's performance on CLEVRER, a video reasoning benchmark... We further conduct experiments on a real block tower video dataset (Lerer et al., 2016)... |
| Dataset Splits | Yes | Our models for video question answering are trained on the training set, tuned on the validation set, and evaluated in the test set. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions using Adam for optimization and ResNet-34 for feature extraction, but it does not provide specific version numbers for these or other software libraries or dependencies used in the experiments. |
| Experiment Setup | Yes | All models are trained using Adam (Kingma & Ba, 2014) for 20 epochs and the learning rate is set to 10⁻⁴. We evenly sample 32 frames for each video. For the dynamic predictor, we set the time window size w, the propagation step L and dimension of hidden states to be 3, 2 and 512, respectively. The dimension of the word embedding and all the hidden states is set to 300 and 256, respectively. |
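
The experiment-setup row above amounts to a compact hyperparameter configuration. The snippet below is a minimal sketch of how those reported values could be wired up, assuming a PyTorch implementation; the `DCL` module and its constructor arguments are hypothetical placeholders for illustration, not the authors' released code.

```python
# Minimal sketch of the quoted training configuration, assuming PyTorch.
# The DCL class below is a hypothetical placeholder, not the paper's implementation.
import torch
import torch.nn as nn


class DCL(nn.Module):
    """Placeholder standing in for the paper's Dynamic Concept Learner."""

    def __init__(self, hidden_dim=256, word_embed_dim=300,
                 dynamics_hidden_dim=512, time_window=3, propagation_steps=2):
        super().__init__()
        # Hyperparameters as reported in the quoted setup.
        self.config = dict(
            hidden_dim=hidden_dim,                 # hidden state dimension (256)
            word_embed_dim=word_embed_dim,         # word embedding dimension (300)
            dynamics_hidden_dim=dynamics_hidden_dim,  # dynamic predictor hidden size (512)
            time_window=time_window,               # time window size w = 3
            propagation_steps=propagation_steps,   # propagation steps L = 2
        )
        self.proj = nn.Linear(hidden_dim, hidden_dim)  # placeholder parameters to optimize


model = DCL()

# Adam optimizer with learning rate 1e-4, trained for 20 epochs,
# sampling 32 frames per video (training loop itself omitted).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
NUM_EPOCHS = 20
FRAMES_PER_VIDEO = 32
```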