Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning

Authors: Zhenfang Chen, Jiayuan Mao, Jiajun Wu, Kwan-Yee Kenneth Wong, Joshua B. Tenenbaum, Chuang Gan

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate DCL's performance on CLEVRER, a video reasoning benchmark that includes descriptive, explanatory, predictive, and counterfactual reasoning with a uniform language interface. DCL achieves state-of-the-art performance on all question categories and requires no scene supervision such as object properties and collision events. To further examine the grounding accuracy and transferability of the acquired concepts, we introduce two new benchmarks for video-text retrieval and spatial-temporal grounding and localization on the CLEVRER videos, namely CLEVRER-Retrieval and CLEVRER-Grounding.
Researcher Affiliation | Collaboration | Zhenfang Chen (The University of Hong Kong); Jiayuan Mao (MIT CSAIL); Jiajun Wu (Stanford University); Kwan-Yee K. Wong (The University of Hong Kong); Joshua B. Tenenbaum (MIT BCS, CBMM, CSAIL); Chuang Gan (MIT-IBM Watson AI Lab)
Pseudocode | No | The paper describes its methods and algorithms in paragraph text and mathematical formulations (e.g., in Appendix C and D), but it does not provide formal pseudocode blocks or algorithms labeled as such.
Open Source Code | No | The paper mentions a "Project page: http://dcl.csail.mit.edu" but does not explicitly state that source code for the described methodology is available at this link or in supplementary materials.
Open Datasets | Yes | We evaluate DCL's performance on CLEVRER, a video reasoning benchmark... We further conduct experiments on a real block tower video dataset (Lerer et al., 2016)...
Dataset Splits | Yes | Our models for video question answering are trained on the training set, tuned on the validation set, and evaluated on the test set.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions using Adam for optimization and ResNet-34 for feature extraction, but it does not provide specific version numbers for these or other software libraries or dependencies used in the experiments.
Experiment Setup | Yes | All models are trained using Adam (Kingma & Ba, 2014) for 20 epochs and the learning rate is set to 10^-4. We evenly sample 32 frames for each video. For the dynamic predictor, we set the time window size w, the propagation step L, and the dimension of hidden states to 3, 2, and 512, respectively. The dimension of the word embedding and all the hidden states is set to 300 and 256, respectively.
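
As a rough illustration of the reported training configuration only (the paper does not release code, so this is not the authors' implementation), a minimal PyTorch-style sketch is shown below. The hyperparameter values are taken from the quoted setup; DCLModel, the data loader, and the batch structure are hypothetical placeholders.

```python
# Hedged sketch of the quoted experiment setup; model/loader names are hypothetical.
import torch

CONFIG = {
    "epochs": 20,             # "trained using Adam ... for 20 epochs"
    "lr": 1e-4,               # learning rate 10^-4
    "num_frames": 32,         # 32 evenly sampled frames per video
    "time_window": 3,         # dynamic predictor time window size w
    "propagation_steps": 2,   # propagation step L
    "predictor_hidden": 512,  # hidden-state dimension of the dynamic predictor
    "word_embedding": 300,    # word embedding dimension
    "hidden_dim": 256,        # dimension of all other hidden states
}

def train(model, loader, device="cuda"):
    """Placeholder loop wiring up the quoted optimizer and epoch settings."""
    optimizer = torch.optim.Adam(model.parameters(), lr=CONFIG["lr"])
    model.to(device).train()
    for epoch in range(CONFIG["epochs"]):
        for frames, question, answer in loader:  # assumed batch structure
            optimizer.zero_grad()
            loss = model(frames.to(device), question.to(device), answer.to(device))
            loss.backward()
            optimizer.step()
```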