Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language

Authors: Mingyu Ding, Zhenfang Chen, Tao Du, Ping Luo, Josh Tenenbaum, Chuang Gan

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | More accurate dynamics prediction in learned physics models enables state-of-the-art performance on both synthetic and real-world benchmarks while still maintaining high transparency and interpretability; most notably, VRDP improves the accuracy of predictive and counterfactual questions by 4.5% and 11.5% compared to its best counterpart.
Researcher Affiliation | Collaboration | Mingyu Ding (MIT CSAIL and HKU); Zhenfang Chen (MIT-IBM Watson AI Lab); Tao Du (MIT CSAIL); Ping Luo (HKU); Joshua B. Tenenbaum (MIT BCS, CBMM, CSAIL); Chuang Gan (MIT-IBM Watson AI Lab)
Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. The model details and training objectives are described in narrative text and mathematical equations.
Open Source Code | Yes | Project page: http://vrdp.csail.mit.edu/
Open Datasets | Yes | To validate the effectiveness of our method for reasoning about the physical world, we conduct main experiments on the CLEVRER [79] dataset, as it contains both language and physics cues such as rigid body collisions and dynamics... For real-world scenarios, we conduct experiments on the Real-Billiard [63] dataset, which contains three-cushion billiards videos captured in real games for dynamics prediction.
Dataset Splits | Yes | We collect a few-shot physical reasoning dataset with novel language and physical concepts (e.g., "heavier" and "lighter"), termed generalized CLEVRER, containing 100 videos (split into 25/25/50 for train/validation/test) with 375 options in 158 counterfactual questions.
Hardware Specification | No | No specific hardware details (such as GPU or CPU models, memory, or detailed computer specifications) used for running the experiments are provided in the paper.
Software Dependencies | No | The paper mentions using a 'pre-trained Faster R-CNN model [31]' and 'Slot-Attention model [57]' and refers to an 'impulse-based differentiable rigid-body simulator [37, 62, 14]', but does not provide specific version numbers for these software components or any other ancillary software dependencies.
Experiment Setup | Yes | We set t = 0.004s, K = 10, S = 10, and T = 128 for CLEVRER [79] and T = 20 for Real-Billiard [63]. More details of the dataset and settings can be found in Supplemental Materials.
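
The Experiment Setup values quoted above can be recorded as a small configuration object. The following is only a sketch, not code from the paper: the class name `ExperimentSetup` and the constants `CLEVRER` and `REAL_BILLIARD` are hypothetical, the field names simply mirror the symbols in the quoted sentence (whose meanings this summary does not define), and it assumes that t, K, and S are shared across both datasets while only T differs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSetup:
    """Hypothetical container for the hyperparameters quoted in the summary.

    Field names copy the symbols from the quoted sentence; their meanings
    are not defined in this summary, so none are asserted here.
    """

    t: float = 0.004  # seconds, as quoted ("t = 0.004s")
    K: int = 10
    S: int = 10
    T: int = 128      # value quoted for CLEVRER


# Assumption: t, K, and S apply to both datasets; only T is dataset-specific.
CLEVRER = ExperimentSetup()
REAL_BILLIARD = ExperimentSetup(T=20)

if __name__ == "__main__":
    print(CLEVRER)
    print(REAL_BILLIARD)
```

Freezing the dataclass keeps the recorded values immutable, so a run cannot silently change a setting after the configuration has been logged.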