3D-IntPhys: Towards More Generalized 3D-grounded Visual Intuitive Physics under Challenging Scenes

Authors: Haotian Xue, Antonio Torralba, Josh Tenenbaum, Dan Yamins, Yunzhu Li, Hsiao-Yu Tung

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We generate datasets including three challenging scenarios involving fluid, granular materials, and rigid objects in the simulation. We show our model can make long-horizon future predictions by learning from raw images and significantly outperforms models that do not employ an explicit 3D representation space. We also show that, once trained, our model can achieve strong generalization in complex scenarios under extrapolation settings. The experiment section aims to answer the following three questions. (1) How well can the visual inference module capture the content of the environment (i.e., can we use the learned representations to reconstruct the scene)? (2) How well does the proposed framework perform in scenes with objects of complicated physical properties (e.g., fluid, rigid, and granular objects) compared to baselines without explicit 3D representations? (3) How well do the models generalize in extrapolation scenarios?
Researcher Affiliation | Academia | Haotian Xue (Georgia Tech), Antonio Torralba (MIT), Joshua Tenenbaum (MIT), Daniel Yamins (Stanford University), Yunzhu Li (Stanford University, UIUC), Hsiao-Yu Tung (MIT)
Pseudocode | Yes | Algorithm 1: Point-based Dynamics Predictor (a hedged rollout sketch appears after the table)
Open Source Code | Yes | The code is released at https://github.com/xavihart/3D-IntPhys.
Open Datasets | No | We generated three simulated datasets using the physics simulator NVIDIA FleX [39]. Each of the datasets represents one specific kind of manipulation scenario, where a robot arm interacts with rigid, fluid, and granular objects (Figure 3). Our datasets are generated by the NVIDIA FleX simulator.
Dataset Splits | No | For the rest of the four settings, we randomly split them into train and test sets with a ratio of 0.8. We train the perception module using the Adam optimizer with a learning rate of 1e-4, and we reduce the learning rate by 80% when the performance on the validation set has stopped improving for 3 epochs. (A split sketch appears after the table.)
Hardware Specification | Yes | Training the perception module on a single scenario takes around 5 hours on one RTX-3090. It takes around 10-15 hours to train the dynamics model in one environment on a single RTX-3090. This was run in blocks, with block-size=1000, occupying around 4 GB of a V100 GPU.
Software Dependencies | No | The models are implemented in PyTorch.
Experiment Setup | Yes | We train the perception module using the Adam optimizer with a learning rate of 1e-4, and we reduce the learning rate by 80% when the performance on the validation set has stopped improving for 3 epochs. We train the dynamics simulator using the Adam optimizer with a learning rate of 1e-4 and the same learning-rate schedule. The batch size is set to 4. We train the model for 20, 30, and 40 epochs for Fluid Pour, Fluid Cube Shake, and Granular Push, respectively. (A training-setup sketch appears after the table.)
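
The Algorithm 1 referenced in the Pseudocode row is only named here, not reproduced. The following is a minimal, hedged sketch of the kind of autoregressive rollout a point-based dynamics predictor performs; dyn_model, the action tensors, and all shapes are hypothetical placeholders rather than the authors' implementation.

import torch

@torch.no_grad()
def rollout(dyn_model, points_0, actions):
    """Autoregressively roll a learned point-based dynamics model forward.

    dyn_model: learned one-step transition model (hypothetical) mapping
               current particle positions and a control input to next positions.
    points_0:  (N, 3) tensor of initial 3D particle positions.
    actions:   (T, A) tensor of per-step control inputs.
    Returns a list of T predicted (N, 3) particle-position tensors.
    """
    points = points_0
    trajectory = []
    for action in actions:
        points = dyn_model(points, action)  # one-step prediction
        trajectory.append(points)
    return trajectory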
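
For the 0.8 train/test ratio quoted under Dataset Splits, a plausible reading is a random split over simulated trajectories; the 80/20 ratio comes from the paper, while the helper below and its fixed seed are illustrative assumptions.

import random

def split_trajectories(trajectory_ids, train_ratio=0.8, seed=0):
    # Randomly split trajectory indices into train/test sets (80/20 as reported).
    ids = list(trajectory_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_ratio)
    return ids[:n_train], ids[n_train:]

# Example: 1000 simulated rollouts -> 800 train / 200 test.
train_ids, test_ids = split_trajectories(range(1000))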
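
The Experiment Setup row specifies the optimizer, learning-rate schedule, batch size, and per-scenario epoch budgets. Since the models are implemented in PyTorch, the reported schedule maps naturally onto Adam plus ReduceLROnPlateau: factor=0.2 below encodes the quoted "reduce by 80%". The dictionary keys and the surrounding training loop are placeholders, not the released training script.

import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau

def make_optimizer(model):
    # Adam with lr = 1e-4, as reported for both the perception and dynamics modules.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Cut the learning rate by 80% (factor=0.2) when validation performance
    # has not improved for 3 epochs.
    scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.2, patience=3)
    return optimizer, scheduler

BATCH_SIZE = 4  # reported batch size
EPOCHS = {"FluidPour": 20, "FluidCubeShake": 30, "GranularPush": 40}  # per-scenario epochs

# Typical usage at the end of each epoch (val_loss computed on the held-out set):
#   scheduler.step(val_loss)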